The Decision Making Power of Simple Probabilistic Models in IT
Let’s talk about the decision-making power of simple models based on probability theory, with a focus on the IT sector. Some familiarity with IT, a few terms from software engineering or IT security, and some interest in maths will make this article a pleasant read.
Introduction
Birthday Coincidences
Do you remember going to school with two classmates sharing the same birthday? Many know that same birthday coincidences aren’t all that mysterious considering the “Birthday Problem”, which is a probabilistic explanation for what first appears to be a paradox. Impressively, the mathematics behind the birthday problem predicts that the probability of a shared birthday in a group of 23 random people — the size of an average school class in Switzerland — is approximately 50%.
There are many other occurrences at work and in life, however, that aren’t all that remarkable probability-wise either, and that could have been predicted and managed accordingly. What made me write this article is that many people do not seem to know of probability theory, or, if reminded about it, do not want to apply it except for explaining the birthday paradox. For some reason, explaining birthday coincidences with probability theory is comprehensible enough, while using statistics to weigh pros and cons, let alone to make decisions about issues in everyday life, seems outlandish.
Imagine being in a call where someone argues, using a probabilistic model, whether to decide on A or B. Do you remember that ever happening (with the exception of talking to a quantitative analyst perhaps)? Any regular call is more likely to go down this way:
“How is all of this theoretical statistics babble supposed to help us here? We’re not talking about theory here. This is not rocket science. That hasn’t happened before, why should it be of concern to our decision? It happens very rarely. I don’t think that further talk on this is necessary.”
However odd it may appear at first, a statistical rule-of-thumb estimate can, in my experience, be truly valuable, especially when events scale non-linearly. To be clear, this does not require anyone to estimate parameters in order to make decisions. In many cases, simply understanding a behavior based on a probabilistic model is sufficient. What I am talking about is more “qualitative decision-making based on quantitative models” rather than classic “quantitative decision-making”.
Drug Combination Risks
Let me give you a pharmaceutical example of what I mean by “qualitative decision-making based on quantitative models” before we delve into partner dating (!) and IT, especially software.
A large part of the elderly population over 70 years of age is known to take at least 10 different drugs daily. Multiple drugs may negatively interact with each other in the human body. In fact, taking 10 rather than 3 different drugs is about five times more likely to result in a drug interaction: with 10 drugs there are 45 possible pairs of drugs that could interact, compared to only 3 pairs with 3 drugs (see the sketch below). With every additional drug taken, there are more possible combinations for a particular drug to interfere with any other drug in the cocktail. (For the sake of simplicity, we are only considering the interaction of two drugs at a time here.)
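The exact factor depends on how likely any given pair of drugs is to interact. A minimal sketch in Python, assuming (purely for illustration) a 7% interaction probability per pair of drugs:

```python
from math import comb

def p_at_least_one(n, p):
    """Probability that at least one of n independent events occurs."""
    return 1 - (1 - p) ** n

p_pair = 0.07  # assumed probability that any given pair of drugs interacts (illustrative)

for drugs in (3, 10):
    pairs = comb(drugs, 2)  # 3 drugs -> 3 pairs, 10 drugs -> 45 pairs
    print(f"{drugs} drugs, {pairs} pairs: P(at least one interaction) ≈ {p_at_least_one(pairs, p_pair):.0%}")
# 3 drugs,   3 pairs -> ~20%
# 10 drugs, 45 pairs -> ~96% (roughly five times as likely)
```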
Knowing that drug interactions increase substantially when taking more drugs of different kinds allows for better health decisions. In the case of a patient who, relatively speaking, already takes many different drugs, a doctor may decide to advise against taking yet another drug if it is of minor importance. As another example, it may be wise not to have that one glass of wine if you are already on antiallergic drugs during hay fever season and simultaneously undergoing an antibiotics treatment. (In any case, there are dedicated experts for drug interactions whom you or your doctor may wish to consult.) Both of these are qualitative decisions: You don’t calculate, nor do you estimate. You just know intuitively that you should keep the number of different drugs taken at a time as low as possible! Because with every additional drug, the “marginal costs” on your health rise.
The Chances of Getting a Date
I recall having lunch with a bunch of engineers during my studies. We would talk about dating and give each other advice on how to succeed at getting a first date (yes, this is about getting a date in the first place, not about dates evolving into something more). One engineer noted that your strategy does not matter all that much; you just need to make use of the law of large numbers, so to speak:
“If you ask for a date a dozen times, what are the odds of getting at least one date? Let’s assume that there is a chance of 10% that anyone will accept your invitation for a date. With a dozen attempts, you have a chance of over 70% of getting a date, which is pretty high.”
It didn’t take long for said engineer to find a date. The point is that the probability of at least one event turning out successful (or unsuccessful), out of many random events that can be either successful or not, is often surprisingly large — “even a blind chicken sometimes finds a grain of corn”, as the German saying goes. The actual likelihood of at least one thing happening, even if it is usually very unlikely, often goes against our intuition. The probability model said engineer was talking about — regardless of whether it actually works for dating — is the model that we are going to discuss in more detail in the context of IT in the following sections. From birthdays to pharmaceutical drugs and to partner dating odds, we shall finally arrive at a probability model I consider highly applicable to IT-related matters, and software in particular. I will try to show you why with different examples.
(Please do not take the application of the law of large numbers to dating too seriously.)
The Law
Let’s assume that when something happens, it can either turn out to be “good” or “bad”. We call this an “event” and assign both “good” and “bad” a probability: “bad” happens with probability p, whereas “good” happens with probability (1-p). Then, the overall probability P of at least one event out of n independent events turning out “bad” is expressed in the following formula:
P = 1 - (1 - p)^n
In summary, the parameters are as follows:
- n is the number of events taking place.
- p is the probability that an event turns out “bad”; (1-p) is the probability that an event turns out “good”.
- P is the probability that at least one event out of n events turns out “bad”.
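As a minimal Python sketch of this formula, using the dating example from above as a sanity check:

```python
def p_at_least_one(n: int, p: float) -> float:
    """Probability that at least one of n independent events turns out 'bad',
    where each single event is 'bad' with probability p."""
    return 1 - (1 - p) ** n

# The dating example: a dozen attempts, each accepted with a 10% chance.
print(p_at_least_one(n=12, p=0.10))  # ~0.72, i.e. over 70%
```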
Now, what does P look like for different values of n and p? This is illustrated in the following graphic. Yellow means P is close to 0, blue means P is close to 1, and green means a value in between.
What should attract your attention is the large blue area: You are much more likely to find yourself in the blue rather than in the yellow area for a randomly chosen p between 0.2% and 10% and, let’s say, n > 20.
Of course, this is a question of perspective. You can argue that for any p small enough, there is always a yellow area large enough in the lower left corner. Either way, what we can agree on is that for larger n, a relatively small change in p can determine whether you find yourself in the yellow or the blue area (see also the Maths section with a differential analysis). When dealing with a system with a large n and an uncertain p, you might be in trouble even if you didn’t expect it, and things can flip abruptly from “good” to “bad” due to a small change that occurs for whatever reason, or vice versa.
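For readers who want to reproduce such a plot, here is a sketch; the exact ranges and colour map of the original graphic are assumptions on my part:

```python
import numpy as np
import matplotlib.pyplot as plt

n = np.arange(1, 201)              # number of events
p = np.linspace(0.002, 0.10, 200)  # probability per event, 0.2% .. 10%
N, Pn = np.meshgrid(n, p)
P = 1 - (1 - Pn) ** N              # P(at least one "bad" event)

# "viridis_r" runs from yellow (P close to 0) over green to dark blue (P close to 1).
plt.pcolormesh(N, Pn, P, cmap="viridis_r", shading="auto")
plt.colorbar(label="P(at least one bad event)")
plt.xlabel("n (number of events)")
plt.ylabel("p (probability per event)")
plt.show()
```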
In the following sections, I will present various examples that stem from everyday work-life in IT. In most of these cases, the occurrence of at least one event with probability p is something “bad”, which is of course not necessarily the case depending on what this law is applied to.
Software Code Quality
Software code comes with flaws — be it typos, exceptions that were not taken into account, heuristic solutions that don’t fully reflect complexity and are thus flawed, misunderstandings or ever-changing requirements. In fact, software code contains so many flaws that we should expect any section of code to contain a certain percentage of flaws. Whether you write code from scratch, fix or refactor code, you should reckon with flaws introduced by that very code.
If there is a flaw in 1% of all lines of code, you can imagine that for larger sections of code, this can quickly become a big issue. You can easily surmise that for thousands of lines of code, the likelihood of there being at least one flaw in your software is very high. Seeing how software can consist of millions of lines of code, it is crucial to minimize flaws by keeping things simple, decoupled, tested, well-structured and reviewed. Writing clean code means nothing other than trying to reduce the probability of making flaws.
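To put illustrative numbers on this (the 1% flaw rate per line is, of course, just an assumption):

```python
def p_at_least_one(n, p):
    return 1 - (1 - p) ** n

flaw_per_line = 0.01  # assumed: 1% of lines contain a flaw
print(p_at_least_one(100, flaw_per_line))    # ~0.63 for a 100-line change
print(p_at_least_one(1_000, flaw_per_line))  # ~0.99996 for a 1'000-line module
```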
Time to tell you a short story. I once fixed an urgent production issue with a one-liner only to be reminded by the software architect that what I had written was not clean code. He insisted on fixing the issue “properly” himself. After fixing it, he made a self-approved pull request with over 200 lines of code. Note that it is indisputably bad practice to self-approve a pull request, even when you are the architect. After his better fix was deployed, I tested it for him. I found a flaw during testing that we immediately fixed. I tested it again and didn’t find any further flaws. Finally, we deployed the code to production. Soon after deployment, however, we had a production incident as another flaw related to his better fix was discovered.
Though I cannot prove that my one-liner was flawless, the reason I wrote a one-liner in the first place was that I considered the statistical risk of over 200 lines too high for an urgent production incident that needed to be deployed within a matter of days. I am, by the way, a self-proclaimed clean code evangelist, but I believe everything comes with a trade-off. If you’re a software engineer, you may have heard of or even had personal experience with the saying “never touch a running system”. There seems to be a probabilistic reasoning for that…
Code flaws seem to appear randomly. Strictly speaking, however, code flaws are not random events in a statistical sense. Once code is written, flaws already exist in the software even though you may not be aware of them at that point in time. In practice, however, flaws are discovered randomly, that is to say, by coincidence, which is why we might be inclined to assign them a random variable. Whether or not we can apply random variables to code — and if yes, how so — can turn into a deeply theoretical or even philosophical discussion. For the purposes of this post, let’s stay in lane and not go down the rabbit hole.
The simple model of “at least one” on its own may suffice to argue why some teams end up in a deadlock of maintaining software, unable to even deliver the planned workload. At the very least, the model makes a case for favouring software with loose coupling and high cohesion. Adhering to such established architectural principles of IT can effectively mitigate the statistical risk described by the model: Software components should be independent from each other, implementation of one feature should not impact another feature, and, at the same time, duplications should be avoided when appropriate. The more things that can go wrong, the higher the overall likelihood that at least one of them will have a negative impact.
Last but not least, this does not only apply to source code itself, but also to the deployment pipeline, dependencies (see section Vulnerabilities and Dependencies) and any kind of corporate processes, which I will try to demonstrate in the following sections. The more single successes the overall success depends on, the more likely you are to be busy with fixing, maintaining and mitigating those stumbling blocks. A deployment pipeline that depends on four massive frameworks that are all continuously updated, three available running servers and seven “special settings” introduced by eager engineers to have more “buttons to play with” is likely to be unavailable every two weeks or so. Even if it is unlikely that one twist or tweak causes trouble (something I am assured of whenever I ask whether an extra feature or two was really necessary), there are often simply too many twists and tweaks for the whole thing to work reliably.
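A back-of-the-envelope calculation for the pipeline above, with a made-up per-part failure probability, shows how quickly those twists and tweaks add up:

```python
def p_at_least_one(n, p):
    return 1 - (1 - p) ** n

moving_parts = 4 + 3 + 7  # frameworks + servers + "special settings"
p_break_per_day = 0.005   # assumed: each part breaks with 0.5% probability on any given day

p_pipeline_broken = p_at_least_one(moving_parts, p_break_per_day)
print(p_pipeline_broken)      # ~0.068 per day
print(1 / p_pipeline_broken)  # ~15 days, i.e. broken roughly every two weeks
```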
Application Services
Application services, especially the now-prevalent microservices, are prone to outages. Use cases where one service calls another, which in turn calls another and so on, are especially delicate purely by design: The probability that at least one service in the chain is malfunctioning or unavailable is relatively high. That’s why some experts on microservices generally recommend asynchronous over synchronous communication. Once more, the simple probabilistic model can explain why we should avoid chaining too many services in a row and why we would prefer asynchronous over synchronous communication.
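The effect of chaining is easy to quantify under an assumed per-service availability (the 99.9% figure below is illustrative):

```python
def p_at_least_one(n, p):
    return 1 - (1 - p) ** n

p_service_down = 0.001  # assumed: each service is unavailable 0.1% of the time
for chain_length in (3, 10, 30):
    print(chain_length, p_at_least_one(chain_length, p_service_down))
# 3  -> ~0.003 (a synchronous call through 3 services fails ~0.3% of the time)
# 10 -> ~0.010
# 30 -> ~0.030
```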
IoT Systems
IoT systems are more than ever on the rise, and they pose well-known challenges, including, inter alia, security, multi-tenancy, updates and configuration management. Moreover, everyone knows that the quantity of devices makes IoT challenging: More devices may mean another licensing model; more devices means disk space limits on the backend may be hit; more devices means more support; and so on. In many cases, however, we do not consider that the sheer quantity of IoT devices on its own, owing to the probabilistic scaling laws that apply, is what makes them challenging. These laws are not set by your framework vendor’s licensing model but by nothing other than statistics. In fact, if you are not careful, they may become an obstacle greater than your vendor’s licensing model that you cannot escape due to vendor lock-in.
With the use of modern frameworks, scaling an IoT system should in theory not require that much more effort: Rollouts of updates and configuration changes simply happen at a larger scale. You want to update software on 10'000 instead of only 1000 devices? No problem, just select the 10'000 devices, click the “deployment button”, and 10'000 devices instead of 1000 will be updated in an instant. In practice, however, substantially more unexpected issues may have to be dealt with in the scaled-up case.
In order to explain this behavior, we have to take a step back and think about what an IoT device really is on an abstract, but simple level. It is in most cases a relatively small and affordable piece of hardware with a CPU hosting software. To be slightly more precise, the software part can be separated into two subparts: one part consisting of pure source code and a second part being configuration only. All in all, there are three independent parts that make up our IoT device:
- Hardware
- OS and application software
- Configuration
Those three parts can malfunction independently from each other due to various root causes. There might be malfunctioning (i.e. unintended state or behavior) without any impact, and there might be malfunctioning with a certain impact. In the following, we will be talking about the latter case.
Hardware
As a hardware layman, I would say that computing hardware is very reliable. Never have I had a laptop break before 3 years of usage. Vendors are good at mass-producing hundreds, if not thousands, of devices that are all very similar, in some cases identical down to the capacitor. However, hardware does sporadically show flaws and may suddenly become inoperable. For an IoT system, this can be of relevance. Let’s assume one such event happens for (only?) 0.01% of hardware per year. In that case, we note:
p_hardware = 0.01% = 0.0001 (per device and year)
Software
What about software source code (excluding the configuration part)? When running identical software (i.e. with the same source code hash) on two devices, one tends to assume that both behave the same way. As we know from multi-threaded software and its non-deterministic behavior, rather rare flaws may show up. The software may also run in an inconsistent state or not have enough RAM for some reason, which causes the same software to behave differently on different devices. There are many root causes in practice. For the purposes of this analysis, however, we will not differentiate between them. Instead, let’s bundle them together as anything rather unexpected that can occur at runtime and that does not “typically” happen with all other devices running exactly the same software (i.e. with the same source code hash). For example, a device may have run software version 1.0.0 in the past before it was updated to version 1.0.2. Unlike all other devices, however, this one skipped version 1.0.1 because it wasn’t on the network at the time 1.0.1 rolled out. The devices were not tested for this rare case, and the aforementioned device fell into an “unfavorable” state. Let’s assume any such issue happens to 0.02% of software per year. Therefore, we note:
p_software = 0.02% = 0.0002 (per device and year)
Configuration
Configurations that are not embedded in the source code but define further properties about the behavior of the device are another source of error. It is possible for only one device or a few devices to be stuck with a flawed (i.e. unintended) configuration. Perhaps someone changed something manually for one device via SSH and forgot to undo it. Perhaps the device was not properly listed in the management platform and therefore runs an unmanaged configuration. (I think we can agree that there are often various causes for things to happen when they shouldn’t.) Let’s assume that one such issue, regardless of specific cause, occurs to 0.03% of configuration management rollouts per year. Therefore, we note:
p_configuration = 0.03% = 0.0003 (per device and year)
What are the odds?
As a result, the probability that one particular device is in a malfunctioning state is:
p_device = 1 - (1 - p_hardware) · (1 - p_software) · (1 - p_configuration) = 1 - (1 - 0.0001) · (1 - 0.0002) · (1 - 0.0003) ≈ 0.0006
And the probability that at least one device out of 1000 is in a malfunctioning state is:
P = 1 - (1 - 0.0006)^1000 ≈ 45%
And the probability that at least one device out of 10'000 is in a malfunctioning state is:
P = 1 - (1 - 0.0006)^10'000 ≈ 99.8%
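The numbers above can be reproduced with a few lines of Python, using the assumed yearly rates from the previous subsections:

```python
def p_at_least_one(n, p):
    return 1 - (1 - p) ** n

p_hw, p_sw, p_cfg = 0.0001, 0.0002, 0.0003  # assumed yearly rates from above

# A single device malfunctions if at least one of its three parts does.
p_device = 1 - (1 - p_hw) * (1 - p_sw) * (1 - p_cfg)
print(p_device)                          # ~0.0006

print(p_at_least_one(1_000, p_device))   # ~0.45  -> 45% for 1'000 devices
print(p_at_least_one(10_000, p_device))  # ~0.998 -> 99.8% for 10'000 devices
```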
According to the model, issues may suddenly occur with 10'000 devices that have not shown up with 1000 devices within one year. To be clear: The point is not that you would have to expect more than ten times as many issues with 10'000 instead of 1000 devices. Instead, you would have to deal with different, rarer issues that may have never come up before. Why is this relevant? Let’s assume, by way of example, that it takes one unpatched device for hackers to enter your system. If there is even one device that couldn’t be patched due to an issue (regardless of the underlying root cause), your whole network is at risk — a perfect case for the relevance of this model. Because: The risk that your whole network is compromised increases with each added device, even though all devices are “in principle” the same.
In any case, the idea here is not that you make a calculation based on your estimation of the odds for your IoT system. What’s important is that you draw a conclusion regarding best practices when developing and maintaining an IoT system, which surely includes simplicity of configuration management, minimal software dependencies, identical hardware and, as always, clean code. Ensure that your devices do not degenerate into too many variants and runtime states. Try not to offer too many variants and options, or at least ensure that there are no undefined states and that configurations are, mathematically speaking, orthogonal to each other. Keep things as similar as possible, if not identical. That said, even when all these best practices are considered and applied, rare issues will still occur in cases that involve a large number of devices. If you keep the odds of unique flaws as low as possible, however, you will keep your system manageable at a reasonable cost.
Software Update Endpoints
The Problem
I once came across an interesting case where I deemed the probability model of “at least one” event happening useful in practice. This example is about software updates on a device. A device needs to regularly check for and pull updates for itself from various remote endpoints of various vendors (Microsoft, etc.). In total, there were one dozen (=n) endpoints the device needed to have access to in order to automatically update itself 24/7. The device runs in strongly secured corporate IT infrastructure. Firewall settings need to be implemented in a way that allows the device to connect to other entities in the network. These firewall settings must be configured for each of the one dozen endpoints. Therefore, every time a new device is installed, somebody needs to manually whitelist those one dozen endpoints in the firewall. In practice, this means that a particular URL (e.g. https://www.example.com/test), domain name (example.com) or IP address (e.g. 192.168.0.4) has to be whitelisted for each endpoint. Human errors may happen while whitelisting those many endpoints, but, most importantly, whenever the properties associated with an endpoint (URL, domain or IP) change, firewall rules need to be adjusted.
Therefore, we need to ask: How likely is it for endpoint properties to change? I was assured it would be unlikely. Let’s take microsoft.com, a domain that must be very stable and that won’t change for the next decades. At first thought, it seems unlikely that its endpoint properties would change. However:
- Not all of the one dozen URLs the device needs to connect to are comparable to microsoft.com. URLs from companies less renowned than Microsoft are more likely to change.
- The domain microsoft.com has many subdomains (i.e. domains like *.microsoft.com). Firewall rules should ideally be as restrictive as possible. Subdomains, however, are much more likely to change.
- Static IPs sometimes change, even if the URL or domain do not, be it due to restructuring within an organization or for technical and/or political reasons.
- Last but not least, the law of probabilities applies: As soon as at least one thing changes (URL, domain or IP) out of one dozen endpoints, we have a problem and have to adjust firewall rules for each customer, which is quite time consuming.
Better perform a quick analysis to check whether we are actually safe, right?
Analysis
I assume microsoft.com will be stable for the next 30 years (the domain itself isn’t much older than that), but what about *.azure-devices.net, which also belongs to Microsoft? I’d say 10 years is a long time. As for static IPs, I’d expect those one dozen URLs to change once in fewer than 10 years on average. Therefore, for every endpoint, either the URL and/or the static IP can be assumed to change once in 10 years. This translates to a chance of 10% per endpoint per year. Consequently, for once in 10 years we end up with:
P = 1 - (1 - 0.1)^12 ≈ 72% (per year)
And, for those who think 10 years wasn’t conservative enough, for once in 20 years we end up with:
P = 1 - (1 - 0.05)^12 ≈ 46% (per year)
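In code, with the dozen endpoints and the two assumed change rates:

```python
def p_at_least_one(n, p):
    return 1 - (1 - p) ** n

endpoints = 12
print(p_at_least_one(endpoints, 0.10))  # ~0.72 -> assuming one change per endpoint per 10 years
print(p_at_least_one(endpoints, 0.05))  # ~0.46 -> assuming one change per endpoint per 20 years
```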
The Conclusion
In practice, a result of 50% essentially means: “Yes, it can happen. We have to factor it in.” And, well, it turned out that in the last 24 months, this was indeed the case with one endpoint. This makes for a good practical example where n was known exactly, whereas p was difficult to determine.
One-Time-Password Authentication
Here comes an ideal example from the field of security engineering: one-time password (OTP) authentication. Remarkably, it is an example with a precise n and p. Using OTP authentication, users who want to log in to their account are asked to enter a one-time password upon login. Many users open an app that displays their OTP, which they type or copy-paste into the field. In most cases, the OTP code consists of 6 decimal digits. The odds of randomly entering the right OTP code are thus 1 in a million. What are the odds that a malicious attacker who enters a random OTP 1 million times gets it right?
For 1 million attempts (n=1'000'000) we end up with a chance of 63% to break into the organization — a chance worth pursuing. Most internet logins, however, are protected with more or less sophisticated proxy servers that prevent further login attempts after several unsuccessful attempts within a short time window. More importantly, in most cases, logins with OTP are part of two-factor authentication (2FA) login processes, meaning that a first successful login with a standard password is required before the second login step with OTP code verification even appears.
What would a more realistic scenario look like? Let’s assume malicious actors have laid hands on 100 hacked account credentials (e.g. through surveillance footage in an internet café). This means that for 1 million attempts in total, they would need to try to log in 10'000 times per user:
1'000'000 attempts / 100 accounts = 10'000 attempts per account
10'000 attempts in turn can be split into 100 logins per day for 100 days:
10'000 attempts = 100 days × 100 attempts per day
Et voilà: We need 100 stolen credentials, 100 days and 100 attempts per day and per stolen credential. It is an easy task for a software engineer to automate these attempts in a script.
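A short sketch of this scenario in Python:

```python
def p_at_least_one(n, p):
    return 1 - (1 - p) ** n

p_guess  = 1e-6  # 6-digit OTP -> one in a million per guess
accounts = 100   # stolen credentials
days     = 100
per_day  = 100   # guesses per account per day

attempts_per_account = days * per_day             # 10'000
total_attempts = accounts * attempts_per_account  # 1'000'000

print(p_at_least_one(attempts_per_account, p_guess))  # ~0.01 per single account
print(p_at_least_one(total_attempts, p_guess))        # ~0.63 for at least one of the 100 accounts
```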
It appears that this, along with replay attacks and further concerns, is one of the reasons OTP authentication is outdated. Of course, the effort and conditions required to launch such an attack are rather substantial. If you consider all that can be gained by logging in with a common employee’s credentials, without any malware (which is also money- and luck-dependent) and with minimal risk of detection, the scenario is not all that far-fetched from my point of view.
Vulnerabilities & Dependencies
My favourite section! I’m a dependency hawk; one more dependency than necessary and I feel uneasy, to say the least. As you may have guessed from my essay so far, this section is about the complexity of dependency management in terms of software maintenance, but also specifically vulnerability patching. There are two aspects I would like to mention:
1. Quantity of Dependencies
The more dependencies, the more likely it is for one or more of them to cause trouble for an IT solution. More specifically, dependencies are prone to come with security vulnerabilities. Now we must ask: What is the probability that at least one of our dependencies contains a security vulnerability (with impact)? Again, the model of “at least one” indicates that we should avoid unnecessary dependencies whenever possible (see the sketch further below).
2. Nested Dependencies
To delve further into the challenges of dependencies when it comes to software, nested dependencies should be a particular eyesore to security practitioners. Nested dependencies are dependencies that contain at least one other dependency themselves. These nested dependencies can involve more than just one third-party vendor, that is to say, several vendors can be involved in a nested dependency. For example, a dependency Dₐ from a third-party company A may rely on another dependency Dᵦ from third-party company B. If a vulnerability is found in dependency Dᵦ, dependency Dₐ cannot be patched as long as Dᵦ hasn’t been patched! This stands in contrast to non-nested dependencies: Whenever a vulnerability (or bug) turns up, one can simply install an update of that non-nested dependency, or, if an update is not yet available, patch it in the source code oneself. This is illustrated in the following graphic. The dependency in red with a vulnerability/bug can be easily replaced without affecting the others.
In 2021, the Log4Shell vulnerability (CVE-2021-44228) wreaked havoc for many security practitioners during Christmas time. The vulnerability was found in a widely distributed Java logger with the name Log4j. Many software solutions depended (and still depend) on the Log4j dependency. As a result, many companies were potentially vulnerable to hackers’ exploits. More importantly, various publicly downloadable third-party dependencies embedded a Log4j dependency themselves. This means that in some cases, Log4j came as a nested dependency. This in turn meant that patching was not always possible in practice, even when a patched version of the Log4j dependency itself became available! This issue is illustrated in the following graphic. The dependency in red with a vulnerability/bug cannot be replaced easily. Instead, replacing the overall enclosing dependency shown in light blue might be necessary.
You had to wait until third-party libraries updated their dependencies themselves. The probability of at least one third-party supplier not being able and/or willing to deliver on time, or not being able and/or willing to patch at all, is again relatively high if you think of dozens or hundreds of dependencies. The whole dependency chain is not only fragile, but may introduce unpatchable security vulnerabilities. Therefore, we should not only limit the sheer quantity of dependencies, but also the quantity and nesting depth of nested dependencies.
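To make the first aspect concrete with an assumed (made-up) rate of impactful vulnerability disclosures per dependency and year:

```python
def p_at_least_one(n, p):
    return 1 - (1 - p) ** n

p_vuln_per_dep = 0.02  # assumed: 2% chance per dependency per year of an impactful vulnerability
for dependencies in (5, 30, 150):  # direct plus transitive (nested) dependencies
    print(dependencies, p_at_least_one(dependencies, p_vuln_per_dep))
# 5   -> ~0.10
# 30  -> ~0.45
# 150 -> ~0.95
```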
Conclusion
Do we take risks seriously when adding one dependency after another just because we think we might need this and that? A standard library wrapper with more “fancy features”, a faster logger, a pretty-print formatter, a tool that checks another tool, one more protocol stack, an extra script for that extra job in that extra case, or a beta version upgrade because beta is so much cooler: Just because something is easy and often cost-free to do in software doesn’t mean that we should do it without further thought. Every dependency adds complexity and risk, and thus increases the overall complexity and risk of the solution. At times, such additions, all the more so if not carefully considered, come at a (high) price.
Should I really write about supply chain management and logistics without respective experience? No. With respect to the IT sector, however, I would like to underscore the criticality of dependencies in supply chain management and software. In software, dependencies sometimes come at zero cost (open source) and, more importantly, are always available in the cloud (remote repository, e.g. GitHub) from anywhere at any time. As a matter of course, dependencies in software are not really comparable to supply chain dependencies in the physical world. However, dependencies in software bring along massive complexity and concomitant risks. From this point of view, there is some similarity to supply chain management in the physical world. More specifically, there are cases where software dependencies cannot be easily replaced as they are tightly embedded in your solution, may depend on other (nested) dependencies and so on. Dependencies do come at a price. The question is, should we start to think a little more like those in the supply chain and logistics sector and be more risk-aware, if not risk-averse?
Risk Management
Hopefully well-known to risk managers, but probably less known to software engineers, project managers and pre-sales staff, is the fact that a large number of small-probability risks should generally not be swept under the rug.
A common corporate practice is to list risks (e.g. project risks) with their associated impact and probability in a table. I once witnessed a table with a fair quantity of identified risks being simplified — negligently so — to a table containing only the most probable risks. The original table had contained many risks with low probability and a few with high probability, all with the same impact. However, according to the model, removing a dozen low-probability risks can have the same effect as removing one high-probability risk (with the same impact).
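A quick illustration with made-up numbers, assuming all risks carry the same impact:

```python
def p_at_least_one(n, p):
    return 1 - (1 - p) ** n

# A dozen "negligible" 5% risks together are about as likely to materialize
# as one single risk of roughly 46%.
print(p_at_least_one(12, 0.05))  # ~0.46
```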
Audits
Many think that (security) audits exist to ensure a minimum standard of security, and that after passing the audit, they can build upon that minimum standard. This, however, is not entirely correct. An audit can never assess all of the complexity; things to be audited may have been missed, miscommunication may have resulted in a wrong assessment, and so on. Instead, an audit is (only) one of many layers needed to ensure that the solution or process actually fulfils the standards, and can thus be considered “secure” up to a certain degree of confidence. A passed audit is nothing more than one additional factor contributing to the overall confidence.
IT Security (in General)
On a high level, it is said that good IT security is layered like an onion. Every layer offers additional protection, and in theory, the more layers, the better. Again, this is a perfect example of where the model of “at least one” can be applied: As long as at least one layer stays in place (i.e. remains uncompromised), the attacker does not get all the way through. In this case, probability law is something beneficial for us, the “white hats”.
The CIA (Confidentiality, Integrity, Availability) triad exemplifies the pillars of security. In some cases, you might need to make a trade-off. When you have to decide how many cloud service providers you want to work with, you should note that the more service providers, the higher the availability, but also the higher the risk of a breach of confidentiality. Again, this is a case where the probabilistic model of “at least one” can be applied, even though it is hard to assign a probability for confidentiality breach to an individual cloud provider (in contrast to a cloud provider’s availability, which is well-measurable). Because of the model, we know that the risk of a confidentiality breach grows substantially with each additional cloud provider that we rely on. This knowledge helps avoid making fundamentally wrong decisions.
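Both effects can be sketched with assumed (illustrative) probabilities; note that the layers work for the defender, while additional providers work against confidentiality:

```python
def p_at_least_one(n, p):
    return 1 - (1 - p) ** n

# Onion layers: assume each layer can be bypassed with 10% probability.
# The attacker only gets through if every single layer fails.
for layers in (1, 3, 5):
    print(layers, "layers -> breach probability", 0.10 ** layers)

# Multi-cloud: assume each provider suffers a confidentiality breach with 1% probability per year.
for providers in (1, 3, 5):
    print(providers, "providers ->", p_at_least_one(providers, 0.01))
```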
Remark: On an abstract level, you could argue that IT security is a game of probabilities. At this point, I want to quote what Steve Gibson, a renowned security expert, often repeats in his podcast Security Now: “Complexity is the enemy of security.”
A Little Maths
Here comes the fun part for particularly eager readers, though it can be safely skipped if math isn’t your thing. In this section, I’d like to show a small analysis of what happens for a slightly changing n or p (i.e. differential analysis) and, additionally, an uncertain, varying p.
Changing n
The derivative of the resulting probability P with respect to n is:
∂P/∂n = -(1 - p)^n · ln(1 - p)
For very small numbers (n≈1) and high probability (p≈1), a change in n is relevant. However, at this “operating range”, the law isn’t all that useful: It’s powerful when we have “many” (n≫1) events that are “unlikely” (p≈0). Notably, when comparing the magnitude of change for n with the magnitude of change for p, a change in p is much more critical. When changing n, we are usually talking about a change in order of magnitude, for example scaling from n=1000 to n=10'000, and not from n=1000 to n=1001. Such a change in n does indeed have a strong impact on the resulting probability P — rather unsurprisingly.
Changing p
The derivative of the resulting probability P with respect to p is:
∂P/∂p = n · (1 - p)^(n-1)
A change in p is definitely more interesting than one in n. More specifically, for large n and small p, any change in p can result in a massive change of the resulting probability P. In practice, this means that minimal changes in the environment of your system or in your system itself can have tremendous impact on whether you might encounter “rare events” or not.
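A quick numerical check of this sensitivity for n=1000:

```python
def p_at_least_one(n, p):
    return 1 - (1 - p) ** n

n = 1_000
print(p_at_least_one(n, 0.001))  # ~0.63
print(p_at_least_one(n, 0.002))  # ~0.86 -> doubling a tiny p moves P by ~23 percentage points

# The analytical derivative dP/dp = n * (1 - p)^(n - 1), evaluated at p = 0.001:
print(n * (1 - 0.001) ** (n - 1))  # ~368, i.e. P reacts ~368 times as strongly as p
```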
Uncertain p
Last but not least, I’ve taken the time to simulate what happens when p is uncertain. For my simulation, I added Gaussian random noise to p (and ensured it stays well within [0, 1]). In the simulation, we can clearly see that for an uncertain p with Gaussian noise, the blue surface tends to “spread”, making it even larger, especially for large n and small p. The quintessence is: Uncertainty or, more precisely, statistical variance in the random variable’s probability means an even higher probability of occurrence.
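For what it’s worth, here is one way such a simulation could look; the noise level and the way out-of-range values are handled are assumptions on my part:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_at_least_one(n, p):
    return 1 - (1 - p) ** n

n, p_nominal, sigma = 1_000, 0.001, 0.002  # assumed noise level: a rather uncertain p

# Draw noisy values of p and keep only those inside [0, 1].
noisy_p = rng.normal(p_nominal, sigma, size=200_000)
noisy_p = noisy_p[(noisy_p >= 0) & (noisy_p <= 1)]

print(p_at_least_one(n, p_nominal))       # ~0.63 with the nominal p
print(p_at_least_one(n, noisy_p).mean())  # ~0.74 on average with this uncertain p
```

With these particular settings, the average probability is indeed higher than the one obtained from the nominal p alone.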
In some instances, the reason to assign individual (!) random variables might be even more questionable than assigning a group of equally likely random variables in the first place. It is simply very hard to estimate the probability p for one single random variable, especially when you don’t even know how large your n is. Again, I do not want to go down the rabbit hole here, and conclude that uncertainty might need to be taken into account.
Conclusion
Decisions that support best practices in IT should, as a rule, not be based on micro-arguing as to why something could not possibly break. Too often have I heard that something could not possibly happen for this and that reason, and then it happened anyway. You can never be 100% certain. It is important to always keep the whole, abstract picture in mind: Don’t argue whether anything is absolutely reliable for one reason or another. Instead, treat components, processes, products, hardware and so on as entities whose behaviour is subject to statistical random fluctuations, whatever the root cause of such fluctuations may be.
Closing Remarks
Personally, I like to use statistics and probability theory for practical decisions, or simply to figure out the (most likely) truth. For example, Bayesian analyses are a recognized means to support criminal investigations, and in some cases can even help pinpoint the culprit beyond reasonable doubt; more than ever, they are officially recognized tools of state departments and intelligence agencies. As a matter of course, the more complex the models, the more dubious they are when it comes to making accurate predictions in practice. Nevertheless, it needs to be said that some models can be (un)surprisingly robust and accurate!
In the end, I’d like to mention the SARS-CoV-2 pandemic models. They were often (if not always) flawed at the beginning of the pandemic as they failed to predict case numbers, deaths and effective measures. Didn’t they predict that we would run into a problem if we didn’t take countermeasures, though? They failed to predict when we would run into a problem and how serious of a problem it would be (and some scientists even failed to correctly interpret the effectiveness of lockdowns on case numbers after a whole lot of data was available), but at least they predicted we would run into a problem. Of course, if you understand exponential functions, you could have drawn the same conclusion with a pencil and a one-line calculation on a matchbox. In that case, you may consider the consulting fees of some scientists, which can amount to hundreds of dollars per hour, unjustified for telling us just that. Be that as it may, the knowledge that a problem is coming may on its own be enough to make the right decisions. That’s why I deem simple models, such as the “at least one” model, useful.
Author: Christian Dorn