The Decision-Making Power of Simple Probabilistic Models in IT
Let’s talk about the decision-making power of simple models based on probability theory, with a focus on the IT sector. Some familiarity with IT and selected terms from software engineering or IT security, as well as some interest in maths, will make this article a pleasant read.
Do you remember going to school with two classmates sharing the same birthday? Many know that same birthday coincidences aren’t all that mysterious considering the “Birthday Problem”, which is a probabilistic explanation for what first appears to be a paradox. Impressively, the mathematics behind the birthday problem predicts that the probability of a shared birthday in a group of 23 random people — the size of an average school class in Switzerland — is approximately 50%.
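The 50% figure can be checked with a few lines of Python. This is a minimal sketch that assumes 365 equally likely birthdays and independent people; the function name is chosen here for illustration.

```python
def shared_birthday_probability(group_size: int, days: int = 365) -> float:
    """Probability that at least two people in the group share a birthday.

    Assumes birthdays are independent and uniformly spread over `days`
    (no leap years, no seasonal effects).
    """
    if group_size > days:
        return 1.0  # more people than days: a shared birthday is certain
    p_all_distinct = 1.0
    for k in range(group_size):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for size in (10, 23, 30, 50):
        print(size, round(shared_birthday_probability(size), 3))
```

For a group of 23, the result is roughly 0.507.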
There are many other occurrences at work and in life, however, that aren’t all that remarkable probability-wise either, and that could have been predicted and managed accordingly. What made me write this article is that many people do not seem to know of probability theory, or, if reminded about it, do not want to apply it except for explaining the birthday paradox. For some reason, explaining birthday coincidences with probability theory is comprehensible enough, while using statistics to argue pro versus contra, let alone to make decisions regarding issues in everyday life seems outlandish.
However odd it may at first appear, in my personal experience a statistical rule-of-thumb estimate can be truly valuable, especially when events scale non-linearly. To be clear, this does not require anyone to estimate parameters in order to make decisions. In many cases, simply understanding a behavior based on a probabilistic model is sufficient. What I am talking about is more “qualitative decision-making based on quantitative models” rather than classic “quantitative decision-making”.
Let me give you a pharmaceutical example of what I mean by “qualitative decision-making based on quantitative models” before we delve into partner dating (!) and IT, especially software.
A large part of the elderly population over 70 years of age is known to take at least 10 different drugs daily. Multiple drugs may negatively interact with each other in the human body. In fact, with 10 rather than 3 different drugs, there are 15 times as many possible pairs of drugs that could interact (45 versus 3). With every additional drug taken, there are more possible combinations for a particular drug to interfere with any other drug in the cocktail. (For the sake of simplicity, we are only considering the interaction of two drugs at a time here.)
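Counting only pairs, the number of possible two-drug combinations among n different drugs is the binomial coefficient:

```latex
\binom{n}{2} = \frac{n(n-1)}{2}, \qquad \binom{3}{2} = 3, \qquad \binom{10}{2} = 45
```

The count grows roughly with the square of n, which is why each additional drug adds more potential interactions than the previous one did.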
Knowing that drug interactions increase substantially when taking more drugs of different kinds allows for better health decisions. In the case of a patient who, relatively speaking, already takes many different drugs, a doctor may decide to advise against taking yet another drug if it is of minor importance. As for another example, it may be wise not to have that one glass of wine if you are already on antiallergic drugs during hay fever season and simultaneously undergoing an antibiotics treatment. (In any case, there are additional experts for drug interactions that you or your doctor may wish to consult.) Both of these are qualitative decisions: You don’t calculate, nor do you estimate. You just know intuitively that you should keep the number of different drugs at a time as low as possible! Because with every additional drug, the “marginal costs” on your health rise.
I recall having lunch with a bunch of engineers during my studies. We would talk about dating and give each other advice on how to succeed at getting a first date. One engineer noted that your strategy does not matter all that much, as you just need to make use of the law of large numbers. In other words, the strategy itself would be to rely on sheer numbers: “If you ask for a date a dozen times, what are the odds of getting at least one date? Let’s assume that there is a chance of 10% that anyone will accept your invitation for a date. With a dozen attempts, you have a chance of over 70% of getting a date, which is pretty high.”
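Under the engineer’s assumptions (12 independent attempts, each with a 10% chance of success), the number can be checked as follows:

```latex
1 - (1 - 0.1)^{12} = 1 - 0.9^{12} \approx 0.72
```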
It didn’t take long for said engineer to find a date— “even a blind chicken sometimes finds a grain of corn”, as the German saying goes. The point is that the probability of at least one event turning out successful, out of many random events that can be either successful or not, often goes against our intuition.
From birthdays to pharmaceutical drugs and to partner dating odds, we shall finally arrive at a probability model I consider highly applicable to IT-related matters, and software in particular. I will try to show you why with different examples.
Let’s assume that when something happens, it can either turn out to be “good” or “bad”. We call this an event and assign both “good” and “bad” a probability: “bad” happens with probability p; whereas “good” happens with probability (1-p). Then, the overall probability P of at least one event turning out “bad” out of n events in total is expressed in the following, well-known formula:
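Assuming the n events are independent of each other and all share the same p, the formula in question reads:

```latex
P = 1 - (1 - p)^{n}
```

Here, (1-p)^n is the probability that all n events turn out “good”, and P is its complement.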
Now, what does P look like for different values of n and p? This is illustrated in the following graphic. Yellow means P is close to 0 (which is “good”) and blue means P is close to 1 (which is “bad”); green means a value in between 0 and 1.
What should attract your attention is the large blue area: It is much more likely to find yourself in the blue rather than in the yellow area for a randomly chosen p between 0.2% and 10% and, let’s say, n>50.
In the following sections, I will present various examples that stem from everyday work-life in IT. In most of these cases, the occurrence of at least one event with probability p is something “bad”, which is of course not necessarily the case depending on what this law is applied to.
Software Code Quality
Software code comes with flaws — be it typos, exceptions that were not taken into account, heuristic solutions that don’t fully reflect complexity and are thus flawed, misunderstandings or ever-changing requirements. In fact, software code contains so many flaws that we should expect any section of code to contain a certain percentage of flaws. Whether you write code from scratch, fix or refactor code, you should reckon with flaws introduced by that very code.
If there is a flaw in 1% of all lines of code, you can imagine that for larger sections of code, this can quickly become an unmanageable issue. You can easily surmise that for thousands of lines of code, the likelihood of there being at least one flaw in your software is very high. Seeing how software can consist of millions of lines of code, it is crucial to minimize flaws by keeping things simple, decoupled, tested, well-structured and reviewed. Writing clean code means nothing other than trying to reduce the overall likelihood of introducing flaws and, in addition, trying to write code in such a way that if flaws exist, their impact is minimal and local.
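To get a feeling for the numbers, here is a minimal sketch that treats the 1% as an independent per-line flaw probability. That is a simplifying assumption (real flaws cluster), and the function name is chosen here for illustration.

```python
def at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n independent events with probability p occurs."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    flaw_rate_per_line = 0.01  # the 1% from the text, read as a per-line probability
    for lines_of_code in (10, 100, 500, 1000):
        p_flawed = at_least_one(flaw_rate_per_line, lines_of_code)
        print(f"{lines_of_code:>5} lines: P(at least one flaw) = {p_flawed:.5f}")
```

Under these assumptions, a section of 100 lines already contains at least one flaw with a probability of about 63%, and for 1000 lines a flaw is practically certain.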
Code flaws seem to appear randomly. Strictly speaking, however, code flaws are not random events in a statistical sense. Once code is written, flaws already exist in the software even though you may not be aware of them at that point in time. In practice, however, flaws are discovered randomly, that is to say, by coincidence, which is why we might be inclined to assign them a random variable. (Whether or not we can apply random variables to code — and if yes, how so — can turn into a deeply theoretical or even philosophical discussion. For the purposes of this post, let’s stay in lane and not go down the rabbit hole.)
Treating flaws (or their discovery) as a random variable, the model makes a case for favouring software with loose coupling and high cohesion: Software components should be independent from each other, implementation of one feature should not impact another feature, and, at the same time, duplications should be avoided when appropriate. The more things that can go wrong, the higher the overall likelihood that at least something will have a negative impact. The simple model of “at least one” on its own may suffice to argue why some teams end up in a deadlock of maintaining software, unable to even deliver the planned workload. Suddenly relaxing requirements on code quality may only slightly increase p (i.e. the probability of flaws). However, according to the “at least one” model, the impact could be severe for high n (a large amount of coupled code): It could make the difference between software that feels “good” and software that feels like there’s always something popping up that isn’t working.
Application services, especially prevalent micro services, are prone to outages. Use cases where one service calls another one and so on are especially delicate solely due to their design: The probability that at least one service is malfunctioning or unavailable is relatively high. That’s why some experts on micro services generally recommend asynchronous over synchronous communication when dealing with micro services. Once more, the simple probabilistic model can explain why we should avoid chaining too many services in a row and why we would prefer asynchronous over synchronous communication.
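The effect of chaining can be sketched as follows. The 99.9% availability per service is a hypothetical figure, and the sketch assumes that services fail independently and that a synchronous call chain only succeeds if every service in it is up.

```python
import math
from typing import Sequence


def chain_failure_probability(availabilities: Sequence[float]) -> float:
    """Probability that at least one service in a synchronous call chain is down.

    Assumes independent failures; each entry is the probability that the
    corresponding service is up at the time of the call.
    """
    return 1.0 - math.prod(availabilities)


if __name__ == "__main__":
    per_service_availability = 0.999  # hypothetical value
    for chain_length in (1, 5, 10, 50):
        p_fail = chain_failure_probability([per_service_availability] * chain_length)
        print(f"{chain_length:>2} chained services: P(request fails) = {p_fail:.4f}")
```

With these numbers, a chain of 10 services fails about ten times as often as a single service, even though every individual service looks highly available.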
IoT systems are more than ever on the rise. They pose well-known challenges. These include, inter alia, security, multi-tenancy, updates and configuration management. Moreover, everyone knows the quantity of devices makes IoT challenging: More devices possibly means another licensing model; more devices means disk space limits on the backend may be hit; more devices means more support; and so on. In many cases, however, we do not consider that the sheer quantity of IoT devices on its own, owing to the probabilistic scaling laws that apply, is what makes them challenging. These are laws not set by your framework vendor’s licensing model but by nothing other than statistics. In fact, if you are not careful, they may become an obstacle greater than your vendor’s licensing model that you cannot escape due to vendor lock-in.
With the use of modern frameworks, scaling IoT systems should in theory require little more effort than scaling the rollouts of updates and configuration changes. You want to update software on 10'000 instead of only 1000 devices? No problem, just select the 10'000 devices, click the “deployment button”, and 10'000 devices instead of 1000 will be updated in an instant. In practice, however, substantially more unexpected issues may have to be dealt with in the scaled-up case.
In order to explain this behavior, we have to take a step back and think about what an IoT device really is on an abstract, but simple level. In most cases, it is a relatively small and affordable piece of hardware with a CPU that hosts software. To be slightly more precise, the software can be separated into two parts: one consisting of pure source code and one being configuration only. All in all, there are three independent parts that make up our IoT device:
- Hardware
- OS and application software
- Configuration
Those three parts can malfunction independently from each other due to various root causes. There might be malfunctioning (i.e. unintended state or behavior) without any impact, and there might be malfunctioning with a certain impact. In the following, we will be talking about the latter case.
As a hardware layman, I would say that computing hardware is very reliable. Never have I had a laptop break before 3 years of usage. Vendors are good at mass-producing hundreds, if not thousands of devices that are all very similar, in some cases identical, down to the capacitor. However, hardware does sporadically show flaws and may suddenly become inoperable. For an IoT system, this can be of relevance. Let’s assume one such event happens for (only?) 0.01% of hardware per year. In that case, we note:
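Expressed as a probability per device and year, with p_hw as a label chosen here for illustration:

```latex
p_{\text{hw}} = 0.01\,\% = 0.0001
```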
What about software source code (excluding the configuration part)? When running identical software (i.e. with the same source code hash) on two devices, one tends to assume that both behave the same way. As we know about multi-threaded software exhibiting non-deterministic behavior, rather rare flaws may show up. The software may also run in an inconsistent state or not have enough RAM for some reason, which causes the same software to behave differently on different devices. There are many root causes in practice. For the purposes of this analysis, however, we will not differentiate between them. Instead, let’s bundle them together as anything rather unexpected that can occur at runtime and that does not “typically” happen with all other devices running exactly the same software (i.e. with the same source code hash). For example, a device may have run software version 1.0.0 in the past before it was updated to version 1.0.2. Unlike all other devices, however, this one skipped version 1.0.1 because it wasn’t on the network at the time 1.0.1 rolled out. The devices were not tested for this rare case, and the aforementioned device fell into an “unfavorable” state. Let’s assume any such issue happens to 0.02% of software per year. Therefore, we note:
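Expressed as a probability per device and year, with p_sw as a label chosen here for illustration:

```latex
p_{\text{sw}} = 0.02\,\% = 0.0002
```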
Configurations that are not embedded in the source code but define further properties about the behavior of the device are another source of error. It is possible for only one device or a few devices to be stuck with a flawed (i.e. unintended) configuration. Perhaps someone changed something manually for one device via SSH and forgot to undo it. Perhaps the device was not properly listed in the management platform and therefore runs an unmanaged configuration. (I think we can agree that there are often various causes for things to happen when they shouldn’t.) Let’s assume that one such issue, regardless of specific cause, occurs to 0.03% of configuration management rollouts per year. Therefore, we note:
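Expressed as a probability per device and year, with p_conf as a label chosen here for illustration:

```latex
p_{\text{conf}} = 0.03\,\% = 0.0003
```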
What are the odds?
As a result, the probability that one particular device is in a malfunctioning state is:
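Assuming the three parts fail independently, a device works only if hardware, software and configuration all work. With the yearly rates assumed above (0.01%, 0.02% and 0.03%), and p_device as a label chosen here:

```latex
p_{\text{device}} = 1 - (1 - 0.0001)(1 - 0.0002)(1 - 0.0003) \approx 0.0006
```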
And the probability that at least one device out of 1000 is in a malfunctioning state is:
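With a per-device probability of roughly 0.06% per year (to a good approximation, the sum of the three assumed rates), the “at least one” formula gives:

```latex
P_{1000} = 1 - (1 - 0.0006)^{1000} \approx 0.45
```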
And the probability that at least one device out of 10'000 is in a malfunctioning state is:
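With the same per-device probability of roughly 0.06% per year, the “at least one” formula gives:

```latex
P_{10\,000} = 1 - (1 - 0.0006)^{10\,000} \approx 0.9975
```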
According to the model, issues may suddenly occur with 10'000 devices that have not shown up with 1000 devices within one year. To be clear: The point is not that you would have to expect more than ten times as many issues with 10'000 instead of 1000 devices. Instead, you would have to deal with different, rarer issues that may have never come up before. Why is this relevant? Let’s assume, by way of example, that it takes one unpatched device for hackers to enter your system. If there is even one device that couldn’t be patched due to an issue (regardless of the underlying root cause), your whole network is at risk, which makes this a perfect case for the relevance of this model. The risk that your whole network is compromised increases with each added device, even though all devices are “in principle” the same.
In any case, the idea here is not that you make a calculation based on your estimation of the odds for your IoT system. What’s important is that you draw a conclusion regarding best practices when developing and maintaining an IoT system, which surely includes simplicity of configuration management, minimal software dependencies, identical hardware and, as always, clean code. Ensure that your devices do not degenerate into too many variants and runtime states. Try not to offer too many variants and options, or at least ensure that there are no undefined states and that configurations are orthogonal to each other, mathematically speaking. Keep things as similar as possible, if not identical. That said, even when all these best practices are considered and applied, rare issues will still occur in cases that involve a large number of devices. If you keep the odds of unique flaws as low as possible, however, you will keep your system manageable at a reasonable cost.
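The whole calculation for this IoT example can be reproduced in a few lines. The sketch below uses the yearly rates assumed in the text and additionally assumes that parts and devices fail independently of each other; all names are chosen here for illustration.

```python
P_HARDWARE = 0.0001  # 0.01% per year, as assumed in the text
P_SOFTWARE = 0.0002  # 0.02% per year
P_CONFIG = 0.0003    # 0.03% per year


def device_failure_probability(*part_probabilities: float) -> float:
    """Probability that at least one part of a single device malfunctions."""
    p_all_ok = 1.0
    for p in part_probabilities:
        p_all_ok *= 1.0 - p
    return 1.0 - p_all_ok


def fleet_failure_probability(p_device: float, n_devices: int) -> float:
    """Probability that at least one device in the fleet malfunctions."""
    return 1.0 - (1.0 - p_device) ** n_devices


if __name__ == "__main__":
    p_device = device_failure_probability(P_HARDWARE, P_SOFTWARE, P_CONFIG)
    print(f"single device: {p_device:.6f}")
    for n in (1000, 10_000):
        print(f"{n:>6} devices: {fleet_failure_probability(p_device, n):.4f}")
```

Going from 1000 to 10'000 devices moves the probability of at least one malfunctioning device from about 45% to well above 99%.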
Here comes an ideal example from the field of security engineering: one-time password (OTP) authentication. Remarkably, it is an example with a precise n and p. Using OTP authentication, users who want to log in to their account are asked to enter a one-time password upon login. Many users open an app that displays their OTP, which they type or copy and paste into the field. In most cases, the OTP code consists of 6 decimal digits. The odds of randomly entering the right OTP code are thus 1 in a million. What are the odds that a malicious attacker who enters a random OTP 1 million times gets it right?
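With p = 10⁻⁶ per guess and n = 10⁶ guesses, and assuming that every guess is an independent random try that is never blocked, the “at least one” formula gives:

```latex
P = 1 - \left(1 - 10^{-6}\right)^{10^{6}} \approx 1 - e^{-1} \approx 0.632
```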
For 1 million attempts (n=1'000'000) we end up with a chance of 63% to break into the organization — a chance worth pursuing. Most internet logins, however, are protected with more or less sophisticated proxy servers that prevent further login attempts after several unsuccessful attempts within a short time window. More importantly, in most cases, logins with OTP are part of two-factor authentication (2FA) login processes, meaning that a first successful login with a standard password is required before the second login step with OTP code verification even appears.
What would a more realistic scenario look like? Let’s assume malicious actors have laid hands on 100 hacked account credentials (e.g. through surveillance footage in an internet café). This means that for 1 million attempts, they would need to try to log in 10'000 times per user (1'000'000 / 100 = 10'000).
10'000 attempts in turn can be split into 100 logins per day for 100 days (100 × 100 = 10'000).
Et voilà: We need 100 stolen credentials, 100 days and 100 attempts per day and per stolen credential. It is an easy task for a software engineer to automate these attempts in a script.
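The arithmetic of this scenario can be sketched as follows. The sketch assumes that every attempt is an independent random guess against the currently valid 6-digit code and that no lockout or rate limit interferes; the function names are chosen here for illustration.

```python
def otp_guess_probability(digits: int = 6) -> float:
    """Chance of hitting a random decimal OTP of the given length in one try."""
    return 1.0 / 10 ** digits


def attack_success_probability(accounts: int, days: int,
                               attempts_per_day: int, digits: int = 6) -> float:
    """Probability that at least one guess succeeds across all accounts and days."""
    total_attempts = accounts * days * attempts_per_day
    p_single = otp_guess_probability(digits)
    return 1.0 - (1.0 - p_single) ** total_attempts


if __name__ == "__main__":
    p = attack_success_probability(accounts=100, days=100, attempts_per_day=100)
    print(f"P(at least one successful login) = {p:.3f}")
```

Spreading the attempts over accounts and days does not change the total of 1 million guesses, so the result stays at roughly 63%.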
It appears that this, along with replay attacks and further concerns, is one of the reasons OTP authentication is outdated. Of course, the effort and conditions required to launch such an attack are rather substantial. If you consider all that can be gained by logging in with a common employee’s credentials, without any malware (which is also money- and luck-dependent) and with minimal risk of detection, the scenario is not all that far-fetched from my point of view.
Vulnerabilities & Dependencies
My favourite section! I’m a dependency hawk; one more dependency than necessary and I feel uneasy, to say the least. As you may have guessed from my essay so far, this section is about the complexity of dependency management in terms of software maintenance, but also specifically vulnerability patching. There are two aspects I would like to mention:
1. Quantity of Dependencies
The more dependencies, the more likely for one or more to cause trouble for an IT solution. More specifically, dependencies are prone to come with security vulnerabilities. Now we must ask, what is the probability that at least one of our dependencies contains a security vulnerability (with impact)? Again, the model of “at least one” indicates that we should avoid unnecessary dependencies whenever possible.
2. Nested Dependencies
To delve further into the challenges of dependencies when it comes to software, nested dependencies should be a particular eyesore to security practitioners. Nested dependencies are dependencies that contain at least one other dependency themselves. These nested dependencies can come as a construct of more than just one third-party vendor, that is to say, several vendors can be involved in a nested dependency. For example, a dependency Dₐ from a third-party company A may rely on another dependency Dᵦ from third-party company B. In case that a vulnerability is found in dependency Dᵦ, dependency Dₐ cannot be patched as long as Dᵦ hasn’t been patched! This stands in contrast to non-nested dependencies: Whenever a vulnerability (or bug) turns up, one can simply install an update of that non-nested dependency, or, if an update is not yet available, patch it in the source code oneself. This is illustrated in the following graphic. The dependency in red with a vulnerability/bug can be easily replaced without affecting the others.
In 2021, the Log4Shell vulnerability (CVE-2021-44228) wreaked havoc for many security practitioners during Christmas time. The vulnerability was found in a widely distributed Java logger with the name Log4j. Many software solutions depended (and still depend) on the Log4j dependency. As a result, many companies were potentially vulnerable to hackers’ exploits. More importantly, various publicly downloadable third-party dependencies embedded a Log4j dependency themselves. This means that in some cases, Log4j came as a nested dependency. This in turn meant that patching was not always possible in practice, even when a patched version of the Log4j dependency itself became available! This issue is illustrated in the following graphic. The dependency in red with a vulnerability/bug cannot be replaced easily. Instead, replacing the overall enclosing dependency shown in light blue might be necessary.
You had to wait until third-party libraries updated their dependencies themselves. The probability of at least one third-party supplier not being able and/or willing to deliver on time, or not being able and/or willing to patch at all, is again relatively high if you think of dozens or hundreds of dependencies. The whole dependency chain is not only fragile, but may introduce unpatchable security vulnerabilities. Therefore, we should not only avoid the sheer quantity of dependencies, but also the quantity and degree of nested dependencies.
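How quickly nested dependencies inflate n can be illustrated with a small sketch. The dependency graph, the package names and the 5% chance of a late patch per dependency are all hypothetical; the sketch also assumes that suppliers fail to deliver independently of each other.

```python
from typing import Dict, List, Set

# Hypothetical dependency graph: package -> direct dependencies.
DEPENDENCIES: Dict[str, List[str]] = {
    "my-app": ["web-framework", "logger-wrapper", "json-lib"],
    "web-framework": ["http-core", "json-lib", "logging-lib"],
    "logger-wrapper": ["logging-lib"],
    "http-core": ["tls-lib"],
    "json-lib": [],
    "logging-lib": [],
    "tls-lib": [],
}


def transitive_dependencies(package: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Collect all direct and nested dependencies of a package."""
    seen: Set[str] = set()
    stack = list(graph.get(package, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n independent events with probability p occurs."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    p_late_patch = 0.05  # hypothetical chance that a supplier does not patch in time
    direct = DEPENDENCIES["my-app"]
    everything = transitive_dependencies("my-app", DEPENDENCIES)
    print(f"direct only ({len(direct)}): {at_least_one(p_late_patch, len(direct)):.3f}")
    print(f"incl. nested ({len(everything)}): {at_least_one(p_late_patch, len(everything)):.3f}")
```

In this toy graph, three direct dependencies turn into six once nested ones are counted, and the chance that at least one of them is not patched in time rises accordingly.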
Do we take risks seriously when adding one dependency after another just because we think we might need this and that? A standard library wrapper with more “fancy features”, a faster logger, a pretty-print formatter, a tool that checks another tool, one more protocol-stack, an extra script for that extra job in that extra case, or a Beta version upgrade because Beta is so much cooler — Just because something is easy and often cost-free to do in software doesn’t mean that we should do it without further thought. Every dependency adds complexity, is a risk, and thus increases overall complexity and risk. At times, such additions, all the more so if not carefully considered, come at a (high) price.
Should I really write about supply chain management and logistics without respective experience? No. With respect to the IT sector, however, I would like to underscore the criticality of dependencies in supply chain management and software. In software, dependencies sometimes come at zero cost (open source) and, more importantly, are always available in the cloud (remote repository, e.g. GitHub) from anywhere at any time. As a matter of course, dependencies in software are not really comparable to supply chain dependencies in the physical world. However, dependencies in software bring along massive complexity and concomitant risks. From this point of view, there is some similarity to supply chain management in the physical world. More specifically, there are cases where software dependencies cannot be easily replaced as they are tightly embedded in your solution, may depend on other (nested) dependencies and so on. Dependencies do come at a price. The question is, should we start to think a little more like those in the supply chain and logistics sector and be more risk-aware, if not -averse?
On a high level, it is said that good IT security is like an onion: It comes in layers. Every layer offers additional protection. In theory, the more layers, the better. Again, this is a perfect example of where the model of “at least one” can be applied. As long as at least one layer stays in place (i.e. remains uncompromised), the protected asset stays out of reach. In this case, the probability law works in favour of us, the “white hats”.
The CIA (Confidentiality, Integrity, Availability) triad exemplifies the pillars of security. In some cases, you might need to make a trade-off. When you have to decide how many cloud service providers you want to work with, you should note that the more service providers, the higher the availability, but also the higher the risk of a breach of confidentiality. Again, this is a case where the probabilistic model of “at least one” can be applied, even though it is hard to assign a probability for confidentiality breach to an individual cloud provider (in contrast to a cloud provider’s availability, which is well-measurable). Because of the model, we know that the risk of a confidentiality breach grows substantially with each additional cloud provider that we rely on. This knowledge helps avoid making fundamentally wrong decisions.
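The trade-off can be made visible with a small sketch. It assumes that the data is replicated to every provider, that any single provider can serve requests on its own, and that outages and breaches are independent across providers; the per-provider probabilities are hypothetical.

```python
def p_all_unavailable(p_down: float, providers: int) -> float:
    """Probability that every provider is down at the same time (service outage)."""
    return p_down ** providers


def p_any_breached(p_breach: float, providers: int) -> float:
    """Probability that at least one provider suffers a confidentiality breach."""
    return 1.0 - (1.0 - p_breach) ** providers


if __name__ == "__main__":
    p_down = 0.001   # hypothetical: a provider is unavailable 0.1% of the time
    p_breach = 0.01  # hypothetical: yearly breach probability per provider
    for k in (1, 2, 3, 5):
        print(f"{k} provider(s): outage = {p_all_unavailable(p_down, k):.2e}, "
              f"breach = {p_any_breached(p_breach, k):.4f}")
```

Availability benefits from “at least one provider is up”, while confidentiality suffers from “at least one provider is breached”: the same model pulls in opposite directions.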
A Little Maths
Here comes the fun part for particularly eager readers, though it can be safely skipped if math isn’t your thing. In this section, I’d like to show a small analysis of what happens for a changing n (case 1) or p (case 2) and, additionally, an uncertain p (case 3).
Case 1: Changing n
The derivative of resulting probability P with respect to n is:
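Taking P = 1 - (1-p)^n and treating n as a continuous variable (an assumption made here for the sake of differentiation), the derivative works out to:

```latex
\frac{\partial P}{\partial n} = -(1 - p)^{n} \, \ln(1 - p)
```

Since ln(1-p) is negative for 0 < p < 1, this expression is positive: P grows with n.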
For small n (n≈1) and a sizeable p, even a single-step change in n is relevant. However, in this “operating range”, the law isn’t all that useful: It’s powerful when we have “many” (n≫1) events that are “unlikely” (p≈0). Notably, when comparing the magnitude of change for n with the magnitude of change for p, change in p is much more critical. When changing n, we are usually talking about a change in magnitude such as, for example, scaling from n=1000 to n=10'000, and not from n=1000 to n=1001. A magnitude change in n does indeed have a strong impact on the resulting probability P, which is rather unsurprising.
Case 2: Changing p
The derivative of resulting probability P with respect to p is
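Differentiating P = 1 - (1-p)^n with respect to p gives:

```latex
\frac{\partial P}{\partial p} = n \, (1 - p)^{n-1}
```

At p = 0, the slope equals n itself, so the larger n is, the more steeply P reacts to the first small increase in p.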
For large n and small p, any change in p can result in a massive change of the resulting probability P. In practice, this means that minimal changes in the environment of your system or in your system itself can have tremendous impact on whether you might encounter “rare events” or not. Especially, when dealing with a system with a large n and an uncertain p, you might be in trouble even if you didn’t expect it, and things can flip abruptly from “good” to “bad” due to a small change that occurs for whatever reason.
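A short numerical sketch shows how abruptly things can flip. The value n = 10'000 and the sample values of p are hypothetical; the slope is the analytic derivative n(1-p)^(n-1) of the “at least one” formula.

```python
def at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n independent events with probability p occurs."""
    return 1.0 - (1.0 - p) ** n


def slope_in_p(p: float, n: int) -> float:
    """Analytic derivative dP/dp = n * (1 - p)**(n - 1)."""
    return n * (1.0 - p) ** (n - 1)


if __name__ == "__main__":
    n = 10_000  # hypothetical number of events
    for p in (0.00001, 0.00005, 0.0001, 0.0005):
        print(f"p = {p:.5f}  P = {at_least_one(p, n):.3f}  dP/dp = {slope_in_p(p, n):.0f}")
```

In this example, P moves from below 10% to above 99% while p never exceeds 0.05%.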
Case 3: Uncertain p
Last but not least, I’ve taken the time to simulate what happens when p is uncertain. This means nothing other than that the model assumes there are various factors that each have a slightly different p. For the simulation, I added Gaussian random noise to p (and ensured it stays within [0,1]). In the simulation, we can clearly see that for uncertain p with Gaussian noise, the blue surface tends to “spread” to the left, making it larger; the effect is especially significant for large n and small p. The quintessence is: Additional uncertainty in the random variable’s probability itself means an even stronger impact on the “at least one” model’s total probability P.
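One possible way to set up such a simulation is sketched below. The details of the original simulation are not given, so everything here is an assumption: each of the n events gets its own p, drawn from a Gaussian around the nominal value and clipped to [0,1], and the values for p, n and the noise level are hypothetical.

```python
import random


def at_least_one_noisy(p: float, n: int, sigma: float, rng: random.Random) -> float:
    """P(at least one "bad" event) when each event has its own noisy probability.

    Each p_i is drawn as p plus Gaussian noise with standard deviation sigma
    and then clipped to the valid range [0, 1].
    """
    p_none = 1.0
    for _ in range(n):
        p_i = min(1.0, max(0.0, rng.gauss(p, sigma)))
        p_none *= 1.0 - p_i
    return 1.0 - p_none


def mean_noisy(p: float, n: int, sigma: float, trials: int = 200, seed: int = 0) -> float:
    """Average of the noisy result over several simulation runs."""
    rng = random.Random(seed)
    return sum(at_least_one_noisy(p, n, sigma, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    p, n = 0.001, 1000  # hypothetical nominal probability and number of events
    print(f"no noise:      {1.0 - (1.0 - p) ** n:.3f}")
    for sigma in (0.001, 0.005):
        print(f"sigma = {sigma}: {mean_noisy(p, n, sigma):.3f}")
```

In this particular setup, much of the increase for small p stems from the clipping: negative draws are cut off at 0, which raises the average p above its nominal value.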
Decisions that support best practices in IT should, as a rule, not be based on micro-arguing as to why something could not possibly break. Too often have I heard that something could not possibly happen for this and that reason, and then it happened anyway. You usually can’t be 100% certain. It is advisable to keep the whole, abstract picture in mind: Treat components, processes, products, hardware and so on as entities whose behaviour is subject to statistical random fluctuations, whatever the root cause of such fluctuations may be.
Bayesian analyses are a recognized means to support criminal investigations, and in some cases even pinpoint the culprit beyond reasonable doubt. More than ever, they are officially recognized tools of state departments and intelligence agencies. It needs to be said that some models have proven to be surprisingly robust and accurate. I’d also like to mention the SARS-CoV-2 pandemic models. They turned out to be often (if not always) flawed as they failed to predict case numbers, deaths and effective measures. However, didn’t they predict that we would run into a problem if we didn’t take countermeasures? They failed to predict when we would run into a problem and how serious a problem it would be, and some scientists even failed to correctly interpret the effectiveness of lockdowns on case numbers after a whole lot of data was available, but at least they predicted we would run into a problem. Of course, if you understand exponential functions, you could have drawn the same conclusion with a pencil and a one-line calculation on a matchbox. In that case, you may consider the consulting fees of some experts, which can amount to hundreds of dollars per hour, unjustified for telling us just that. Similarly, you won’t need a financial expert to tell you that the U.S. total debt to GDP ratio isn’t sustainable at all, and that some day the whole thing will blow up. Be that as it may, the mere knowledge that a problem is coming may be enough to make the right decisions. That’s why I deem simple models, such as the “at least one” model, useful.
Author: Christian Dorn