Software Testing: How much is enough?
At a recent job interview, a colleague of mine posed the following hypothetical to the candidate: if a software bug were only reproducible five percent of the time, how many tests would be required to prove it was fixed?
Having just recently read Nate Silver’s “The Signal and the Noise”, I recognised this as a problem into which Bayes’ theorem could provide some insight. As a result I decided to analyse the issue.
Bayes’ theorem is a fantastic mathematical tool that we can use to adjust our beliefs, given evidence from the world. In abstract terms, we start with a belief in how some aspect of the world works (a hypothesis). Then some events will happen which will either support or contradict our hypothesis. Given the relative probabilities of these events, Bayes’ theorem can help us compute how our belief in the hypothesis should change.
So for our example, we will assume that the software developers have analysed the issue and are proposing a solution. How many times should the Quality Assurance engineer test the software so as to have a high degree of confidence that the issue is in fact fixed?
Before we run the test, we’ll need to establish a Prior Probability, which we will denote as x. This is an estimate of how likely it is that the issue still exists. Our tester figures that the issue is either resolved, or not, so decides to estimate the prior probability at 50%.
A software test is undertaken and it is passed. The engineer then needs to estimate the Likelihood (y) of a test passing, given the issue remains. In our case the bug is only reproducible 5% of the time, so the likelihood of the test passing would be 95% given the issue remains.
We also need to estimate the likelihood of the test passing, given that the issue is fixed (z). The QA engineer has supreme confidence in their abilities, so they estimate that if the issue is fixed, the test will pass 100% of the time.
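Bayes’ theorem allows us to calculate a Posterior Probability, given our prior belief in the hypothesis and the observed event. This may be written as P(H|E), or the probability of the hypothesis being true, given the new evidence. Here H is the hypothesis that the issue remains and E is the passing test, so in terms of x, y and z the formula is:
$$P(H \mid E) = \frac{xy}{xy + z(1 - x)}$$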
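In this case, we can plug in the numbers (x = 0.5, y = 0.95, z = 1) to get:
$$P(H \mid E) = \frac{0.5 \times 0.95}{0.5 \times 0.95 + 1 \times (1 - 0.5)} = \frac{0.475}{0.975} \approx 0.4872$$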
The passing test contradicts the hypothesis that the issue still remains, so our confidence in that hypothesis diminishes slightly. Instead of believing that the issue is 50% likely to remain, we can now update our belief to 48.72%.
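The single update described above can be sketched in a few lines of Python. The function name `posterior` is just an illustrative choice; the arguments x, y and z are the prior and the two likelihoods as defined earlier.

```python
def posterior(x: float, y: float, z: float) -> float:
    """Probability that the bug remains after one passing test.

    x: prior probability that the bug remains
    y: likelihood of the test passing if the bug remains
    z: likelihood of the test passing if the bug is fixed
    """
    return (x * y) / (x * y + z * (1 - x))


# 50% prior, bug reproducible 5% of the time, perfect test when fixed.
print(f"{posterior(0.5, 0.95, 1.0):.2%}")  # 48.72%
```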
Intuitively, this matches our expectations. If a software test passes it adds evidence that the issue is resolved, but we must run more tests to be sure. If subsequent tests continue to pass, then we can use the same method to continually update our belief in the hypothesis, with each posterior becoming the prior for the next test. As further tests pass, our confidence in the hypothesis will continue to diminish, and after 10 passing tests, our confidence in the bug remaining would have dropped to 37.45%.
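As a rough sketch of this repeated updating, the posterior from one passing test can be fed back in as the prior for the next. The helper names here are illustrative only, and the sketch assumes every test passes.

```python
def posterior(x: float, y: float, z: float) -> float:
    """Probability that the bug remains after one passing test."""
    return (x * y) / (x * y + z * (1 - x))


def after_passes(prior: float, y: float, z: float, n: int) -> float:
    """Belief that the bug remains after n consecutive passing tests."""
    belief = prior
    for _ in range(n):
        # Yesterday's posterior is today's prior.
        belief = posterior(belief, y, z)
    return belief


print(f"{after_passes(0.5, 0.95, 1.0, 10):.2%}")  # 37.45%
```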
Let’s also assume that we’re working for a very quality-conscious organisation. We’re not going to release the software until our confidence in the bug remaining is 1% or less. How many tests would we need?
Continuing the analysis would show that under these conditions (assuming an initial prior probability of 50%) we would need 90 tests for our confidence to drop to 0.98%.
One feature of Bayes’ theorem is that our initial prior probability estimate strongly affects the analysis. For example, if the QA engineer were highly suspicious of the quality of the software developers’ work, she might set her initial expectation of the bug remaining to 90%. If she did, she would have to carry out 133 tests to convince herself that the issue was fixed (confidence in the bug remaining would then stand at 0.97%).
Alternatively, if the QA engineer had a high degree of confidence in the developers’ work, she might set an initial expectation of the bug remaining at 10%. In this case 47 successful tests would be required to meet the release standard.
Again, this matches our intuition. The more strongly we hold a belief, the more contrary evidence we need before we adjust it. The more weakly we hold a belief, the less contrary evidence we need to convince ourselves of the new world view.
Luckily, this is a somewhat extreme example. Most software bugs are readily apparent in testing, so far fewer tests are needed to establish their resolution. If we were to use a more realistic example of an issue that is 95% reproducible, with an initial prior probability of 50%, then after the first test passed our confidence in the bug remaining would stand at 4.76%, and after the second test passed, at 0.25%.
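To see how quickly the belief collapses for an easily reproduced bug, the same update can be run with a reproduction rate of 95%. This is a sketch under the same assumption as before, namely that a fixed bug always passes the test; `beliefs_after_passes` is a hypothetical helper.

```python
def beliefs_after_passes(prior: float, repro_rate: float, n: int) -> list[float]:
    """Belief that the bug remains after each of n passing tests (z = 1)."""
    y = 1 - repro_rate  # chance of a pass even though the bug remains
    beliefs = []
    belief = prior
    for _ in range(n):
        belief = belief * y / (belief * y + (1 - belief))
        beliefs.append(belief)
    return beliefs


for i, b in enumerate(beliefs_after_passes(0.5, 0.95, 2), start=1):
    print(f"after test {i}: {b:.2%}")
# after test 1: 4.76%
# after test 2: 0.25%
```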
So to answer the initial question of how many tests are necessary, the answer would seem to hinge on how pessimistic your QA engineers are, and how reproducible the issue is.