Re: Software Design Resources (in thread: Software Design Resources)
Within a month it was shown that all of the test cases produced by programmers and testers, targeting their efforts according to their (or their superiors') best judgment, had covered less than 15% of the total APIs (with 10% having been duplicated over and over), exercised only 5% of the possible parameter combinations, and found less than 1% of the possible bugs.
According to the limited description, the repeated 10% "duplication" was probably due to "best judgement" bias. If random sampling is used instead, it's almost impossible to have such duplication across testers and over time (though you will still have duplications "locally").
"Best judgement" bias is like trying to estimate the total number of Perl programmers in the world by asking people what language they use in Perl websites alone.
The "1% of the possible bugs" estimate was possibly derived from something like the catchandrelease method I mentioned in my another reply in this thread.
The the unprobabilistic approach of "best judgement" and the probabilistic estimate of "1% of the possible bugs" seem strange to occur together, however.
* * *
In case "probability" may sound like a voodoo, consider this: In a population of 100, if all are ages 25, what sample size do you need to come up with a 100% confidence estimate of the population average age? One, of course.
If 90 of them are age 30 and 10 are age 20 (average 29), a random sample of size one gives you an "average" of 30 90% of the time. Pretty good "estimate", actually, considering you don't have to ask all 100 of them.
The worst case (in terms of sample size needed) is a 50/50 split.
So, a population of one million, all aged 25, needs only a sample of size one to get a good (perfect, in this case) estimate, whereas a population of 30, with 10 aged 30, 10 aged 20, and 10 aged 5, needs a larger sample size.
The moral: the quality of a statistical estimate is affected by the heterogeneity of the population, not its size. It's very counterintuitive to many people, I know.
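To see the moral in action, here's a small sketch (all populations invented for illustration) comparing the spread of a sample-mean estimate for a huge homogeneous population against a tiny heterogeneous one:

```python
import random
import statistics

random.seed(42)

# A million people, all age 25, versus thirty people of three ages.
huge_uniform = [25] * 1_000_000
tiny_mixed = [30] * 10 + [20] * 10 + [5] * 10

def estimate_spread(population, sample_size, trials=1000):
    """Standard deviation of the sample-mean estimate over many trials."""
    estimates = [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(trials)
    ]
    return statistics.stdev(estimates)

print(estimate_spread(huge_uniform, 1))  # 0.0 -- one answer tells all
print(estimate_spread(tiny_mixed, 5))    # several years of spread
```

Despite being a million times larger, the homogeneous population is pinned down exactly by a single sample; the thirty-person mixed population is not.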
Re: Re: Re: Software Design Resources by BrowserUk (Pope) on Aug 22, 2003 at 08:42 UTC 
Sorry. I don't think that I made that bit very clear. The statistics I gave were for the coverage achieved by the teams of test case writers prior to the introduction of the Random Testcase Generator. That is to say, there were a bunch of coders charged with the task of sitting down and writing programs to exercise given subsets of the APIs. They used their 'best judgement' to write the programs such that they covered the edge cases of each individual function and combination of functions.
The 15% was a purely mathematical count of the APIs exercised, derived by simply grep'ing and counting them from the assembled test suite.
The 10% duplication meant that of the 15% that had actually been exercised, two thirds had been exercised in more than one testcase. For some parts of the API set this is inevitable. You can't do anything when testing a GUI API set without having called CreateWindow(), for example, but this did not explain all the duplication.
Much of it came down to the fact that given any two programmers with similar experience, their best judgement, based on their prior experiences, will lead them to similar conclusions about what needs testing. Hence, they will tend towards testing similar things. Even though they are each assigned a particular set of APIs to test, it's inevitable that there will be some overlap. Given a team of nearly 100 programmers from different backgrounds, you would think that their ranges of experience would lead to fairly wide coverage, but it doesn't happen that way. They all tend to concentrate their efforts on similar clusters of "suspect" APIs. Worse, they all tend to assume that some APIs are not necessary to test, for similar reasons.
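A toy simulation (not the project described above, all numbers invented) shows how shared judgement inflates overlap: two testers each pick 10 of 100 hypothetical APIs. Sampling uniformly, they rarely collide; steered toward the same 20 "suspect" APIs, they collide constantly.

```python
import random

random.seed(1)

ALL_APIS = range(100)       # the full hypothetical API set
SUSPECT_APIS = range(20)    # the APIs everyone "knows" are risky

def average_overlap(pool, picks=10, trials=1000):
    """Average number of APIs chosen by both of two independent testers."""
    total = 0
    for _ in range(trials):
        a = set(random.sample(pool, picks))
        b = set(random.sample(pool, picks))
        total += len(a & b)
    return total / trials

print(average_overlap(ALL_APIS))      # about 1 duplicate on average
print(average_overlap(SUSPECT_APIS))  # about 5 duplicates on average
```

The expected overlap is picks²/pool-size, so narrowing the pool from 100 to 20 quintuples the duplication even though each tester still chooses "randomly" within their shared prejudice.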
As for the 1% of possible bugs: the bit that I consider tantamount to voodoo is the determination of the number of possible bugs. In order to state that "only 1% had been found", it is necessary to know both how many were found and how many could have been found. How do you begin to determine how many there could be?
I fully understand the mechanism whereby it is possible to estimate how many bugs will be found on the basis of how many have been found, and projecting that forward, once the test cases are being produced randomly. This is fairly simple population sampling, standard deviation stuff. You only need to know that the sample is a truly random selection from the total population. You don't need to know the total population size.
But the previous two statistics went solely to prove that the generation of those testcases was anything but random. To conclude from the deterministic count of bugs found that 1% of the possible bugs had been discovered means that they had to have determined, or at least estimated to some degree of accuracy, the total possible bug count.
I have a good degree of faith in the guys doing the work, and I was treated to nearly four hours of explanation of the methodology involved. The program that produced that statistic ran on a top-of-the-range S/370 quad processor system and consumed prodigious amounts of CPU time. The datasets were not very large.
It involved an iterative process of refining a complex polynomial with an infinite number of terms until it approximated the discovery rates and coverage that had been determined by counting. Once the polynomial in question had been refined until it closely approximated the real-world statistics it was developed to model, it was iterated forward to project to a point in the future where no more bugs would be discovered. In real time this would have amounted to decades or maybe centuries. Once that point was reached, they had the estimate of the number of bugs that could be discovered, and it was this figure that was used to calculate the 1% figure.
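A deliberately simplified sketch of the idea (their actual model was far more elaborate, and the per-week counts below are invented): fit a decay rate to the observed per-period discoveries, then iterate the model forward until discoveries dry up.

```python
# Bugs found in weeks 1..3 of testing (invented for illustration).
observed = [100, 80, 64]

# Assume each week uncovers a constant fraction of the previous
# week's count: a crude geometric model fitted from the data.
ratio = observed[-1] / observed[-2]      # 0.8

# Iterate forward until the model predicts < 1 new bug per week.
total = sum(observed)
rate = observed[-1] * ratio
while rate >= 1.0:
    total += rate
    rate *= ratio

# Closed-form check: a geometric series sums to first / (1 - ratio).
projected_total = observed[0] / (1 - ratio)
print(round(total), round(projected_total))   # both close to 500
```

With a projected total of roughly 500 and, say, 244 found so far, you'd report that about half the "possible" bugs remain undiscovered; the 1% figure in the story came from the same kind of forward projection, only vastly more sophisticated.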
Believe me, this went way beyond the graduate-level statistics with which I was familiar at that time, though I have since forgotten much of it.
I'm going to stick to my guns and say that this was the deepest statistical voodoo that I have any wish to know about :)
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.
I think I'm the one who's not being clear, not you.
I suppose we're both talking about "best judgement" introducing bias.
I fully understand the mechanism whereby it is possible to estimate how many bugs will be found on the basis of how many have been found, and projecting that forward, once the test cases are being produced randomly.
Perhaps the fuzziness of human language gets in the way here. Any estimate is an estimate of how many could be found, never of how many will be found. To see that, I'll use the catch-and-release example.
Suppose the total number (the actual T) of unknown bugs is actually 100. Tester One finds 20 (A) bugs; Tester Two also finds 20 (B); 2 (C) bugs are found in common. The estimate is A×B/C = 200 total (possible) bugs (notice the large margin of error). Does it mean you will find 200 bugs given infinite time? Of course not, since we already know that there are 100 actual bugs. The estimate is 200, nevertheless. 200 is the possible total of bugs you could find, based on the actual counts available at the moment.
The technique and the skillset will affect the accuracy of an estimate, but the principle is still the same.
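The catch-and-release arithmetic above is the classic Lincoln-Petersen capture-recapture estimator; with the thread's numbers it reproduces the 200 figure:

```python
def lincoln_petersen(a, b, c):
    """Estimate a total population from two overlapping samples:
    a = items in sample one, b = items in sample two,
    c = items common to both. Estimate = a*b/c."""
    if c == 0:
        raise ValueError("no overlap: the estimate is unbounded")
    return a * b / c

print(lincoln_petersen(20, 20, 2))   # -> 200.0
```

The estimator's logic: if the second sample recaptures c/b of the first sample, then the first sample is presumably c/b of the whole population, so T ≈ a·b/c.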
* * *
One side note, not to critique their method but to provide complementary information: one should be careful when using a polynomial to fit data. By the Weierstrass approximation theorem, polynomials of high enough degree can fit any continuous function arbitrarily well; similarly, they can fit any data, including white noise.
Consider that you're testing the response time of your server under various levels of workload. You try a linear fit (a straight line) and a polynomial of degree two (a+bx+cx^2). The polynomial fits the data better, and you have the following.
    |              X X
    |           .X     *
    |          .X       *
    |        .X          *
    |      .X
    |    .X               *
    +---------------------------- workload
    (response time on the vertical axis)
    . : data points
    X : fitted to actual data
    * : prediction (extrapolation)
But it doesn't fit common sense: the extrapolated curve predicts that response time improves as the workload increases. This kind of error is very hard to detect in higher dimensions, especially when you don't actually know what to expect.
The moral: A more complicated model does not always improve your prediction; it could even worsen it in some cases.
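Here's a minimal sketch of that trap. The "measurements" are three points on a concave, increasing response-time curve (y = sqrt(x), purely illustrative); the unique parabola through them fits the data exactly, yet extrapolates into nonsense:

```python
def parabola(points, x):
    """Evaluate the unique quadratic through three points (Lagrange form)."""
    (x1, y1), (x2, y2), (x3, y3) = points
    return (y1 * (x - x2) * (x - x3) / ((x1 - x2) * (x1 - x3))
          + y2 * (x - x1) * (x - x3) / ((x2 - x1) * (x2 - x3))
          + y3 * (x - x1) * (x - x2) / ((x3 - x1) * (x3 - x2)))

data = [(1, 1.0), (4, 2.0), (9, 3.0)]    # workload -> response time

print(parabola(data, 9))    # 3.0  -- a perfect fit inside the data
print(parabola(data, 36))   # -6.0 -- a *negative* response time!
```

A humble straight line would fit the three points less exactly, but at least it would keep predicting that more load means more response time.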
Fair enough :) I don't have the math to argue with you on this.
However, I would also not take it upon myself to argue with a certain IBM statistician whose work was the basis of at least some (I think, fairly major) elements of the statistics used in the process I am describing.
Sorry for the second post, but I thought about this some more and I wanted to get your reaction to those thoughts. If I had just modified the last post, you might not have seen the update.
The estimate is 200 total (possible) bugs (notice the large margin of error). (and the rest of that paragraph)
I am under the strong, and I believe well-founded, impression that in order for your probability calculation to make sense, the sample(s) used to estimate the total population are required to be random samples. This would not be the case if the testcases the programmers produce are written on the basis of experience (or best judgement).
If programmers A & B both write 20 identical test cases, which is unlikely, but not statistically impossible, then counting them as unique invalidates the statistics.
If the testcases they produce cover only 1% of the possible test cases and detect 2 bugs, there is no way to project the total number of bugs from that unless they represent a statistically valid sample of the total set of possible testcases. The only way for them to be a statistically valid sample is if they are a random selection from the total set of possibles. If they were written on the basis of best judgement, they are not random.
That's why the RTG was necessary for the approach I described.
That sounds interesting: it seems like they used something like EM to estimate the coefficients of an infinite-term polynomial (hmm, it couldn't *really* have been infinite, so either the series converged or they just took many millions of terms). Once they believed they had modeled the rate of bug discovery as a function of time, they just solved the equation for bug-discovery-rate = 0. As you imply, actually finding the roots of a polynomial that large would be too challenging, so they just scanned forward in time until they found the first positive root.
Just guessing, but that seems a reasonable approach to do what you're describing. Does that sound about right?
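The "scan forward until the first positive root" step can be sketched in a few lines, using an invented discovery-rate polynomial (not the real model):

```python
def discovery_rate(t):
    """Hypothetical bugs-per-period model: 100 - 3t - 0.5t^2."""
    return 100.0 - 3.0 * t - 0.5 * t * t

def first_positive_root(f, step=0.01, t_max=1e6):
    """Step forward in time until f is no longer positive."""
    t = 0.0
    while f(t) > 0:
        t += step
        if t > t_max:
            raise RuntimeError("no root found before t_max")
    return t   # first t at which no more bugs are predicted

print(round(first_positive_root(discovery_rate), 2))   # -> 11.46
```

For this toy quadratic the exact root is sqrt(209) - 3 ≈ 11.457, so the scan lands within one step of it; integrating the rate from 0 to that root would then give the "total possible bugs" figure.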