http://www.perlmonks.org?node_id=285718


in reply to Re: Re: Software Design Resources
in thread Software Design Resources

Sorry. I don't think that I made that bit very clear. The statistics I gave were for the coverage achieved by the teams of test case writers prior to the introduction of the Random Testcase Generator. That is to say, there were a bunch of coders charged with the task of sitting down and writing programs to exercise given subsets of the APIs. They used their 'best judgement' to write the programs such that they covered the edge cases of each individual function and combination of functions.

The 15% was a purely mathematical count of the APIs exercised, derived by simply grep'ing and counting them from the assembled test suite.

The 10% duplication meant that of the 15% that had actually been exercised, two thirds of them had been exercised in more than one testcase. For some parts of the API set this is inevitable. You can't do anything when testing a GUI API set without having called CreateWindow(), for example, but this did not explain all the duplication.

Much of it came down to the fact that given any two programmers with similar experience, their best judgement, based on their prior experiences, will lead them to similar conclusions about what needs testing. Hence, they will tend towards testing similar things. Even though they are each assigned a particular set of APIs to test, it's inevitable that there will be some overlap. Given a team of nearly 100 programmers from different backgrounds, you would think that their ranges of experience would lead to a fairly wide coverage, but it doesn't happen that way. They all tend to concentrate their efforts on similar clusters of "suspect" APIs. Worse, they all tend to assume that some APIs are not necessary to test, for similar reasons.

As for the 1% of possible bugs: the bit that I consider to be tantamount to voodoo is the determination of the number of possible bugs. In order to state that "only 1% had been found", it is necessary to know how many were found and how many could have been found. How do you begin to determine how many there could be?

I fully understand the mechanism whereby it is possible to estimate how many bugs will be found on the basis of how many have been found, and projecting that forward, once the test cases are being produced randomly. This is fairly simple population sampling, standard deviation stuff. You only need to know that the sample is a truly random selection from the total population. You don't need to know the total population size.

But the previous two statistics went solely to prove that the generation of those testcases was anything but random. To conclude, from the deterministic count of the bugs that had been found, that only 1% of possible bugs had been discovered by that set of testcases means that they had to have determined, or at least estimated to some degree of accuracy, the total possible bug count.

I have a good degree of faith in the guys doing the work, and I was treated to nearly four hours of explanation of the methodology involved. The program that produced that statistic ran on a top-of-the-range S-370 quad processor system and consumed prodigious amounts of cpu-time. The datasets were not very large.

It involved an iterative process of refining a complex polynomial with an infinite number of terms, until it approximated the discovery rates and coverage that had been determined by counting. Once the polynomial in question had been refined until it closely approximated the real-world statistics it was developed to model, it was then iterated forward into the future to project to a point where no more bugs would be discovered. In real time this would have amounted to decades or maybe centuries. Once that point was reached, they then had the estimate of the number of bugs that could be discovered, and it was this figure that was used to calculate the 1% figure.

Believe me. This went way beyond the graduate-level statistics with which I was familiar at that time, though I have since forgotten much of it.

I'm going to stick to my guns and say that this was the deepest statistical voodoo that I have any wish to know about:)


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.

Re: Re: Re: Re: Software Design Resources
by Anonymous Monk on Aug 24, 2003 at 07:40 UTC

    That sounds interesting: it seems like they did EM to estimate the coefficients of an infinite-term polynomial (hmm, it couldn't *really* have been infinite, so either the polynomial converged or they just took many millions of terms). Once they believed they had modeled the rate of bug-discovery as a function of time they just solved the equation for bug-discovery = 0. As you imply, actually finding the roots of a polynomial that large would be too challenging, so they just scanned forward in time until they found the first (+ve) root.

    Just guessing, but that seems a reasonable approach to do what you're describing. Does that sound about right?
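
For readers who want to see the "scan forward until the first positive root" idea in concrete form, here is a minimal Python sketch. It is not the program described above: the coefficients are made up, the polynomial has three terms rather than millions, and the names `rate`, `first_positive_root` and `cumulative` are hypothetical. It only shows the mechanics of stepping forward in time until the modelled discovery rate reaches zero, then integrating the rate up to that point to get a projected total.

```python
def rate(coeffs, t):
    """Modelled bug-discovery rate at time t (Horner evaluation)."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * t + c
    return result


def first_positive_root(coeffs, step=0.1, t_max=1000.0, tol=1e-9):
    """Scan forward from t=0 until the rate stops being positive, then bisect."""
    lo = 0.0
    while lo < t_max:
        hi = lo + step
        if rate(coeffs, lo) > 0 >= rate(coeffs, hi):
            # The sign change lies in (lo, hi]; narrow it down by bisection.
            while hi - lo > tol:
                mid = (lo + hi) / 2
                if rate(coeffs, mid) > 0:
                    lo = mid
                else:
                    hi = mid
            return (lo + hi) / 2
        lo = hi
    return None  # no root found before t_max


def cumulative(coeffs, t):
    """Integral of the rate polynomial from 0 to t: projected bugs found by time t."""
    return sum(c * t ** (i + 1) / (i + 1) for i, c in enumerate(coeffs))


# Hypothetical model: rate(t) = 20 - 3t + 0.1t^2 (lowest-order coefficient first).
coeffs = [20.0, -3.0, 0.1]
t_end = first_positive_root(coeffs)
total_bugs = cumulative(coeffs, t_end)
print(f"discovery rate reaches zero at t = {t_end:.3f}")
print(f"projected total bugs = {total_bugs:.2f}")
```

Dividing the number of bugs actually found by `total_bugs` would then give a "percentage of possible bugs found" figure of the kind discussed in this thread, for whatever that model is worth.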

      To the level that I understand (and remember), that seems a perfect description:)



Re: Re: Re: Re: Software Design Resources
by chunlou (Curate) on Aug 22, 2003 at 09:53 UTC

    I think I'm the one who's not being clear, not you.

    I suppose we're both talking about the "best judgement" introducing bias.

    I fully understand the mechanism whereby it is possible to estimate how many bugs will be found on the basis of how many have been found, and projecting that forward, once the test cases are being produced randomly.

    Perhaps the fuzziness of human language gets in the way here. Any estimate is an estimate of how many could be found, never of how many will be found. To see that, I'll use the catch-and-release example.

    Suppose the total number (the actual T) of unknown bugs is actually 100. Tester One was assigned 20 (A) test cases; Tester Two 20 (B) also. 2 (C) bugs in common were found. The estimate is A × B / C = 20 × 20 / 2 = 200 total (possible) bugs (notice the large margin of error). Does it mean you will find 200 bugs given infinite time? Of course not, since we already know that there are 100 actual bugs. The estimate is 200, nevertheless. 200 is the possible total number of bugs you could find, based on the actual counts available at the moment.
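
The arithmetic in the example above is the classic capture-recapture (Lincoln-Petersen) estimate. A minimal Python sketch, where the function name and argument names are my own and not from any library:

```python
def capture_recapture_estimate(count_a, count_b, count_common):
    """Estimate the total population as A * B / C.

    count_a, count_b: the two testers' counts (A and B in the post above).
    count_common:     how many items both testers found (C).
    """
    if count_common == 0:
        # With no overlap the estimate is unbounded; refuse to guess.
        raise ValueError("no overlap between the two samples")
    return count_a * count_b / count_common


estimate = capture_recapture_estimate(20, 20, 2)
print(estimate)
```

Note how sensitive the result is to C: one more bug in common (C = 3) drops the estimate from 200 to about 133, which is the large margin of error mentioned above.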

    The technique and the skillset will affect the accuracy of an estimate, but the principle is still the same.

    *     *     *     *     *     *

    One side note, not to critique their method, just to provide complementary information: one should be careful when using a polynomial to fit data. A polynomial can approximate any continuous function on a closed interval, given a high enough degree (it's a theorem, the Weierstrass approximation theorem). Similarly, it can fit any data, including white noise.

    Suppose you're testing the response time of your server under various levels of workload. You try a linear fit (a straight line) and a polynomial of degree two (a+bx+cx^2). The polynomial fits the data better and you have the following.

    
                X X
       .  .  X  .    * 
    . .   X    .       * 
        X  .             *
     .X  .
    .X .  .
    X
    
    .: data points
    X: fitted to actual data
    *: prediction, extrapolation
    

    But the extrapolation defies common sense (it says response time improves as the workload increases). This kind of error is very hard to detect in higher dimensions, especially when you don't actually know what to expect.

    The moral: A more complicated model does not always improve your prediction; it could even worsen it in some cases.
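
The server example can be reproduced in a few lines of Python. This is only a sketch: the workload and response-time numbers are invented, and `polyfit`, `predict` and `sse` are small hand-rolled helpers (a least-squares fit via the normal equations), not library calls. The point is that the degree-two fit has the smaller error on the observed data and still predicts nonsense once you extrapolate.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients, lowest order first."""
    n = degree + 1
    # Normal equations A c = b for c[0] + c[1]*x + ... + c[degree]*x**degree.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            f = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= f * a[col][k]
            b[row] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs


def predict(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))


def sse(coeffs, xs, ys):
    """Sum of squared errors of the fit on the given data."""
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys))


# Invented data: response time rises with workload and flattens out.
workload = [1, 2, 3, 4, 5, 6, 7, 8]
response = [1.0, 1.9, 2.6, 3.1, 3.5, 3.7, 3.9, 4.0]

line = polyfit(workload, response, 1)
quad = polyfit(workload, response, 2)

print("in-sample error, line:", sse(line, workload, response))
print("in-sample error, quad:", sse(quad, workload, response))
print("quad prediction at workload 8: ", predict(quad, 8))
print("quad prediction at workload 16:", predict(quad, 16))
```

With these numbers the quadratic bends over just inside the observed range, so doubling the workload is "predicted" to make the server respond faster than it does now.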

      Fair enough:) I don't have the math to argue with you on this.

      However, I would also not take it upon myself to argue with a certain IBM statistician whose work was the basis of at least some (I think, fairly major) elements of the statistics used in the process I am describing.



        It's not about argument. Just mutual learning.

        Funnily enough, when that "statistician" (I think he's better known as a great mathematician) first came up with some very genuine estimation technique, the other engineers were very skeptical since they didn't know what he was doing. But the stuff worked. (He mentioned it in a TV documentary. He didn't say what that technique actually was.)

        As your sig says, "Examine what is said, not who speaks." I don't put faith in someone just because he has a PhD. In the business world, many PhDs have given dreadful advice. (A consultant (a PhD, who could give you a four-hour lecture on anything) advised a web development house that they should lay off most of their programmers and sales reps, partly because many of them were "not working hard enough." The firm eventually failed, not because people weren't working hard enough but partly because the business model (for which the consultant was partly responsible) wasn't working.)

      Sorry for the second post, but I thought about this some more and I wanted to get your reaction to those thoughts. If I just modified the last post you may not have seen the update.

      The estimate is 200 total (possible) bugs (notice the large margin of error). (and the rest of that paragraph)

      I am under the strong, and I believe well-founded, impression that in order for your probability calculation to make sense, the sample(s) used to estimate the total population are required to be random samples. This would not be the case if the testcases the programmers produce are done on the basis of experience (or best judgement).

      If programmers A & B both write 20 identical test cases, which is unlikely, but not statistically impossible, then counting them as unique invalidates the statistics.

      If the testcases they produce only cover 1% of the possible test cases and detect 2 bugs, there is no way to project the total number of bugs from that unless they represent a statistically valid sample from the total set of possible testcases. The only way for them to be a statistically valid sample is if they are a random selection from the total set of possibles. If they were written on the basis of best judgement they are not random.

      That's why the RTG was necessary for the approach I described.



        ...in order for your probability calculation to make sense, the sample(s) used to estimate the total population are required to be random samples.

        That's correct.

        ... the testcases the programmers produce are done on the basis of experience (or best judgement)... the testcases they produce only cover 1% of the possible test cases and detect 2 bugs, there is no way to project the total number of bugs from that.

        If I gave you the wrong impression that the "best judgement" sample constituted a random sample based on which the number of possible bugs were estimated, my bad. I think we both know how random sampling works.

        I don't know what technique they actually used (if I did, I would be a psychic and it would be voodoo), so I can't explain how their stuff works. But I can tell how such an estimation of possible bugs is possible.

        One simple possibility is to use regression (i.e. a model-based estimation), as illustrated below.

             |     .      
             |   .  /  . .
        no.  |   . / . . 
        of   | . ./ .
        bugs | . /  .
             |  / .
             +----------------
              no. of testcases
        

        If there's a correlation (not necessarily linear) between the number of testcases (independent variable) and the number of bugs (dependent variable), we could use regression to estimate the total number of possible bugs, assuming the total number of testcases is known and bounded.

        There's no limit on which or how many independent variables you may use, nor on which model.
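
As a rough illustration of the regression idea, here is a Python sketch using the simplest possible model, a straight line fitted by ordinary least squares. The counts are invented, the assumed total of 1000 possible testcases is invented, and `fit_line` is a hypothetical helper. Nothing here is claimed to be what the team in question did.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Invented observations: cumulative testcases run vs. cumulative bugs found.
testcases = [10, 20, 30, 40, 50]
bugs = [3, 5, 8, 9, 12]

intercept, slope = fit_line(testcases, bugs)

# Assumption: the total number of possible testcases is known and bounded.
total_testcases = 1000
estimated_total_bugs = intercept + slope * total_testcases
print(f"bugs per testcase (slope): {slope:.3f}")
print(f"estimated total bugs at {total_testcases} testcases: {estimated_total_bugs:.1f}")
```

The estimate extrapolates twenty times beyond the observed range, so the earlier warning about extrapolating fitted models applies in full.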

        Speaking of voodoo, in time series (since you indirectly mentioned Mandelbrot, which made me think of fractals, which made me think of time series), you can do a bispectrum test to test whether a series is linearly predictable or not without knowing what kind of process generated the series. Pretty cool "voodoo." It's like saying I don't know where Homer came from but I'm sure he's blind.

        And financial time series often almost follow a random walk process, which sometimes results in a "long memory" process. That is, the underlying process is scale-independent. In other words, if x(t) = a x(t-1) + e, where e = random noise, you get (more or less) the same "a" regardless of the unit of measurement, be it daily, weekly, etc. Hence, the process is self-similar (statistically). Hence, it's a "fractal"!

        Since a random variable (such as number of bugs) or better yet a random/stochastic process could be a special case of a fractal, that's where Mandelbrot (the "statistician") could come in.
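
The x(t) = a x(t-1) + e remark can be checked with a small simulation. This Python sketch generates a random walk (so the true a is 1), then estimates a from the full "daily" series and from every fifth point, a stand-in for "weekly" data. The seed, series length and the `estimate_a` helper are all my own choices for illustration.

```python
import random


def estimate_a(series):
    """Least-squares estimate of a in x(t) = a * x(t-1) + e (no intercept)."""
    num = sum(prev * cur for prev, cur in zip(series, series[1:]))
    den = sum(prev * prev for prev in series[:-1])
    return num / den


rng = random.Random(42)  # fixed seed so the run is repeatable

# A random walk: each value is the previous one plus Gaussian noise.
walk = [0.0]
for _ in range(5000):
    walk.append(walk[-1] + rng.gauss(0.0, 1.0))

daily = walk
weekly = walk[::5]  # keep every fifth observation

a_daily = estimate_a(daily)
a_weekly = estimate_a(weekly)
print(f"a estimated from daily data:  {a_daily:.4f}")
print(f"a estimated from weekly data: {a_weekly:.4f}")
```

Both estimates should come out close to 1. This only illustrates the near-random-walk case: for a process with |a| clearly below 1, sampling every k-th point gives a coefficient of a^k, so the estimate would change with the unit of measurement.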

        *     *     *     *     *     *

        Since I mentioned correlation, I might as well point out that what I didn't mention in the previous discussion of bug estimation was "margin of error" (heard on TV often) or variation or variance (I didn't want to confuse people with too many new concepts).

        If two random variables (say, the numbers of bugs found by two testers; the number of bugs itself can be treated as a random variable even if the testcases are not randomly selected) are correlated, a positive correlation leads to higher variance in their sum, whereas a negative correlation leads to lower variance. The intuition goes like this: negative correlation leads to cancellation, hence less variance (10 + -10 = 0), while positive correlation means things tend to come all at once, hence higher variance (10 + 10 = 20).

        Since bugs tend to have positive correlation (not due to sampling), a simple random sampling estimate based upon an independence assumption underestimates the variance, the "margin of error," or the severity of the bug situation.
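
The variance point follows from the identity Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y). A tiny Python sketch with made-up numbers (perfectly correlated and perfectly anti-correlated pairs, chosen only because the arithmetic is easy to follow):

```python
from statistics import pvariance

x = [1, 2, 3, 4]
y_pos = [1, 2, 3, 4]  # moves with x (correlation +1)
y_neg = [4, 3, 2, 1]  # moves against x (correlation -1)

# What an independence assumption would predict: just add the variances.
assumed_independent = pvariance(x) + pvariance(y_pos)

# What the sums actually do.
var_pos = pvariance([a + b for a, b in zip(x, y_pos)])
var_neg = pvariance([a + b for a, b in zip(x, y_neg)])

print("assumed under independence:", assumed_independent)
print("positively correlated sum: ", var_pos)
print("negatively correlated sum: ", var_neg)
```

The positively correlated sum has twice the variance that the independence assumption predicts, and the negatively correlated sum has none at all.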

        *     *     *     *     *     *

        That leads us to talk about bugs (more precisely, number of bugs) as a random variable/process. You can consider the "randomness" to be a result of 1) random sampling or 2) the underlying process that generates those bugs.

        Bugs as random variable due to random sample we have talked about. Bugs as random process is a new topic, which I suppose was what your people were doing back then back there.

        I mentioned time series (a random process) and fractal and Mandelbrot. Since bugs could be a random process could be a time series could be a "fractal," it wouldn't be hard for Mandelbrot to figure out that the total possible bugs could be related to the upper bound of a time series. (I'm not saying that's what they did. I don't know what they did.)

        Many processes will generate a time series that is bounded above (and/or below) in a probabilistic or deterministic sense (a random walk is one that's not). If we can estimate the process that generates the values of a variable (such as bugs), we can tell the highest possible value of that variable.

        One may feel: bugs generated by an underlying random process? It makes no sense. Well, the process is merely a model for prediction. It makes no difference whether it objectively exists or not as long as the model gives us the right answer. (Think about how a lot of people found quantum mechanics absurd, which is just a model that works.)

        Treating bugs as a random process means we assume there is correlation among bugs (temporal, spatial or whatever). Otherwise it's just white noise and a meaningless model. On the other hand, correlation complicates the estimation in random sampling. So, we can always explore the underlying structure of a variable and choose the right model and methodology accordingly, to our advantage.