PerlMonks  

...in order for your probability calculation to make sense, the sample(s) used to estimate the total population are required to be random samples.

That's correct.

... the testcases the programmers produce are done on the basis of experience (or best judgement)... the testcases they produce only cover 1% of the possible test cases and detect 2 bugs, there is no way to project the total number of bugs from that.

If I gave you the impression that the "best judgement" sample was a random sample from which the number of possible bugs was estimated, my bad. I think we both know how random sampling works.

I don't know what technique they actually used (if I did, I would be a psychic and it would be voodoo), so I can't explain how their stuff works. But I can explain how such an estimate of the number of possible bugs is possible.

One simple possibility is to use regression (i.e. a model-based estimation), as illustrated below.

     |     .      
     |   .  /  . .
no.  |   . / . . 
of   | . ./ .
bugs | . /  .
     |  / .
     +----------------
      no. of testcases

If there's a correlation (not necessarily linear) between the number of testcases (independent variable) and the number of bugs (dependent variable), we could use regression to estimate the total number of possible bugs, assuming the total number of testcases is known and bounded.

There's no limit on which or how many independent variables you may use, nor on the choice of model.
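
To make the regression idea concrete, here is a minimal Python sketch. The data points, the straight-line model and the figure of 200 total testcases are all made up for illustration; nothing here is what "they" actually did, and extrapolating this far beyond the observed range is risky with any model.

```python
# Hypothetical data: bugs found after running a given number of testcases.
# These numbers are invented purely for illustration.
testcases = [10, 20, 30, 40, 50]
bugs = [3, 5, 8, 9, 12]


def fit_line(xs, ys):
    """Ordinary least squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Value of the fitted line at x."""
    return intercept + slope * x


intercept, slope = fit_line(testcases, bugs)
total_testcases = 200  # assumption: the total is known and bounded
estimated_total_bugs = predict(intercept, slope, total_testcases)
print(f"estimated bugs at {total_testcases} testcases: {estimated_total_bugs:.1f}")
```

A straight line is only the simplest choice; a curve that flattens out as testcases accumulate would probably be a more believable model for bug discovery.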

Speaking of voodoo, in time series (since you indirectly mentioned Mandelbrot, which made me think of fractals, which made me think of time series), you can do a bispectrum test to check whether a series is linearly predictable without knowing what kind of process generated it. Pretty cool "voodoo." It's like saying I don't know where Homer came from but I'm sure he's blind.

And financial time series often almost follow a random walk process, which sometimes results in a "long memory" process. That is, the underlying process is scale-independent. In other words, if x(t) = a x(t-1) + e, where e = random noise, you get (more or less) the same "a" regardless of the unit of measurement, be it daily, weekly, etc. Hence, the process is self-similar (statistically). Hence, it's a "fractal"!

Since a random variable (such as the number of bugs), or better yet a random/stochastic process, could be a special case of a fractal, that's where Mandelbrot (the "statistician") could come in.
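
A small Python sketch of the "same a at every scale" point, assuming a pure simulated random walk (so the true a is 1) and treating every 5th and every 20th point as stand-ins for "weekly" and "monthly" data. The function names, the seed and those step sizes are my own choices for illustration.

```python
import random


def simulate_random_walk(n, seed=0):
    """Simulate x(t) = x(t-1) + e with standard normal noise e."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(x[-1] + rng.gauss(0.0, 1.0))
    return x


def estimate_a(series):
    """Least squares estimate of a in x(t) = a * x(t-1) + e (no intercept)."""
    num = sum(cur * prev for prev, cur in zip(series, series[1:]))
    den = sum(prev * prev for prev in series[:-1])
    return num / den


daily = simulate_random_walk(5000)
for label, step in [("daily", 1), ("weekly", 5), ("monthly", 20)]:
    a_hat = estimate_a(daily[::step])  # keep every step-th observation
    print(f"{label:8s} a = {a_hat:.3f}")
```

All three estimates should land close to 1. With |a| clearly below 1, sampling every k-th point would give roughly a**k instead, so it is the near-random-walk case that makes the coefficient look the same across scales.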

*     *     *     *     *     *

Since I mentioned correlation, I might as well point out what I left out of the previous discussion of bug estimation: the "margin of error" (heard often on TV), or variation, or variance. I didn't want to confuse people with too many new concepts.

If two random variables are correlated (say, the numbers of bugs found by two testers; the number of bugs can be treated as a random variable even if the testcases are not randomly selected), a positive correlation leads to a higher variance of their sum, whereas a negative correlation leads to a lower one. The intuition goes like this: negative correlation leads to cancelation, hence less variance (10 + -10 = 0), while positive correlation means things tend to come all at once, hence higher variance (10 + 10 = 20).

Since bugs tend to be positively correlated (not due to sampling), a simple random sampling estimate based on an independence assumption underestimates the variance, the "margin of error," or the severity of the bug situation.
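
The 10 + 10 versus 10 + -10 intuition can be checked with a few lines of Python. The two "testers" below are toy series with correlation exactly +1 or -1, chosen only to make the effect obvious; the underlying identity is Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).

```python
from statistics import pvariance

x = [10, -10, 10, -10]
y_pos = [10, -10, 10, -10]   # moves with x (correlation +1)
y_neg = [-10, 10, -10, 10]   # moves against x (correlation -1)


def variance_of_sum(a, b):
    """Population variance of the element-wise sum of two series."""
    return pvariance([ai + bi for ai, bi in zip(a, b)])


# What you would predict if you (wrongly) assumed independence:
independent_guess = pvariance(x) + pvariance(y_pos)

print("assuming independence:", independent_guess)
print("positive correlation: ", variance_of_sum(x, y_pos))
print("negative correlation: ", variance_of_sum(x, y_neg))
```

The independence guess sits between the two: too low when the correlation is positive, too high when it is negative.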

*     *     *     *     *     *

That leads us to bugs (more precisely, the number of bugs) as a random variable/process. You can consider the "randomness" a result of 1) random sampling or 2) the underlying process that generates those bugs.

We have already talked about bugs as a random variable due to random sampling. Bugs as a random process is a new topic, which I suppose is what your people were doing back then.

I mentioned time series (a random process), fractals and Mandelbrot. Since bugs could be a random process, which could be a time series, which could be a "fractal," it wouldn't be hard for Mandelbrot to figure out that the total number of possible bugs could be related to the upper bound of a time series. (I'm not saying that's what they did. I don't know what they did.)

Many processes generate a time series that is bounded above (and/or below) in a probabilistic or deterministic sense (a random walk is one that's not). If we can estimate the process that generates the values of a variable (such as bugs), we can tell the highest possible value of that variable.

One may object: bugs generated by an underlying random process? It makes no sense. Well, the process is merely a model for prediction. It makes no difference whether it objectively exists or not, as long as the model gives us the right answer. (Think about how a lot of people found quantum mechanics absurd, yet it is just a model that works.)

Treating bugs as a random process means we assume there is correlation among bugs (temporal, spatial or whatever). Otherwise it's just white noise and a meaningless model. On the other hand, correlation complicates the estimation in random sampling. So we can always explore the underlying structure of a variable and choose the right model and methodology to our advantage.
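
As one example of a process that is bounded in the deterministic sense, here is a Python sketch of x(t) = a x(t-1) + e with |a| < 1 and noise limited to a fixed range. The values a = 0.8 and a noise limit of 1 are arbitrary assumptions of mine; the point is only that once a and the noise range are known (or estimated), the ceiling noise_limit / (1 - |a|) follows.

```python
import random


def simulate_ar1(a, noise_limit, n, seed=0):
    """Simulate x(t) = a * x(t-1) + e, with e uniform on [-noise_limit, noise_limit]."""
    rng = random.Random(seed)
    x = 0.0
    path = []
    for _ in range(n):
        x = a * x + rng.uniform(-noise_limit, noise_limit)
        path.append(x)
    return path


a = 0.8             # assumed coefficient, |a| < 1
noise_limit = 1.0   # assumed hard limit on the noise
bound = noise_limit / (1 - abs(a))   # about 5: |x(t)| can never exceed this

path = simulate_ar1(a, noise_limit, 10_000)
largest = max(abs(v) for v in path)
print(f"largest |x| seen: {largest:.2f}, theoretical bound: {bound:.2f}")
```

With a = 1 (a random walk) the denominator is zero and no such ceiling exists, which is the exception noted above.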


In reply to Re:^6: Software Design Resources by chunlou
in thread Software Design Resources by Anonymous Monk
