in reply to Software Design Resources
In the mideighties, I was briefly involved in the development of an IBM internal language called PLS86. A descendant of PL1 language that IBM started developing in the late 50s and early 60s, it was targeted at the x86 platform and one of its major design goals was to allow the production of "mathematically provably correct programs". The language was briefly considered for use in the development of IBM's then topsecret, and asyet unnamed, OS/2. (SO inventive. Not!). Anyway, the language was dropped (for use in developing OS/2) not least because IBMs then partners in this development, MS, refused to use it and so C was used instead.
I was friends with one of the guys who did considerable work on the development of the "mathematical proof" bit of the language. I don't recall (probably because I didn't really understand much of it) all of the details, but there was a point in the development of the language, which went ahead anyway, where the provability requirement was quietly dropped.
The reason came down to the simple statement: It wasn't possible!
Some of the bits I remember that lie behind the conclusion are:
Even with it being a totally precompiled language, without the dynamic features of some other languages (eg. eval in perl, INTERPRET in REXX etc.), unless the program is entirely statically linked, there is no way to ensure that you will not get failures due to the unavailability or incorrect runtime linking of dynamically bound code segments.
In order to prove the program is correct, it becomes necessary to add additional code to verify the inputs to every part of an algorithm and more code to verify the output. However, there are many problems with this.
 The first set are that the only effective ways of doing this for any possible set of inputs and outcomes are
 to duplicate the entire algorithm using an alternative set of instructions.
This is the method used by some lifecritical projects. For example, the software used for airliner flybywire systems is developed from the same specs by three different software teams in complete isolation (cleanroom environments) and targeted at 3 different (types of) processors. There is then a fourth processor that supplies the same input to each of the three and compares their outputs.
If one processor reaches a different conclusion to the other two, it's result is ignored. If it consistently produces a different result, then that processor is flagged as being broken. The idea is that by using three different types of processor, no two are likely to be affected in the same way by a hardware, microcode or software bug (think Pentium floating point bug!).
By having the teams that develop each set of software work in isolation, it is less likely that they will make the same (wrong) assumptions. So long as two of the three processors arrive at the same result at any given step of the process, it's quite likely that they are correct.
Whilst this method works for the most part, it is hugely costly and so not cost effective for most types of software.
 to have a lookup table that maps inputs to outputs.
Even allowing for the fact that the storage required to maintain these tables would be huge for anything other than the most trivial of projects, there is still the possibility that the inputs can be corrupted before they reach the process. Memory (RAM or DISK) failure, sensor failure, human input error etc.
Beyond even that, the resultants in the lookup tables would have to be generated. For anything beyond the trivial this would need to be done using software.
Catch22, how do you ensure that that program is correct?
 The second set of problems are that the additional code added to verify algorithmic correctness are
 themselves softwareand so subject to bugs, and failures.
Catch22 again,
 adds more code to the project thereby increasing complexityComplexity and volume are the prime multipliers for bug rates.
all the additional processing costs hugely in terms of both development and execution time.
People have, and still do, attempt to address this by making the test/ verification code conditional. That is to say, they add the tests and verification code in such a way that it can be 'turned off' in the production environment. The problem with that is that removing the code has a measurable, but sometimes unpredictable effect on the overall algorithm. This is effectively the same as the Quantum Effects principle in Physicsyou cannot measure something without effecting it.
That isn't the complete story, but it's all I can remember for now.
What this means is, that for any realworld, costed, project, it becomes necessary to define a level of reliability that is 'acceptable' and then design your development and testing criteria to achieve that.
Testing is good, but it is not a silver bullet to bugs. It is impossible to achieve 100% coverage of tests. The first major project I was involved in testing (the GUI component of OS/2), had over 700 APIs, with an average of 4 parameters per API and some with as many as 13! Many of those parameters are themselves (pointers to) structures which can consist of as many as 40+ discrete fields, and each field can (for example) be 32 individual booleans; or integer values with ranges of + 2**31 or 04**32; or floats ranging from something like 1e308 to 1e+308.
Do the math! Even without considering the effects of interdependencies between the APIsyou can't draw a line until you've obtained a presentation space, you can't obtain a presentation space until you've obtained a device context and so onthe numbers become astronomical. Considering testing everything simple isn't possible given the projected life span of the earth, never mind that of human developers:)
Given you have an experienced programmer developing yetanotherversion of some project that s/he has done several of before, then they will probably be able to make some educated guesses about where the edge cases are and there by reduce the sets of tests to a manageable size. However, it was proved fairly comprehensively (to my satisfaction anyway), by some work done by another bit of IBM around the same time, that even experienced developers make bad guesses as to where the edge cases are when they move to projects that are even slightly different to those they have done before. In fact, that particular team showed that they make worse than random choices! And the more experienced they were (in similar but different project types), the worse their guesses were.
The upshot was that we ended up developing a random test case generator. The logic went like this. If you have a few computers generating random but valid sequences of API calls, along with other code to test the resultant program, then with you can use statisticsthe number of programs generated v the number of them that failedto determine the rate at which bugs are being found. By rerunning every test cases generated, both good and bad, each time a new build was released, you get an effective regression test. You also get a statistical measure of the rate at which earlier bugs are being fixed, which of them are reappearing etc. You can also break the statistics down by component, developer, day of the week etc. etc. This allows you to target your efforts to where they are of greatest benefit.
The effect of this was amazing and salutary. There had been many, many test cases written prior to the RTG being available. WIthin a month it was shown that all of the test case produced by programmers/ testers targeting their efforts according to their (or their superiors) best judgment, had covered less than 15% of the total APIs with 10% having been duplicated over and over, 5% of the possible parameter combinations, and found less than 1% of the possible bugs.
Don't ask me how they arrived at this last statistic, it was way to much voodoo for me, but I can tell you that within two months, they were beginning to predict the number of new bugs that would be found, old bugs that would be cured and old bugs that would resurface in any given week (with twice daily builds) with amazing accuracy. It was right around this time that the project got moved from the UK to the US and I never heard much more about it.
You might find this article interesting Coming: Failsafe software. The only way the software industry is going to move out of the metaphorical iron or maybe bronzeage, is when we start using computers to assist us in our work.
It's a strange thing. If you describe a problem in almost any field of endeavour to a programmer, he will nearly always suggest a way that he could improve or solve that problem using a computerexcept when the field of endeavour is his own!
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.
Re: Re: Software Design Resources
by chunlou (Curate) on Aug 22, 2003 at 07:42 UTC

Within a month it was shown that all of the test case produced by programmers/testers targeting their efforts according to their (or their superiors) best judgment, had covered less than 15% of the total APIs with 10% having been duplicated over and over, 5% of the possible parameter combinations, and found less than 1% of the possible bugs.
According to the limited description, the repeated 10% "duplication" was probably due to "best judgment" bias. If random sampling is used instead, it's almost impossible to have such a duplication across testers and overtime (though you still will have duplications "locally").
"Best judgement" bias is like trying to estimate the total number of Perl programmers in the world by asking people what language they use in Perl websites alone.
The "1% of the possible bugs" estimate was possibly derived from something like the catchandrelease method I mentioned in my another reply in this thread.
The the unprobabilistic approach of "best judgement" and the probabilistic estimate of "1% of the possible bugs" seem strange to occur together, however.
* * *
* * *
In case "probability" may sound like a voodoo, consider this: In a population of 100, if all are ages 25, what sample size do you need to come up with a 100% confidence estimate of the population average age? One, of course.
If 90 of them are of age 30 and 10 age 20 (average 29), a random sample of size one give you an "average" 30 90% of the time. Pretty good "estimate" actually, considering you don't have to ask all 100 of them.
The worst case (in term of sample size needed) is a 50/50 split.
So, a population of one million all aged 25 only need a sample size one to get a good (perfect in this case) estimate, whereas a population of 30, 10 aged 30, 10 aged 20, 10 aged 5 needs a larger sample size.
The moral: the quality of a statistical estimate is affected by the heterogeneity of the population, not its size. It's very counterintuitive to many people, I know.
 [reply] 

Sorry. I don't think that I made that bit very clear. The statistics I gave were for the coverage achieved by the teams of test case writers prior to the introduction of the Random Testcase Generator. That is to say, there where a bunch of coders charged with the task of sitting down and writing programs to exercise given subsets of the APIs. They used thier 'best judgement' to write the programs such that they covered the edge cases of each individual function an combination of functions.
The 15% was a purely mathematical count of the APIs exercised derived by simply grep'ing and counting them from the assembled test suit.
The 10% duplication meant that of the 15% that had actually been exercised, two thirds of them had been exercised in more than one testcase. For some parts of the API set this is enevitable. You can't do anything when testing a GUI API set without having called CreateWindow() for example, but this did not explain all the duplication.
Much of it came down to the fact that given any two programmers with similar experience, their best judgement, based on their prior experiences, will lead them to similar conclusions about what needs testing. Hence, they will tend towards testing similar things. Even though they are each assigned a particular set of API's to test, it's enevitable that there will be some overlap. Given a team of nearly 100 programmers from different backgrounds, you would think that their ranges of experience would lead to a fairly wide coverage, but it doesn't happen that way. They all tend to concentrate their efforts on similar clusters of "suspect" APIs. Worse, they all tend to assume that some APIs are not necessary to test, for similar reasons.
As for the 1% of possible bugs. The bit that I consider to be tantamount to voodoo, is the determination of the number of possible bugs. In order to state that "only 1 had been found", it is necessary to know how many were found and how many could have been found. How do you begin to determine how many there could be?
I fully understand the mechanism whereby it is possible to estimate how many bugs will be found on the basis of how many have been found, and projecting that forward, once the test cases are being produced randomly. This is fairly simple population sampling, standard deviation stuff. You only need to know that the sample is a truely random selection from the total population. You don't need to know the total population size.
But to conclude that 1% of possible bugs had been discovered by a set of testcases that the previous 2 statistics went soley to prove that their generation was anything but random, from the determanistic count of those that had been found, means that they had to have determined, or at least estimated to some degree of accuracy, the total possible bug count.
I have a good degree of faith in the guys doing the work, and I was treated to nearly four hours of explanation of the methodlogy involved. The program that produced that statistic ran on a top of the range S370 quad processor system and consumed prodigious amounts of cputime. The datasets were not very large.
It involved a iterative process of refining a complex polynomial with an infinite number of terms, until it approximated the discovery rates and coverage that had been determined by counting. Once the polynimial in question had been refined until it closely approximated the realworld statistics it was developed to model, it was then iterated forward into the future to project to a point where no more bugs would be discovered. In real time this would have amounted to decades or maybe centuries. Once that point was reached, they then had the estimate of the number of bugs that could be discovered and it was this figure that was used to calculate the 1% figure.
Beleive me. This went way beyond graduate level statistics with which I was familar with at that time, though I have since forgotten much of it.
I'm going to stick to my guns and say that this was the deepest statistical voodoo that I have any wish to know about:)
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.
 [reply] 

That sounds interesting: it seems like they did EM to estimate the coefficients of an infiniteterm polynomial (hmm, it couldn't *really* have been infinite, so either the polynomial converged or they just took many millions of terms). Once they believed they had modeled the rate of bugdiscovery as a function of time they just solved the equation for bugdiscovery = 0. As you imply, actually finding the roots of a polynomial that large would be too challenging, so they just scanned forward in time until they found the first (+ve) root.
Just guessing, but that seems a reasonable approach to do what you're describing. Does that sound about right?
 [reply] 


I think I'm the one who's not being clear, not you.
I suppose we're both talking about the "best judgement" introducing bias.
I fully understand the mechanism whereby it is possible to estimate how many bugs will be found on the basis of how many have been found, and projecting that forward, once the test cases are being produced randomly.
Perhaps the fuzziness of human language get in the way here. Any estimate is to estmate how many could be found, never ever how many will be found. To see that, I'll use the catchandrelease example.
Suppose the total number (the actual T) of unknown bugs is actually 100. Tester One was assigned with 20 (A) test cases; Tester Two 20 (B) also. 2 (C) bugs in common were found. The estimate is 200 total (possible) bugs (notice the large margin of error). Does it mean you will find 200 bugs given infinite time? Of course not, since we already know that there're 100 actual bugs. The estimate is 200, nevertheless. 200 is the possible total bugs you could find, based on actual available counts at the moment.
The technique and the skillset will affect the accurary of an estimate but the principle is still the same.
* * *
* * *
One side note, not to critique their method, just to provide complementing information, one should be careful when using a polynimial to fit data. Polynimial can fit any mathematical functions, given enough degrees (it's a theorem). Similarly it can fit any data, include white noise.
Consider you're testing the response time of your server in response to various levels of workload. You try a linear fit (a straight line) and polynomial of degree two (a+bx+cx^2). The polynomial fits the data better and you have the following.
X X
. . X . *
. . X . *
X . *
.X .
.X . .
X
.: data points
X: fitted to actual data
*: prediction, extrapolation
But it doesn't fit into the common sense (response time improves as the workload increases). This kind of error is very hard to detect in higher dimension, especially when you don't actually know what to expect.
The moral: A more complicated model does not always improve your prediction; it could even worsen it in some cases.
 [reply] 






