perlquestion
DrHyde
I recently uploaded [cpan://Net::Random] to the CPAN. It gathers data from a couple of online sources of truly random data (which I trust to really *be* random, that's not the issue here), and uses that to generate random numbers in the user's chosen range. For instance, you might want a bunch of random 0s and 1s to simulate tossing a coin, or random numbers from 1 to 6 to simulate a die roll.
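To illustrate the kind of mistake I'm worried about: the step that maps raw random bytes onto a user's range can introduce modulo bias if done naively (taking <c>$byte % 6</c> makes 0-3 slightly more likely than 4-5, because 256 isn't a multiple of 6). One bias-free approach is rejection sampling; this is just an illustrative sketch with made-up names, not necessarily what Net::Random actually does internally:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rejection sampling: map random bytes (0..255) onto [$min, $max]
# without modulo bias. Hypothetical sketch, not Net::Random's code.
sub bytes_to_range {
    my ($min, $max, $next_byte) = @_;  # $next_byte returns one random byte
    my $span  = $max - $min + 1;
    # Largest multiple of $span that fits in 0..255; bytes at or above
    # this cutoff are thrown away and a fresh byte is fetched instead.
    my $limit = int(256 / $span) * $span;
    my $byte;
    do { $byte = $next_byte->() } until $byte < $limit;
    return $min + ($byte % $span);
}

# Example: simulate die rolls from a canned "random" byte stream.
# For a d6 the cutoff is 252, so the 255 below gets rejected.
my @fake = (37, 250, 4, 99, 255, 12);
my $i    = 0;
my $src  = sub { $fake[ $i++ % @fake ] };
my @rolls = map { bytes_to_range(1, 6, $src) } 1 .. 5;
```

Without the rejection step, every value a plain <c>%</c> can't reach uniformly gets quietly skewed - exactly the sort of subtle bias I want my tests to catch.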
<p>Given that I trust the original data to be random, I still need to be sure that what I'm doing to the data isn't biassing it.<readmore> Such bias could be introduced in various ways, the two I can think of off the top of my head are:
<ul><li>my algorithm sucks<li>an off-by-one error</ul>
but there are no doubt other ways I could screw up. The whole point of testing is that I don't need to know in advance how I might have screwed up, the tests just show that I *have* screwed up.
<p>The question is, then, how to test that my output data is nice and random? I initially thought of using Jon Orwant's [cpan://Statistics::ChiSquare] module, but that has a couple of big drawbacks:
<ul><li>it thinks a coin that throws 500 heads followed by 500 tails is just fine and dandy;<li>it's limited to 21 discrete values because of the way it's implemented</ul>
The second of those is a headache that can be worked around. The first, however, is a showstopper. That test can't detect certain types of obvious bias. So, what I'm looking for is a module that:
<ul><li>can determine whether data is evenly and randomly distributed across its range, and is equally evenly distributed regardless of which part of the sample I look at (ie the first 20 values should be just as random as the next 100); and<li>can determine whether the data is at all predictable (ie can it detect that if the die rolls a 1 it's likely to roll a 4 three rolls later, or that if it rolls a 1 it won't roll a 1 next time)</ul>
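As an illustration of that first requirement, a Wald-Wolfowitz runs test catches exactly the 500-heads-then-500-tails case that a plain chi-squared test waves through, because it looks at ordering rather than just counts. A hand-rolled sketch (not taken from any CPAN module):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Wald-Wolfowitz runs test on a 0/1 sequence: a sequence with far too
# few (or far too many) runs is not random even when its overall head
# and tail counts look perfect. Illustrative sketch only.
sub runs_test_z {
    my @seq = @_;
    my $n1 = grep { $_ } @seq;        # count of 1s
    my $n0 = @seq - $n1;              # count of 0s
    my $n  = @seq;
    my $runs = 1;
    $runs++ for grep { $seq[$_] != $seq[ $_ - 1 ] } 1 .. $#seq;
    my $mu  = 1 + 2 * $n1 * $n0 / $n;
    my $var = 2 * $n1 * $n0 * (2 * $n1 * $n0 - $n)
            / ($n * $n * ($n - 1));
    return ($runs - $mu) / sqrt($var);  # roughly standard normal if random
}

# 500 heads then 500 tails: the counts are perfect, the order is absurd.
my @suspicious = ((1) x 500, (0) x 500);
my $z = runs_test_z(@suspicious);
# $z comes out tens of standard deviations below zero: decisive failure.
```

A similar trick with lagged pairs (tabulating how often value $a$ at position $i$ is followed by value $b$ at position $i+k$) would address the predictability requirement.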
<p>I'm not aware of anything on CPAN that can do that. An alternative would be - and we can do this because I'm only concerned about whether *I* am introducing bias, not with whether the data is biassed - to check that the distribution of my results is the same as the distribution of the original data. But I'm not aware of anything to do that either.
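That distribution comparison could be as simple as a chi-squared goodness-of-fit of my output counts against the source's proportions. A rough sketch (the function name and example data are invented for illustration; the statistic would still need comparing against a critical value for the right degrees of freedom):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Chi-squared goodness-of-fit: compare observed counts against expected
# proportions. Hypothetical sketch, not an existing CPAN interface.
sub chi_squared_stat {
    my ($observed, $expected_prop) = @_;  # hashrefs keyed by outcome
    my $total = 0;
    $total += $_ for values %$observed;
    my $chi2 = 0;
    for my $k (keys %$expected_prop) {
        my $exp = $expected_prop->{$k} * $total;
        my $obs = $observed->{$k} || 0;
        $chi2 += ($obs - $exp) ** 2 / $exp;
    }
    return $chi2;  # compare to chi-squared critical value, df = k - 1
}

# Example: 600 die rolls that came out perfectly uniform.
my %obs  = map { ($_ => 100) } 1 .. 6;
my %prop = map { ($_ => 1 / 6) } 1 .. 6;
my $chi2 = chi_squared_stat(\%obs, \%prop);  # essentially zero here
```

Of course this inherits the order-blindness I complained about above, so it would only be useful for the "same distribution as the source" check, alongside something order-aware.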
</readmore>
<p>So, can anyone point me at any appropriate modules? Or at an algorithm that I could turn into a module?