Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?

Testing for randomness

by DrHyde (Prior)
on Oct 23, 2003 at 14:01 UTC ( #301587=perlquestion: print w/ replies, xml ) Need Help??
DrHyde has asked for the wisdom of the Perl Monks concerning the following question:

I recently uploaded Net::Random to the CPAN. It gathers data from a couple of online sources of truly random data (which I trust to really *be* random, that's not the issue here), and uses that to generate random numbers in the user's chosen range. For instance, you might want a bunch of random 0s and 1s to simulate tossing a coin, or random numbers from 1 to 6 to simulate a die roll.

Given that I trust the original data to be random, I still need to be sure that what I'm doing to the data isn't biassing it.

Such bias could be introduced in various ways, the two I can think of off the top of my head are:
  • my algorithm sucks
  • an off-by-one error
but there are no doubt other ways I could screw up. The whole point of testing is that I don't need to know in advance how I might have screwed up, the tests just show that I *have* screwed up.

The question is, then, how to test that my output data is nice and random? I initially thought of using Jon Orwant's Statistics::ChiSquared module, but that has a couple of big drawbacks:

  • it thinks a coin that throws 500 heads followed by 500 tails is just fine and dandy;
  • it's limited to 21 discrete values because of the way its implemented
The second of those is a headache that can be worked around. The first, however, is a showstopper. That test can't detect certain types of obvious bias. So, what I'm looking for is a module that:
  • can determine whether data is evenly and randomly distributed across its range and is equally evenly distributed regardless of which part of the sample i look at (ie the first 20 values should be just as random as the next 100); and
  • can determine whether the data is at all predictable (ie can it detect that if the die rolls a 1 it's likely to roll a 4 three rolls later, or if it rolls a 1 it won't roll a 1 next time)

I'm not aware of anything on CPAN that can do that. An alternative would be - and we can do this because I'm only concerned about whether *I* am introducing bias, not with whether the data is biassed - to check that the distribution of my results is the same as the distribution of the original data. But I'm not aware of anything to do that either.

So, can anyone point me at any appropriate modules? Or at an algorithm that I could turn into a module?

Comment on Testing for randomness
Re: Testing for randomness
by gjb (Vicar) on Oct 23, 2003 at 14:59 UTC
Re: Testing for randomness
by hardburn (Abbot) on Oct 23, 2003 at 15:02 UTC

    While there are a few things you can do to check for randomness (such as looking for an even statistical distribution), a comprehensive randomness test is incomputable.

    I was researching this very subject a while back. IIRC, a Google search for "randomness test" came up with a few good responses. It might also be worth searching the sci.crypt newsgroup archives.

    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    :(){ :|:&};:

    Note: All code is untested, unless otherwise stated

Re: Testing for randomness
by jdtoronto (Prior) on Oct 23, 2003 at 15:15 UTC
    There was available from Hans Leeb at The University of Salzberg a RNG tester programme. I don't think it was ever on their site, but you emailed Leeb and he would send it to you. It is some years since I last spoke to him, but I am sure he would be pleased to help. A student of mine used it to validate some simulation work he did.

    It would appear that Hans is not at the University of Vienna, here is his current web page


Re: Testing for randomness
by bl0rf (Pilgrim) on Oct 23, 2003 at 15:16 UTC
    I dont know much about randomness, but I suggest you take
    a look at this article and in the reference section
    you will find ( what I think to be ) a program for
    visually modelling data. The visual models provide
    insight into how well data is distributed.

    My site

Re: Testing for randomness
by eric256 (Parson) on Oct 23, 2003 at 15:17 UTC

    500 heads followed by 500 tails, realy could happen in truly random data. I think thats part of why testing for such is impossible. It would be far better to show what algorithm you are using, so that we could then invistigate it for obvious bias. Just my 2 cents. defines random as - Of or relating to a type of circumstance or event that is described by a probability distribution.

    Given that definition you could just pick 1000 numbers and then make sure that they present a distribution that you expect. Good luck.

    Eric Hodges
      Yes 500 heads followed by 500 tails could come from a fair coin. It's pretty damned unlikely to be random though, yet the example in Jon Orwant's module's documentation says that such an outcome is over 90% likely to be from a random source!

      The suggestion someone else put forward of instead of putting individual values into the buckets for the test to put sequences of values in would solve that.

Re: Testing for randomness
by zakzebrowski (Curate) on Oct 23, 2003 at 15:51 UTC
    If you're looking for a non-scientific but warm fuzzy approach, I would generate about, oh, 100 files with n runs, with each character representing 8 runs. (So you get an ascii character). Then, I would make a copy of the file and gzip it. Compare the sizes of the orignal and new file, and the better job the compression does, the less randomness it is... (This approach was sort of discussed a long time ago in perl monks about checking for randomness of passwords...)

    undef$/;$mmm="J\nutsu\nutss\nuts\nutst\nuts A\nutsn\nutso\nutst\nutsh\ +nutse\nutsr\nuts P\nutse\nutsr\nutsl\nuts H\nutsa\nutsc\nutsk\nutse\n +utsr\nuts";open($DOH,"<",\$mmm);$_=$forbbiden=<$DOH>;s/\nuts//g;print +;
      This is not a great idea. It's perfectly possible to have highly non-random data that produce worst-case performance in a given compression algorithm. You'd be better off with statistical tests (i.e. chi square).
Re: Testing for randomness
by Anonymous Monk on Oct 23, 2003 at 17:06 UTC

    I have several suggestions:

    1. Do not try the compression-based attempt listed above, it isn't likely to work
    2. You can't tell from a single test if a stretch of 500 heads followed by 500 tails is non-random: you'd have to permute or at the very least replicate
    3. You can (should!) test higher-order statistics. Right now, you perform a chi-squared distributional test on 0-th order statistics (number of heads, number of tails). Instead, perform that test on the number of pairs (e.g. 1st order statistics): HH, HT, TH, TT. You can extend that to higher-orders if you are really interested, but first order should be enough to find bias, and easy to implement.

    So in summary, I recommend you perform multiple (100s) tests on 0th, 1st, and 2nd order statistics.

      You can't tell from a single test if a stretch of 500 heads followed by 500 tails is non-random: you'd have to permute or at the very least replicate

      The 500H,500T sequence would be extremely good evidence of a non-random generation. By my reckoning, in any one trial of 500 tosses, you have about 1 chance in 100,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 (== 10**50) of throwing all heads or all tails.

      The odds of throwing the sequence are therefore about 1 in 10**100 - a googol. If that is not a good enough result, how would replicating the result help? You just change the odds to 1 in 10**200.

Re: Testing for randomness
by snax (Friar) on Oct 23, 2003 at 18:44 UTC
    The gold standard for testing randomness is the Diehard suite of tests, which is some stuff that George Marsaglia put together in the 90s at Florida State. The maintenance of this suite leaves a little to be desired, and it seems that development ended abruptly, but the suite is still good. It is designed to work on 32 bit ints in binary files, though, so you'll have to do some work to make it fit, but the routines do all sorts of tests for randomness you might not think of (runs, high/low bit randomness, some serial correlation checks, and so forth). There are 16 tests in all.

    At the very least, I suggest you download the Diehard source and read the file tests.txt to get an idea of the kinds of things you could do.

Re: Testing for randomness
by bart (Canon) on Oct 23, 2003 at 20:56 UTC
    Just a thought... if you were to listen to these samples as audio, what you should be hearing is pure, constant noise. The color/sound of the noise is an indication of the interindependence of the samples. I think what you should hear, for independent samples, is white noise — all linear frequency ranges represented equally strong.

    Perhaps you could use FFT or similar, on a large enough sample, to figure out the spectrum of the sample?

Re: Testing for randomness
by OverlordQ (Hermit) on Oct 24, 2003 at 02:58 UTC
    Check out my node here although not exactly the same as your question, I also was looking for information on testing (P)RNGs
      Thanks, I'll check out ent as well as diehard.
Re: Testing for randomness
by jonadab (Parson) on Oct 24, 2003 at 03:40 UTC

    One major issue with testing for randomness is the matter of sample size. You seem to be implying that randomness requires an even distribution, but this is incorrect, at least with a small-to-moderate sample size; in fact, if your module *guarantees* an even distribution, then it is absolutely not properly random. Random data will *converge* on an even distribution as the sample size increases, but the sample size often needs to be quite large before the distribution is really very even. If you roll a ballanced die a hundred times, it is quite likely that some numbers will be significantly better represented than others. If you roll it a thousand times, the distribution will seem a bit more even, but it still won't be fully even, usually, not even enough that you can be sure about the die; you need to roll the die ten thousand or a hundred thousand times to really be sure whether the die is ballanced.

    500 heads followed by 500 tails is an extreme, but a ballanced coin may very well throw heads 600 out of the first thousand times. That doesn't imply it's necessarily weighted 60% toward heads; another time the same coin tossed in the same way might throw tails 600 out of a thousand times. In fact, if it threw heads *exactly* 500 out of the first thousand times, I'd suspect it was rigged or the data falsified. 537 times out of a thousand would be *MUCH* more likely. 600 is a tad bit high, but still quite possible. If you really need to test the ballance of the coin, you need to throw it another fifty thousand times or so.

    Now, if you were testing a local random number generator, this wouldn't be any big deal; let it run a million times in two and a half minutes. However, since you're getting your random data from the internet, this may be an issue. As others have suggested, you may want to examine your algorithm for bias.

    $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
Re: Testing for randomness
by cbraga (Pilgrim) on Oct 24, 2003 at 04:03 UTC
    I can't believe no one mentioned this yet.

    <sig>ESC[78;89;13p ESC[110;121;13p</sig>
      Um, why?

      I had never come across it before. But it sure looks interesting. The cited references on this page and on the mathematics pages of the same site add further weight to the need for Knuth in all his forms (as Computer Scientist and Mathematician) in college libraries.


Re: Testing for randomness (spectral testing)
by BrowserUk (Pope) on Oct 24, 2003 at 07:09 UTC

    You might find this technique interesting.

    #! perl -slw use strict; use vars qw[ $X $Y $N $D ]; use GD; $|=1; #use constant { SEED => 0, A => 65537, C => 65539, M => 65536 }; # Rea +lly BAD!! #use constant { SEED => 0, A => 32717, C => 32719, M => 65536 }; # Bad + enough!! my $rnd = SEED; sub badRnd{ $rnd = (A * $rnd + C) % M; $rnd % $_[0] } $X ||= 1024; $Y ||= 1024; $N ||= 100_000_000; $D ||= 10; GD::Image->trueColor( 1 ); my $img = GD::Image->new( $X, $Y ); my $color =0; my $incr = 1; $img->fill( 0, 0, 0xffffff ); for( 0 .. $N ) { # my( $x, $y ) = ( int badRnd( $X ), int badRnd( $Y ) ); # printf "[$x:$y] "; my( $x, $y ) = ( int rand( $X ), int rand( $Y ) ); $img->setPixel( $x, $y, $img->getPixel( $x, $y ) + 1 ); printf "\r$_ " unless $_ % (1000); display() unless $_ % ($X * $Y * $D); } sub display { open IMG, '>plot.png' or die $!; binmode IMG; print IMG $img->png; close IMG; system( 'start /b plot.png' ); }

    What this does is plots pairs of random numbers on an X-Y grid and increments the color value at the pixel plotted. If the PRNG is a good one, the color will tend to increase linearly and evenly over time. However, if the generator is a bad one, then the colour will tend to clump and string. With really bad generators (as with the two Linear congruential generators commented out in the code), the clumping will slowly form a lattice work where the sequence starts to repeat.

    This implementation is fairly crude, but does demonstrate the technique quite vividly. The nice thing is that you can test for the interaction between the natural periodicity of the generator, and the reduction in periodicity that occurs when the sequence is used modulo some number as with typical usage.

    Running this against perl rand function, at least the one that comes with AS 802, for 100 million iterations modulo 1000, and it shows it to be a pretty good generator with a very low level of clumping.

    I originally saw this method (spectral testing) performed on a very powerful 3d workstation, with a dynamically adjustable alpha setting that allowed you to 'see through' the cube of plotted data and see how the latticing appears throughout in regular striations.

    I wish I could have posted a couple of comparative outputs here, but you can get a feel for how the technique works from the graphics on this page. The output from the program above looks remarkable similar to the diagrams about halfway down that page.

    The display() sub above is coded to automatically display the .png produced every so often, (controlled by the -D=N), under Win32 assuming that you have a picture viewer associated with the .png extension.

    Update: For some instant gratification, try running the program with the settings:

    P:\test>test -X=256 -Y=256 -D=1

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail

Re: Testing for randomness
by PERLscienceman (Curate) on Oct 25, 2003 at 12:11 UTC
    So, can anyone point me at any appropriate modules? Or at an algorithm that I could turn into a module?

    Greetings Fellow Monk!
    I did some digging on CPAN and found a few related modules that I thought that you might find of interest/use. I have listed a few that immediately caught my eye below along with thier brief CPAN description:
    In addition, a general CPAN Search on the term random yields some interesting possibilities.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://301587]
Approved by broquaint
Front-paged by broquaint
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (5)
As of 2014-12-21 01:30 GMT
Find Nodes?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?

    Results (100 votes), past polls