|Just another Perl shrine|
the -other- shuffling questionby mstone (Deacon)
|on Jun 13, 2005 at 02:55 UTC||Need Help??|
mstone has asked for the
wisdom of the Perl Monks concerning the following question:
Does anyone know of a good analysis on shuffled random number generators?
I'm not talking about things like the Fisher-Yates shuffle. I'm talking about the idea mentioned -- almost as an afterthought -- in Seminumerical Algorithms: you seed the output of an RNG into an array, then use the RNG's sequence to select and replace items from the array. It does wonders for the RNG's period, as well as improving its spectral behavior.
Sample code is below the fold.
To demonstrate the idea, let's start with a crappy little LCRNG whose behavior is easy to see:
Its period is 8, and the output sequence is: 1 6 7 4 5 2 3 0. Larger generators work exactly the same way, they just use bigger values for $M and $A.
LCRNGs are the workhorses of non-crypto random number generation, but they're too weak for applications that need to be secure. It's too easy to extract $M and $A from a sequence of output values. If you take the items N at a time and graph them in an N-dimensional space, the resulting points tend to fall 'on the lines'. The two-dimensional graph for the sequence above looks like this, for instance:
To shuffle the sequence, we feed it through an array like so:
The double-assignment there is basically a one-step Fisher-Yates shuffle.. we replace $Y with an item from the un-shuffled part of @T (i.e.: all of it), and replace the selected item with a new value from lcrng().
This new sequence performs.. quite a bit better than lcrng(). There's a little turbulence as the array loads itself, then the output runs about 2200 numbers before settling into a periodic sequence 60,440 numbers long. And here's the graph for 16 numbers from the periodic sequence:
It's much less tidy than the previous version.
If we increase $M to 16, the period of the repeating sequence weighs in at 55,470,480. Not bad for what are arguably two of the worst RNGs out there.
Thing is, those numbers fall far below the theoretical capacity of the system. We have $M possible values coming out of lcrng(). We have $M stored values in the array, and another $M possible values stored in $Y. In theory, that means there are $M^($M+1) possible configurations for the system. For $M=8, that means 8^9 (2^27) (134,217,728) is the longest possible period, and for $M=16 it's 16^17 (2^68) (2.95e20).
Obviously, there are configurations this version can never reach. Given the short period of lcrng() for $M=8, there's no way to reach the configuration:
and there will also be inefficiencies as some configurations fall into orderly patterns which don't use the entire array.
Thing is, I've only been able to find a handful of papers that even mention shuffling, and none of those offer anything in the way of analysis. Given the huge boost to performance and the low cost of implementation, I should think someone has devoted some head-time to the subject.
Does anyone know of any resources on the subject, or is this just one of the dead-ends that no one has ever bothered to consider?