|Problems? Is your data what you think it is?|
A bad shuffleby tlm (Prior)
|on Mar 20, 2005 at 21:05 UTC||Need Help??|
I recently came across the following code for generating a "random permutation"1 of an array's elements.
It looks reasonable enough, if perhaps a bit zealous (we know that only N - 1 swaps are required, since fixing a permutation of N elements is a problem with N - 1 degrees of freedom; see "correct algorithm" below). But the real problem with this subroutine is that it does not sample the space of permutations uniformly.
The easiest way to see this is to realize that what this algorithm does sample uniformly is a space of possible execution paths, as determined by the values returned by the expression int rand $n at the various iteration steps. The size of this sample space is NN. To each element of this space corresponds a permutation, but the size of the space of all possible permutations is N! , which is not only smaller than NN for any N > 1, but more importantly, it is not a divisor of NN, which means that it is impossible for the algorithm to give equal weight to all the permutations. Some permutations will be generated more often than the others. For example, in the case of N = 3, there are 27 execution paths; of the 6 possible permutations of 3 elements, 3 of them are generated by 5 execution paths each, and the other 3 by 4 paths each. Hence the former are oversampled relative to the latter.
If this argument were not convincing enough, or if we are trying to assess a sampling algorithm for which a rigorous argument as to its correctness is not readily apparent, we can just run the algorithm many times, keeping a tally of the permutations generated, to get an idea of how even-handed the sampling was. The only problem with this approach is that the sampling noise puts a limit on the size of the deviations from uniformity that can be detected, and the cost of reducing this noise grows quadratically with the number of sampling runs. Be that as it may, this is the first thing to try, because it's trivial to implement.
If the lazy approach is not acceptable for whatever reason, an alternative is to modifify the algorithm to systematically follow all the possible execution paths, and keep a count of the permutations obtained in the process of traversing all paths.
To beat a dead horse, and to indulge in a free-association to a reference I made in another post, here's a "forking paths" strategy to pin down the algorithm's actual sampling. We modify naive_shuffle (taking great care not to alter the sampling) so that instead of generating one permutation, it generates all the possible permutations implied by the multiple evaluations of int rand $n.
The first modification is to "recursify" the loop:
Here we rely on recursive calls and an additional "state argument" to handle the iteration of $i from 0 to $n-12.
The next step is more delicate. It involves, essentially, replacing the expression int rand $n with all the possible execution paths implicit in this expression. Here's one way to do it:
A call to
causes the program to traverse all possible execution paths of the original naive_shuffle( 1..3 ). The business with the -1 value for $i is because now the first call to the function is to set things up for $j.
The only remaining modification is an easy one: we need some way to have the routine keep track of the permutations visited so far. For this we use an additional state variable $tally:
results in the following contents for %tally:
showing that with this algorithm some permutations are "more equal than others."
At any rate, the correct algorithm, if a monk may allow himself to be categorical about it3, is this one:
Note that the size of the space of possible execution paths for this algorithm is exactly N!.
the lowliest monk
1To put it more pedantic terms, this function is supposed to sample uniformly at random the space of all the permutations of an array's elements.
2As an aside, though this modification involves making the sub recursive, the algorithm is still iterative. This is because the subroutine is tail-recursive, which means that its space requirements, in principle at least, do not grow with the size of the input. For more on this see here, especially here.
3One of the many enduring coinages of the mathematician Pál Erdös was the phrase "straight from The Book", which was his way of pronouncing a mathematical proof as perfect, and which made playful reference to a book where god keeps all mathematical secrets. Well, if this book exists, it surely has a chapter or two for algorithms, and if so, the algorithm used by rand_perm above, which I learned from Knuth, surely is in there. BTW, the subject of candidates for the chapter on algorithm in The Book would be a fine one for another meditation.
Update: I found a page that gives a nice discussion of the algorithm used by naive_shuffle, where it is described as "short, elegant, clear, and wrong". Also, much more material here, although I object to this page's description of "Knuth's algorithm" (the one used by random_perm) as being "harder to code". This page does point out something that I failed to mention, namely that naive_shuffle gives correct results for the case of N = 2, although it works twice harder to achieve it than needed, like a man who, facing a choice of two things, flips a coin twice, disregards the first flip, and takes the outcome of the second.