Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

A bad shuffle

by tlm (Prior)
on Mar 20, 2005 at 21:05 UTC ( #441060=perlmeditation: print w/ replies, xml ) Need Help??

I recently came across the following code for generating a "random permutation"1 of an array's elements.
sub naive_shuffle { my $n = my @p = @_; for my $i ( 0..$n-1 ) { my $j = int rand $n; @p[ $i, $j ] = @p[ $j, $i ]; # swap } return @p; }
It looks reasonable enough, if perhaps a bit zealous (we know that only N - 1 swaps are required, since fixing a permutation of N elements is a problem with N - 1 degrees of freedom; see "correct algorithm" below). But the real problem with this subroutine is that it does not sample the space of permutations uniformly.

The easiest way to see this is to realize that what this algorithm does sample uniformly is a space of possible execution paths, as determined by the values returned by the expression int rand $n at the various iteration steps. The size of this sample space is NN. To each element of this space corresponds a permutation, but the size of the space of all possible permutations is N! , which is not only smaller than NN for any N > 1, but more importantly, it is not a divisor of NN, which means that it is impossible for the algorithm to give equal weight to all the permutations. Some permutations will be generated more often than the others. For example, in the case of N = 3, there are 27 execution paths; of the 6 possible permutations of 3 elements, 3 of them are generated by 5 execution paths each, and the other 3 by 4 paths each. Hence the former are oversampled relative to the latter.

If this argument were not convincing enough, or if we are trying to assess a sampling algorithm for which a rigorous argument as to its correctness is not readily apparent, we can just run the algorithm many times, keeping a tally of the permutations generated, to get an idea of how even-handed the sampling was. The only problem with this approach is that the sampling noise puts a limit on the size of the deviations from uniformity that can be detected, and the cost of reducing this noise grows quadratically with the number of sampling runs. Be that as it may, this is the first thing to try, because it's trivial to implement.

If the lazy approach is not acceptable for whatever reason, an alternative is to modifify the algorithm to systematically follow all the possible execution paths, and keep a count of the permutations obtained in the process of traversing all paths.

To beat a dead horse, and to indulge in a free-association to a reference I made in another post, here's a "forking paths" strategy to pin down the algorithm's actual sampling. We modify naive_shuffle (taking great care not to alter the sampling) so that instead of generating one permutation, it generates all the possible permutations implied by the multiple evaluations of int rand $n.

The first modification is to "recursify" the loop:

# untested sub naive_shuffle_rec { my $i = shift; my $n = my @p = @_; my $j = int rand $n; @p[ $i, $j ] = @p[ $j, $i ]; # swap return ++$i < $n ? naive_shuffle_rec( $i, @p ) : @p; }
Here we rely on recursive calls and an additional "state argument" to handle the iteration of $i from 0 to $n-12.

The next step is more delicate. It involves, essentially, replacing the expression int rand $n with all the possible execution paths implicit in this expression. Here's one way to do it:

sub naive_shuffle_traverse { my $i = shift; my $j = shift; my $n = my @p = @_; @p[ $i, $j ] = @p[ $j, $i ] if defined $j; # swap if ( ++$i < $n ) { for my $j (0..$n-1) { naive_shuffle_traverse( $i, $j, $tally, @p ); } } else { return; } }
A call to
naive_shuffle_traverse(-1, undef, 1..3);
causes the program to traverse all possible execution paths of the original naive_shuffle( 1..3 ). The business with the -1 value for $i is because now the first call to the function is to set things up for $j.

The only remaining modification is an easy one: we need some way to have the routine keep track of the permutations visited so far. For this we use an additional state variable $tally:

sub naive_shuffle_count { my $i = shift; my $j = shift; my $tally = shift; my $n = my @p = @_; @p[ $i, $j ] = @p[ $j, $i ] if $j > -1; if ( ++$i < $n ) { for my $j (0..$n-1) { naive_shuffle_count( $i, $j, $tally, @p ); } } else { $tally->{ "@p" }++; return; } }
Now, executing
my %tally = (); naive_shuffle_count( -1, undef, \%tally, 1..3 );
results in the following contents for %tally:
DB<2> x \%tally 0 HASH(0x82d7f94) '1 2 3' => 4 '1 3 2' => 5 '2 1 3' => 5 '2 3 1' => 5 '3 1 2' => 4 '3 2 1' => 4
showing that with this algorithm some permutations are "more equal than others."

At any rate, the correct algorithm, if a monk may allow himself to be categorical about it3, is this one:

sub random_perm { my $n = my @p = @_; for ( my $i = $#p; $i > 0; --$i ) { my $j = int rand( $i + 1 ); @p[ $i, $j ] = @p[ $j, $i ]; } return @p; }
Note that the size of the space of possible execution paths for this algorithm is exactly N!.

the lowliest monk


1To put it more pedantic terms, this function is supposed to sample uniformly at random the space of all the permutations of an array's elements.

2As an aside, though this modification involves making the sub recursive, the algorithm is still iterative. This is because the subroutine is tail-recursive, which means that its space requirements, in principle at least, do not grow with the size of the input. For more on this see here, especially here.

3One of the many enduring coinages of the mathematician Pál Erdös was the phrase "straight from The Book", which was his way of pronouncing a mathematical proof as perfect, and which made playful reference to a book where god keeps all mathematical secrets. Well, if this book exists, it surely has a chapter or two for algorithms, and if so, the algorithm used by rand_perm above, which I learned from Knuth, surely is in there. BTW, the subject of candidates for the chapter on algorithm in The Book would be a fine one for another meditation.


Update: I found a page that gives a nice discussion of the algorithm used by naive_shuffle, where it is described as "short, elegant, clear, and wrong". Also, much more material here, although I object to this page's description of "Knuth's algorithm" (the one used by random_perm) as being "harder to code". This page does point out something that I failed to mention, namely that naive_shuffle gives correct results for the case of N = 2, although it works twice harder to achieve it than needed, like a man who, facing a choice of two things, flips a coin twice, disregards the first flip, and takes the outcome of the second.

Comment on A bad shuffle
Select or Download Code
Re: A bad shuffle
by Tanktalus (Canon) on Mar 20, 2005 at 21:36 UTC

    Your stats knowledge/understanding seems to be way above mine ... but wouldn't this be just as good:

    sub shuffle { my @in = @_; my @out; push @out, splice(@in, rand @in, 1) while @in; @out; }
    ? I always get kinda concerned about swapping, I don't know why, when a simple random ordering would suffice. But, I didn't really do that well in stats in university, and that was altogether too long ago to remember anyway. :-) Comparing speeds ...
    use strict; use Benchmark qw(cmpthese); my @cards = 1..52; cmpthese(-1, { shuffle => sub { shuffle(@cards) }, random_perm => sub { random_perm(@cards) }, }); sub random_perm { my $n = my @p = @_; for ( my $i = $#p; $i > 0; --$i ) { my $j = int rand( $i + 1 ); @p[ $i, $j ] = @p[ $j, $i ]; } return @p; } sub shuffle { my @in = @_; my @out; push @out, splice(@in, rand @in, 1) while @in; @out; }
    which gives:
    Rate random_perm shuffle random_perm 6000/s -- -30% shuffle 8606/s 43% --
    which says that if random_perm is the optimal pseudo-random shuffler, then my code must be suboptimal in that it's skipping something in making it random. For the life of me, I can't think of what.

    Update: Changed the benchmark to actually test a real shuffle (woops) - thanks to anonymonk below. Anonymonk points out that this doesn't scale to 52000, which is quite valid for those who need to sort "large" amounts. For those who are only sorting a deck of cards (a very popular subset of shuffling), this may be sufficient ;-)

      Your algorithm looks perfectly fine to me. random_perm is optimal in the limited sense that it achieves a random permutation in place with the miniminum number of swaps. (You can't do this problem in place without swapping.) It is also nice in that its implementation varies little from language to language, and in that respect it is pretty general. In contrast, to implement the algorithm like the one you posted in C is not straightforward; Perl does a lot behind the scenes to allow one to write splice(@in, rand @in, 1) while @in.

      That said, I confess that I am quite surprised by the benchmarks. Then again, this is not by any means the first time that intuitions from C programming as to the efficiencies of various operations turn out to be waaaay off when the problem is translated to Perl. Your post is a good reminder to benchmark, benchmark, benchmark. And just to act on this reminder, I decided to benchmark against your original a version of shuffle that omits the last call to splice:

      sub shuffle_2 { my @in = @_; my @out; push @out, splice(@in, rand @in, 1) while @in > 1; push @out, @in; @out; }
      And once again, my intuition is way off:
      Rate shuffle_2 shuffle shuffle_2 5856/s -- -5% shuffle 6140/s 5% --

      BTW, I should clarify that I don't recall the real name of the algorithm used by random_perm, if it actually has a name.

      Update: As pointed out by several others, the algorithm used by random_perm goes by the name of Fisher-Yates. I've also found it referred to as "Knuth's shuffle."

      the lowliest monk

      Considering that shuffle and random_perm are called without arguments, and are hence shuffling empty lists, your benchmark doesn't show anything interesting. If you change the @_ in both subs to @cards, I get the following results:
                    Rate random_perm     shuffle
      random_perm 6796/s          --        -32%
      shuffle     9955/s         46%          --
      
      It may appear that shuffle is faster. But look what happens when we shuffle 52000 cards instead of 52:
                  s/iter     shuffle random_perm
      shuffle       1.02          --        -85%
      random_perm  0.158        550%          --
      
      It's well known that a shuffle based on splice doesn't scale, due to its quadratic behaviour.

      Never use a splice based shuffle. For every millisecond you will gain if you shuffle a small list you'll pay a second when shuffling a large list.

      For those who are only sorting a deck of cards (a very popular subset of shuffling), this may be sufficient ;-)
      Well, about all you gain is 50 microseconds. The 'shuffle' sub isn't significant smaller or easier to understand than 'random_perm'. They are both four lines of code, not counting lines consisting of just closing braces.

      There isn't a real argument to not choose a solution that scales. If 50 microseconds are important to you, you should have used List::Util::scalar in the first place.

Re: A bad shuffle
by Zaxo (Archbishop) on Mar 20, 2005 at 22:39 UTC

    The naive_shuffle() you object to is Fischer-Yates, which is pretty well studied. I think you misapprehend the effect of the algorithm.

    While it samples a permutation of the elements, it does not do so by selecting from a list of all permutations, and execution paths have nothing to do with it.

    After Compline,
    Zaxo

      Thank you for the information. Perhaps I should have made clearer in that my encounter with this algorithm, which motivated the whole write up, was in research-oriented/scientific code in which it was being used to sample permutations of an array uniformly at random. My contention is that this is a misuse of this algorithm, because it does not sample the space of permutations uniformly.

      But in light of what you write about the algorithm's standing and pedigree, the title of my meditation is an awful one. Maybe the whole node should be retracted for the sake of not confusing others. Let me know what you think.


      Update: I found a version of Fisher-Yates online (linked from here):

      #include <stdlib.h> void shuffle(int *array, size_t n) { if (n > 1) { size_t i; for (i = 0; i < n - 1; i++) { /**/ size_t j = i + rand()/(RAND_MAX / (n - i) + 1); /**/ int t = array[j]; array[j] = array[i]; array[i] = t; } } }
      This is equivalent to the algorithm used by my random_perm, not the one used by naive_shuffle. To get the latter algorithm, the lines indicated with /**/ above would have to be changed to:
      for (i = 0; i < n; i++) { size_t j = rand()/(RAND_MAX / n);
      the lowliest monk

        No, don't withdraw it. I think the thing to do is try and figure out the difference between a "fair shuffle" and a uniformly distributed selection over permutations.

        Are the two the same if there are identical cards in the deck? What is the problem that motivated this?

        Sometimes confusion, like greed, is good.

        After Compline,
        Zaxo

      Actually, the original "naive_shuffle" is not a Fisher-Yates shuffle, it is a "naive shuffle" implementation. The OP's analysis is correct.

      His final algorithm, however, is a correct Fisher-Yates implementation.

        perlfaq4: How do I shuffle an array randomly? not only gives a "canonical" implementation of the Fisher-Yates algorithm in Perl, but also refers to the List::Util module's shuffle function, which is an implementation of Fisher-Yates in C. The algorithm is also implemented in Abigail's Algorithm::Numerical::Shuffle, the doco of which gives some citations into the literature (Knuth, Fisher&Yates, etc.).

        The Fisher-Yates has also been discussed many times on clpm. (You can do a Google Groups search. My ego compels me to link to this posting by yrs trly, which isn't about F-Y, but uses it.)

      No, you're wrong. It looks like Fisher-Yates, but there's a slight difference, in that all his array items can move again on every loop. His "correct algorithm" is actually Fisher-Yates, where per loop, one item gets moved into its final position.
        Bart,

        That's what I said ;-)

        Please re-read the node you are replying to.

        I said the "original ... is not a Fisher-Yates..." and "The final is a correct Fisher-Yates..." - which is exactly what you are asserting now.

        Sorry Bart, I thouht you were replying to me...
Re: A bad shuffle
by chas (Priest) on Mar 21, 2005 at 05:08 UTC
    "But the real problem with this subroutine is that it does not sample the space of permutations uniformly."
    It is certainly true that the method described initially does not sample the space of permutations uniformly because the permutations utilized are all a product of $n transpositions, so in particular they all have the same parity (even if $n is even, etc). Zaxo's comment that there may be some difference between a "fair shuffle" and a uniformly distributed selection over permutations" is perceptive. A shuffle might be considered "fair" if it tends to remove specific orderings that exist (e.g. in playing cards, one wants to nullify the effect of various groupings that have occurred in previous hands.) On the other hand, if one wants to model what actually occurs in a sequence of "random" shuffles, then restricting to permutations of fixed parity seems objectionable. It's a bit hard to define what "randomizing" shuffle means; Perci Diaconis (a highly respected mathematician) has studied this and written some good articles on the subject which should be found easily by web search.
    chas
      Shuffling is a classical problem in the algorithms literature. I've encountered the problem many times, and a "fair shuffle" always means that any permutation of the input list has an equal chance of being the outcome. With other words, there's no difference between "fair shuffle" and "uniformly distributed selection over permutations".

      Only you like to play word games, and start redefining well known terms.

        Well, that is a way to define "fair", but my initial comment was that *that* doesn't hold for the first method presented since each of the possible permutations has the same parity.
        (The problem of a "random" shuffle is a real one and not word play at all. Consider a dealer shuffling cards in a poker game; how should he do it to maximize the randomness of the result - should he shuffle three times, 7 times, ...? Should the shuffles be somewhat careless (so the cards son't interleave perfectly? (Yes to latter!) Here the idea is that there will be some fixed shuffling routine (e.g. shuffle, shuffle, box, shufffle), but one wants the result to be sufficiently randomized. This isn't precisely the same thing as what the OP asked, but is often what is actually desired when one discusses the question of a "fair" or "random" shuffle. This seemed to be alluded to in one of Zaxo's replies, and as I mentioned it has been analyzed in the liturature.)
        chas
Why reinvent the wheel?
by DentArthurDent (Monk) on Mar 21, 2005 at 16:59 UTC
    I've done some research on combinations and permuations in perl, and after a bit of hunting, I found:

    Math::Combinatorics

    This should do exactly what you want. I must say however, that I've never used it, so I don't know about it's relative speed and so on.
    ----
    My mission: To boldy split infinitives that have never been split before!
Re: A bad shuffle
by itub (Priest) on Mar 21, 2005 at 17:41 UTC
    Note that in practice, even with a perfect shuffling algorithm not all permutations will have equal probability when using a typical rand() function. The problem is that your typical rand() function uses an integer as a seed, and the number of possible seeds is tiny compared to the number of permutations for "large" lists (where "large" might mean something greater than 12).

    For example, if you are shuffling 52 items, you have 52! or about 10^68 permutations. Compare that to the roughly 10^10 values that are possible for a 32-bit integer!

      Fair enough. But people might be using /dev/random or /dev/urandom to generate their random numbers.
Re: A bad shuffle
by QM (Vicar) on Mar 24, 2005 at 00:28 UTC
    The size of this sample space is NN. To each element of this space corresponds a permutation, but the size of the space of all possible permutations is N! , which is not only smaller than NN for any N > 1, but more importantly, it is not a divisor of NN, which means that it is impossible for the algorithm to give equal weight to all the permutations.
    If I'm not mistaken, N! | NN, as N! = 1*2*...*N, and NN = N*N*...*N. Therefore, NN/N! = NN-1/(N-1)!

    How does this affect your comment?

    Update: Oops, I was mistaken.

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      How does this affect your comment?

      It doesn't. The important point is that (N**N mod N!) != 0 for N > 2, i.e., N**N balls can not be evenly distributed among N! buckets.

      NN-1/(N-1)! is not an integer for any integer N > 2, so it is not the case that N! | NN, except for N=1 and N=2.

      the lowliest monk

      Update: Here's a proof of the assertion made above. I'm sure there are better proofs of it, but this is the best I could come up with.

      Assume that N > 2, and let p be the largest prime in the prime factorization of N. There are three cases to consider. Suppose first that p is 2. Then, by assumption, N is a power of 2 greater than or equal to 4. Therefore, 3 is a factor of N!, and consequently N! does not divide NN. Next, suppose that p > 2. If N = pk for some nonnegative integer k, then N is odd and not divisible by N! whose prime factorization includes 2. This leaves the case in which p > 2, and is not the sole prime factor of N. In this case N >= 2 p. By Bertrand's postulate there exists a prime q such that p < q < 2 p <= N. Therefore, there is a factor of N!, namely q, that does not divide Ns, for any positive integer s. Therefore, N! does not divide NN, for all N > 2.

      If I'm not mistaken, N! | NN, as N! = 1*2*...*N, and NN = N*N*...*N. Therefore, NN/N! = NN-1/(N-1)!
      The fact that NN and N! share a factor doesn't mean that N! is a divisor of NN, as the following table shows:
       N                 NN                 N!             NN % N!
       1                  1                  1                  0
       2                  4                  2                  0
       3                 27                  6                  3
       4                256                 24                 16
       5               3125                120                  5
       6              46656                720                576
       7             823543               5040               2023
       8           16777216              40320               4096
       9          387420489             362880             227529
      10        10000000000            3628800            2656000
      11       285311670611           39916800           26301011
      12      8916100448256          479001600          443667456
      13    302875106592253         6227020800         5268921853
      14  11112006825558016        87178291200          294332416
      15 437893890380859375      1307674368000       820814907375
      
        My bad. Somehow the coffee fairy didn't stop by me, and I'm confusing "share a factor" with "evenly divides".

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://441060]
Approved by Joost
Front-paged by Old_Gray_Bear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others having an uproarious good time at the Monastery: (5)
As of 2014-07-31 04:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (244 votes), past polls