 good chemistry is complicated,and a little bit messy -LW PerlMonks

### Re^4: Functional shuffle

by tall_man (Parson)
 on Apr 02, 2005 at 17:00 UTC ( #444397=note: print w/replies, xml ) Need Help??

in reply to Re^3: Functional shuffle

I'm basing my objection to the sort-based shuffle on this note in the paper that Roy Johnson referenced:

Furthermore, if we have a sequence of N elements and associate with each element a key -- a random number uniformly distributed within 0, M-1 (where N!>M>=N), we have the configurational space of size M^N (i.e., M^N ways to key the sequence). There are N! possible permutations of the sequence. Alas, for N>2 and M<N!, M^N is not evenly divisible by N!. Therefore, certain permutations are bound to be a bit more likely than the others.

I would further argue that even when M (the period of the psuedo-random number generator) is larger than N, M^N is still not divisible by N! in general. The Fisher-Yates shuffle applies decreasing probabilities as it moves down the deck; the sorting shuffle uses uniform ones. That is the key difference.

I believe the discussion under A bad shuffle is also applicable.

Replies are listed 'Best First'.
Re^5: Functional shuffle
by tlm (Prior) on Apr 02, 2005 at 17:38 UTC

Strictly speaking this is correct, but similar arguments apply in one way or another to any shuffling algorithm that relies on a PRNG, since the number of possible shuffled configurations can easily exceed the period of the PRNG, and even when it doesn't, the size of the set of possible numbers (2k, where k is the number of bits available to represent these numbers) generated by the PRNG won't be evenly divisible by N!. The argument I made in my previous post was based on the simplifying assumption that the indexes used are real numbers in the unit interval (as opposed to, what is in fact the case, members of a large but finite set of rationals). In this case there is zero probability that in a set of N > 1 randomly chosen numbers in [0, 1] two will be equal, and therefore the problems introduced by the fact that the sorting is stable become irrelevant. True, in fact the probability that any two such numbers generated by a PRNG are equal is not quite zero, but it is very small.

the lowliest monk

There a large difference between the fairness of the algorithm, and fairness of any implementation with practical constraints.

A sort-based shuffle is an algorithm that is not fair. It's rotten at the root. An implementation that suffers from an inperfect random generator has fairness problems due to environmental constraits.

Don't confuse the two things.

Re^5: Functional shuffle
by tlm (Prior) on Apr 02, 2005 at 20:46 UTC

The argument in the cited article apply to any shufling algorithm that uses anything other than random binary 2k-ary choices (in a program running on a binary-representation computer). This is true even if the numbers are generated by an ideal uniform random process (as opposed to a PRNG), and then stored in computer memory (necessarily truncated to a finite precision). In other words, the problem is deeper than the one caused by the finite periods of PRNGs. Consider applying Fisher-Yates to sorting an array of 3 elements. The first step requires selecting one of three options at random, and is done by testing whether a "random" number (or rather, a rational number obtained from truncating the precision of a randomly generated real number) stored in some computer register is within certain bounds or not. The cardinality of the set of such numbers is 2w, where w is the number of bits in the register, and therefore it is impossible that each of the three elements in the list will be chosen with probability of exactly 1/3. With the exception of the very last one of those in which the element to swap is chosen from among 2k alternatives, for some integer 0 < k < w, all the swaps in Fisher-Yates (under the conditions stated) result from a non-uniform sampling.

The question remains of whether the magnitude of this deviation from perfect uniformity is one that can significantly impair the intended use of the algorithm, and the answer of course depends on the application. In the case of the example above, the magnitude of the relative error grows as N/2w, so I imagine that a simulation that relies on a uniform sampling of the space of all rearrangements of a large number of elements may have to start worrying about this effect.

I reiterate that there is a fundamental difference between the flawed algorithms mentioned in When the Best Solution Isn't, and those that are based on sorting a set of pseudo-random number tags (like the one I posted). With the former, the deviation from the correct distribution can be substantial, and would occur even if one could use infinite precision, perfectly uniformly sampled random numbers, whereas with the latter this deviation is no bigger than it is for any shuffle algorithm that uses anything other than random binary 2k-ary choices (and of course, would disappear if the random numbers were of infinite precision). Therefore, the flaws in the former algorithms are logical flaws independent of numerical errors.

Update: There's a small inaccuracy in the argument I presented above, but correcting it does not affect its gist. Above I say that uniform sampling can occur only when the number of choices is 2; this is incorrect; it can occur only when the number of choices is 2k, for some integer 0 < k < w, where w is as defined above.

the lowliest monk

Actually, selecting with an exact probability of 1/3 (or any other rational fraction less than one) is feasible with even a source of single random bits. I saw the technique described once in "Mathematical Games" in Scientific American once.

Represent 1/3 as a binary fraction: .01010101... and generate random numbers bit by bit, starting from the decimal point, calling this a new binary fraction. If the next bit we generate is 1 where there is a 0 in the target, then quit -- we are over 1/3. If the next bit is 0 where there is a 1 in the target, then quit -- we are under 1/3 and we can accept the case. If it's equal, keep generating bits. The probability we will have to continue adding bits halves with each bit.

This approach can make the Fisher-Yates shuffle arbitrarily accurate. It would be possible, but messy, to apply it to sort values that compared equal, too. With this enhancement both shuffles should be fair, but Fisher-Yates wins by being O(N) instead of O(NlogN).

This approach can make the Fisher-Yates shuffle arbitrarily accurate. It would be possible, but messy, to apply it to sort values that compared equal, too.

I fail to see the difference. Certainly one can make any numerically-limited algorithm "arbitrarily accurate" if one is willing to increase the number of bits used to represent the numbers in the calculation. The variation of Fisher-Yates that you propose would require a special check to handle the highly improbable case that the random number generator produced a number that was exactly equal to k/N, for any integer k in 0..N - 1, in order to generate more bits to break the tie (without this provision, the algorithm is identical to the standard Fisher-Yates as far as the uniformity of the sampling is concerned). Likewise, the tag-sorting shuffle algorithm I posted would need to be modified to handle the highly improbable case that two of the random tags happened to be identical (which would result in a stable-sort artifact), by generating more bits to break the tie.

...but Fisher-Yates wins by being O(N) instead of O(NlogN).

Yes, of course, but the speed superiority of Fisher-Yates has never been in question. My position all along has been limited to defending the algorithm I posted against the claim that it was logically flawed in the same way as the sort-based algorithms discussed in When the Best Solution Isn't are. The problem with those algorithms would remain even if we had infinite-precision computers at our disposal; this is not the case for the sort-based algorithm I posted. Furthermore, in comparison to the errors incurred by those truly flawed algorithms, the errors incurred by numerically-limited algorithms like Fisher-Yates or the one I posted are entirely insignificant.

Update: Fixed minor typo/error above: the range 0..N - 1 mentioned towards the middle of the first paragraph was erroneously given as 1..N - 1 in the original post. Also, immediately before in the same sentence, the reference to the variable k was missing.

the lowliest monk

Create A New User
Node Status?
node history
Node Type: note [id://444397]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (5)
As of 2020-01-22 08:19 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
The worst excuse I have ever heard is:

Results (100 votes). Check out past polls.

Notices?