Yet Another Fair Shuffle
by tlm (Prior) on Apr 04, 2005 at 13:34 UTC ( [id://444670] = perlmeditation )
I realize that I'm risking excommunication by posting YAFSN, but I felt that I was beginning to repeat myself, so rather than reply to this latest comment deep in the bowels of Roy Johnson's very interesting functional shuffle thread, I figured I would just give the issue its own node, since it has little to do with functional programming, and I sense it may be a point of confusion.

In the thread When the Best Solution Isn't, and more recently in Is this a fair shuffle?, it was shown that certain "sort-based" shuffle algorithms do not sample the space of permutations uniformly (i.e. are not "fair"). More recently still, I posted a "sort-based" algorithm, which I dubbed the "Schwartzian Shuffle" because it uses the Schwartzian Transform that Perl programmers know and love, though the algorithm is much older than Perl, and probably goes by many other names (the one I recall seeing is "tag sorting").

The algorithm tags every element of the original array with a float randomly chosen in the unit interval, and sorts the tagged elements according to the numerical order of the tags. Tagging-sorting-untagging is what the Schwartzian Transform is all about; this is just an unusual special case in which the tags bear no relation to the tagged elements.

The fairness of this algorithm (setting aside finite-precision artifacts) is pretty obvious, I think. Any formal mathematical proof I can come up with is pretty tedious, especially without the benefit of basic mathematical notation, but here is a sketch of how one such proof would go. The probability that the first element in the input list gets the tag with the lowest value, and therefore ends up in the first position of the shuffled list, is 1/N. The probability that it gets the second-lowest tag equals the probability that it doesn't get the lowest tag (i.e. (N - 1)/N) times the probability that it gets the lowest among the remaining N - 1 tags (i.e. 1/(N - 1)); this product is also 1/N. It's easy to see the induction.

Contrast this with one of the sort-based algorithms mentioned in When the Best Solution Isn't, for example qshuf. The role of sorting in that algorithm is entirely different from its role in tag sorting: there, the sort's comparator itself has been modified to effect the shuffling. It is not as easy to determine the fairness or lack thereof of this algorithm through a mathematical argument (at least it's not as easy for me), but a simple operational test shows that it is far from fair.

And yes, as the article that Roy cited shows, the tag-sorting algorithm is not perfectly fair, due to artifacts of the finite precision of the random numbers used as tags: there is a small but non-zero probability that two tags will have the same value, and this, combined with the stability of the sorting, leads to deviations from fairness. (I blather more on this here, if anyone is in the mood for a li'l hair-splittin'.) But this is not an intrinsic flaw of the algorithm, and at any rate, its deviation from perfect fairness is too minuscule to lose sleep over (unless you are a Haskell programmer, I suppose :-) ). The principal reason not to use tag sorting is that it is not as efficient, in either space or time, as Fisher-Yates (which, by the way, has minuscule deviations from perfect fairness of its own, again due to finite-precision artifacts).

the lowliest monk
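The node's original code listing isn't reproduced above, but a minimal tag-sorting shuffle along the lines described, written as a Schwartzian Transform, would look something like this (the sub name is mine, not necessarily the original's):

```perl
use strict;
use warnings;

# Tag-sort ("Schwartzian") shuffle: pair each element with a random
# float in [0, 1), sort the pairs numerically by the tag, then strip
# the tags off again.
sub schwartzian_shuffle {
    return map  { $_->[1] }               # untag
           sort { $a->[0] <=> $b->[0] }   # sort by tag
           map  { [ rand, $_ ] }          # tag
           @_;
}

my @shuffled = schwartzian_shuffle( 1 .. 10 );
print "@shuffled\n";
```

In production code, of course, List::Util's shuffle (a C-level Fisher-Yates) is the usual choice.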
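I don't have qshuf's exact code in front of me here, but the random-comparator idea it exemplifies, together with the kind of operational test alluded to above, can be sketched as follows: tally the orderings produced for a three-element list over many trials. A fair shuffle would leave the six counts roughly equal (about 10,000 each below); a random comparator typically does not.

```perl
use strict;
use warnings;

# A random-comparator "shuffle": the comparator ignores its arguments
# and answers at random. Because the comparator is inconsistent, the
# distribution of results depends on the sort implementation's
# comparison pattern, which is exactly why fairness is hard to reason
# about here.
sub comparator_shuffle {
    return sort { rand() < 0.5 ? -1 : 1 } @_;
}

# Operational fairness test: count how often each of the 3! = 6
# orderings of (1, 2, 3) shows up.
my %count;
for ( 1 .. 60_000 ) {
    my @p = comparator_shuffle( 1, 2, 3 );
    $count{"@p"}++;
}
printf "%-7s %6d\n", $_, $count{$_} for sort keys %count;
```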
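For comparison, the Fisher-Yates shuffle recommended above runs in O(N) time with O(1) extra space, versus O(N log N) time and O(N) space for tag sorting. A standard in-place version (written from the textbook algorithm, not copied from any of the nodes cited) looks like this:

```perl
use strict;
use warnings;

# In-place Fisher-Yates shuffle: walk the array down from the top,
# swapping each element with a uniformly chosen element at or below
# its position.
sub fisher_yates {
    my @a = @_;
    for my $i ( reverse 1 .. $#a ) {
        my $j = int rand( $i + 1 );
        @a[ $i, $j ] = @a[ $j, $i ];
    }
    return @a;
}

print join( ' ', fisher_yates( 1 .. 10 ) ), "\n";
```

The "minuscule deviations" mentioned above come from int rand($i + 1): rand's underlying generator has finite precision, so the N! permutations cannot all be hit with exactly equal probability.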