Fisher-Yates theory

by Anonymous Monk
on Jul 24, 2003 at 04:37 UTC

Hello. I'm writing a paper in which I need to randomize an array. I've been searching for the Fisher-Yates shuffle on the net, but many links only give me the FY code. Can anyone enlighten me about the theory behind the FY shuffle? Thanks...

Re: Fisher-Yates theory
by MarkM (Curate) on Jul 24, 2003 at 05:39 UTC

    In addition to what sauoq wrote:

    I'm not convinced that the Fisher-Yates shuffle actually has very much theory with regard to random order involved at all.

    On the surface, each array entry has the appearance of being given a single opportunity at being swapped with another array entry. In practice, array entries nearer to the beginning of the array have additional opportunities of being swapped (as a result of later swaps), meaning that less shuffling happens at the end of the array than at the beginning.

    One could argue in great length whether additional shuffling in the array improves the randomness, or whether it produces patterns in the longer term, but it isn't difficult to argue that regardless of which conclusion is valid, the fact that the entries nearer to the beginning have additional chances of being shuffled, while entries nearer the end have fewer chances, that *one* of these ends will be improved more than the other, therefore the algorithm is not optimal.

    UPDATE: See node Fisher-Yates theory... does this prove that it is invalid? for my 'proof' that Fisher-Yates weights the results.

      I'm not convinced that the Fisher-Yates shuffle actually has very much theory with regard to random order involved at all.

      Well, you are wrong. The first publication of the Fisher-Yates shuffle dates from 1938: it's Example 12 in the book Statistical Tables, whose authors are R. A. Fisher and F. Yates.

      It was also discussed by R. Durstenfeld in 1964, in Communications of the ACM, volume 7, p. 420.

      And of course Donald Knuth mentions it, in Volume 2 of The Art of Computer Programming. In the third edition, it's algorithm P, section 3.4.2 on page 145.

      Abigail

      Wow...
      by RMGir (Prior) on Jul 24, 2003 at 13:29 UTC
        ++

        If there was ever a node that deserves to make the Best Nodes, that's it...

        I bookmarked it so I can read that article from ACM's digital library.

        UPDATE: I read the paragraph in the ACM digital library, for algorithm 235. How did you ever find that citation? It is the algorithm, but he doesn't credit Fisher-Yates...
        --
        Mike

        UPDATE: As pointed out by others, I made an error when translating the code. See their summaries for good explanations. Cheers, and thanks everyone. (tail between legs)

        In a previous article, I was challenged for doubting the effectiveness of the Fisher-Yates shuffle as described in perlfaq.

        Below, I have written code that exhausts all possible random sequences that could be used during a particular Fisher-Yates shuffle. Statistically, this should be valid, as before the shuffle begins, the generated random sequence is as likely to be 0 0 0 0 0 as 0 1 2 3 4 as 4 4 4 4 4. By exhaustively executing the Fisher-Yates shuffle and counting the number of occurrences of each result set, we can determine whether the Fisher-Yates shuffle has the side effect of weighting the results, or whether the shuffle is truly random, in which case the results should be spread out approximately evenly.

        my $depth = 5;
        my %results;

        sub recurse {
            if (@_ >= $depth) {
                my @deck = (1 .. $depth);
                shuffle(\@deck, [@_]);
                $results{join('', @deck)}++;
            }
            else {
                recurse(@_, $_) for 0 .. ($depth - 1);
            }
        }

        sub shuffle {
            my ($deck, $rand) = @_;
            my $i = @$deck;
            while ($i--) {
                my $j = shift @$rand;
                @$deck[$i, $j] = @$deck[$j, $i];
            }
        }

        recurse;

        for (sort { $results{$b} <=> $results{$a} } keys %results) {
            printf "%10d %s\n", $results{$_}, $_;
        }

        With the above code, I was able to determine that with a deck size of 5, and an initial set of 1 2 3 4 5, there is three times the probability that the resulting set will be 3 1 2 5 4 than the probability that the resulting set will be 2 3 4 5 1. To me, this indicates that this theory is flawed.

        If anybody needs to prove to themselves that the test is exhaustive, print out "@$rand" in the shuffle subroutine.

        Please analyze the code carefully, pull out your school books, and see if I have made a mistake.

        Cheers,
        mark

        Again with the "you are wrong." Why the acerbic and argumentative tone? Up to this point, this appears to be a friendly conversation about their views on the theory, which was informative and interesting.

        If someone says they're not convinced, you can't disprove that. One may heap on additional evidence to attempt to convince them, but they may continue to be unconvinced for rational or irrational reasons.

        This site often discusses matters of faith. There's no right or wrong about being convinced. "You are wrong" marks an absolutism, both in grammar and in connotation. I prefer a community which demonstrates respect for others' faith and others' opinions.

        Fisher-Yates may have been proven effective by way of analyzing the results for uniform potential entropy. It may have been proven by logical analysis of the probabilities on each successive swap. Sharing that evidence is helpful, but I ask politely for everyone to build a constructive community, not one which promotes controversy and arrogance.

        --
        [ e d @ h a l l e y . c c ]

      On the surface, each array entry has the appearance of being given a single opportunity at being swapped with another array entry. In practice, array entries nearer to the beginning of the array have additional opportunities of being swapped (as a result of later swaps), meaning that less shuffling happens at the end of the array than at the beginning.

      That is what intuition says, but in this case intuition falls short of reality. A Fisher-Yates shuffle avoids (not introduces) bias by making the pool smaller on each iteration. There are n! possible permutations of a set of n items. After the first iteration, an F-Y shuffle gives n possible outcomes, each equally likely. The second iteration yields (n - 1) outcomes for each of the n possibilities, leaving us with n*(n-1) possibilities - again equally likely. Follow that to its conclusion and you get n(n-1)(n-2)...1 possibilities, each equally likely.

      For an example, take a 3-item set. There are 3! (= 6) possible permutations of this set if it is shuffled. On the first iteration of the loop, there are three possibilities: a-(1 2 3), b-(2 1 3), and c-(3 2 1). The second iteration only swaps the 2nd and 3rd elements, so for a you have an equal possibility of (1 2 3) and (1 3 2); for b - (2 1 3) and (2 3 1); for c - (3 2 1) and (3 1 2). None of the possibilities is duplicated; each one has a 1/6 chance of being selected.

      Initial      Iter 1       Iter 2
      (1 2 3)  ->  (1 2 3)  ->  (1 2 3)
                            ->  (1 3 2)
               ->  (2 1 3)  ->  (2 1 3)
                            ->  (2 3 1)
               ->  (3 2 1)  ->  (3 2 1)
                            ->  (3 1 2)

      Six possibilities, each equally likely.
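      The enumeration above is easy to check mechanically. The following sketch (an editorial illustration, not part of the original post) runs a Fisher-Yates pass over 3 elements for every possible sequence of "random" choices, taking the choices from a preset list instead of rand; each of the 6 permutations turns up exactly once.

```perl
use strict;
use warnings;

# One Fisher-Yates pass, but with the "random" choices taken from a
# preset list instead of rand(), so every branch can be enumerated.
sub shuffle_with {
    my ($deck, $choices) = @_;
    my $i = @$deck;
    while ($i--) {
        my $j = shift @$choices;    # stands in for int rand($i + 1)
        @$deck[$i, $j] = @$deck[$j, $i];
    }
}

# For n = 3 the valid choice ranges are j = 0..2, then 0..1, then only 0.
my %count;
for my $j2 (0 .. 2) {
    for my $j1 (0 .. 1) {
        my @deck = (1, 2, 3);
        shuffle_with(\@deck, [$j2, $j1, 0]);
        $count{"@deck"}++;
    }
}
# Each of the 6 permutations occurs exactly once.
printf "%s occurs %d time(s)\n", $_, $count{$_} for sort keys %count;
```

      Every valid choice sequence (j drawn from 0..i at each step) maps to a distinct permutation, which is exactly the uniformity argument made above.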

      Another way to look at it is this: The first element has a 2/3 chance of getting swapped the first time and a 1/2 chance the second - giving it a 1/2 * 2/3 = 1/3 chance of ending up in any given slot.
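      The 1/3 figure can also be checked empirically. This small simulation (again an editorial sketch, using the perlfaq shuffle) counts which slot element 1 lands in over many trials; each slot comes up about a third of the time.

```perl
use strict;
use warnings;

# Empirical check: where does element 1 end up after a Fisher-Yates
# shuffle of (1, 2, 3)?  Each slot should be hit about 1/3 of the time.
sub fisher_yates_shuffle {
    my $deck = shift;
    my $i = @$deck;
    while ($i--) {
        my $j = int rand($i + 1);
        @$deck[$i, $j] = @$deck[$j, $i];
    }
}

my $trials = 60_000;
my @slot = (0, 0, 0);
for (1 .. $trials) {
    my @deck = (1, 2, 3);
    fisher_yates_shuffle(\@deck);
    for my $pos (0 .. 2) {
        $slot[$pos]++ if $deck[$pos] == 1;
    }
}
printf "slot %d: %.3f\n", $_, $slot[$_] / $trials for 0 .. 2;
```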

      Update: Finally, Re: Re: Re: random elements from array - new twist shows a statistical analysis of a Fisher-Yates shuffle. Whew, I'm done. I hope this wasn't homework - or if it was, Anonymous Monk learned something ;-)

      One could argue in great length whether additional shuffling in the array improves the randomness, or whether it produces patterns in the longer term,

      If your RNG were perfect, additional shuffles would neither improve nor degrade the randomness.

      but it isn't difficult to argue that regardless of which conclusion is valid, the fact that the entries nearer to the beginning have additional chances of being shuffled, while entries nearer the end have fewer chances, that *one* of these ends will be improved more than the other, therefore the algorithm is not optimal.

      Unless neither conclusion is valid (see above) in which case a single shuffle is no better or worse than multiple shuffles and the algorithm is optimal.

      In practice, however, our RNGs aren't perfect and you may have a point. I suppose you could resolve it by keeping track of the original indexes of the elements and then apply your map and do a second FY shuffle starting with the previous last and moving backwards to the previous first... still linear in time and space... but I doubt it is worth it. :-)

      -sauoq
      "My two cents aren't worth a dime.";
      
        The biggest problem with shuffling algorithms (not just Fisher-Yates) is the imperfectness of the RNG. The sequence of numbers it returns is determined by the seed, and only by the seed. And if from the seed only 32 bits are being used, there will be at most 2**32 different sequences. Which means that for a deck of more than 2 cards, Fisher-Yates cannot be 'fair', as N! won't be a divisor of any power of 2. Furthermore, 13! > 2**32, so even if you shuffle a deck of 13 cards, some permutations will *never* be selected, no matter how many times you perform the shuffle.

        This was pointed out by R. Salfi: COMPSTAT 1974, Vienna: 1974, pp 28 - 35.

        See also the documentation of the Shuffle module on CPAN.
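        Abigail's arithmetic is quick to verify with a few lines (an editorial sketch, not part of the original post): 12! still fits below 2**32, but 13! does not.

```perl
use strict;
use warnings;

# A 32-bit seed can produce at most 2**32 distinct random sequences, so
# a fair shuffle of n cards needs n! <= 2**32 at the very least.
my $limit = 2**32;
my $fact  = 1;
for my $n (1 .. 13) {
    $fact *= $n;
    printf "%2d! = %13.0f (%s 2**32)\n", $n, $fact,
        $fact > $limit ? '>' : '<=';
}
# 12! = 479001600 is below 2**32 = 4294967296, but 13! = 6227020800
# exceeds it, so with 13 or more cards some permutations can never
# be produced, no matter how many times the shuffle is run.
```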

        Abigail

        One could argue in great length whether additional shuffling in the array improves the randomness, or whether it produces patterns in the longer term,

        If your RNG were perfect, additional shuffles would neither improve nor degrade the randomness.

        It seems that the "additional shuffling" is required to prevent a permutation of the array in which there is a correlation between the items that are swapped.

        Consider the shuffling of an array of three items without the so called "additional shuffling": Starting from the left:

        Initial: abc
        1 move:  abc(1/3)  bac(1/3)  cba(1/3)
        2 move:  abc(1/6)  acb(1/6)
        Updated: per Abigail's criticism of a mistyped table and obscurity. To clarify:

        I take issue with:

        On the surface, each array entry has the appearance of being given a single opportunity at being swapped with another array entry. In practice, array entries nearer to the beginning of the array have additional opportunities of being swapped (as a result of later swaps), meaning that less shuffling happens at the end of the array than at the beginning.

        One could argue in great length whether additional shuffling in the array improves the randomness, or whether it produces patterns in the longer term, but it isn't difficult to argue that regardless of which conclusion is valid, the fact that the entries nearer to the beginning have additional chances of being shuffled, while entries nearer the end have fewer chances, that *one* of these ends will be improved more than the other, therefore the algorithm is not optimal.

        Which seems to say that the possible movement of an already moved item is a weakness of the algorithm. But an array cannot be randomly shuffled in place without allowing elements to be moved more than once.

        The idea that the errors of an RNG will more adversely affect the items to which that RNG is more often applied is a subtler argument. I reject it out of hand: if the RNG is incorrect, the results will not be random, and we are merely quibbling about how obvious the error is. That distinction may be very important in practice, and there are practical ways to obtain pseudo-randomness that is okay for varying definitions of okay.

      IMHO the definition of a "good" shuffle algorithm is that, starting from a sequence, each of its permutations has the same probability of being the outcome of the shuffle on that starting sequence.

      Many "home grown" algorithms violate this criterion and hence are to be considered, well, bad shuffle algorithms. Fisher & Yates satisfies this criterion however (the proof is left as "homework" to the original poster).

      You can have a look at Algorithm::Numerical::Shuffle for references as well as a short discussion.

      Just my 2 cents, -gjb-

Re: Fisher-Yates theory
by sauoq (Abbot) on Jul 24, 2003 at 05:09 UTC

    The following is straight out of the FAQ. Use perldoc -q shuffle to read the whole entry.

    sub fisher_yates_shuffle {
        my $deck = shift;    # $deck is a reference to an array
        my $i = @$deck;
        while ($i--) {
            my $j = int rand($i + 1);
            @$deck[$i, $j] = @$deck[$j, $i];
        }
    }

    Update: Uh. Gee... you did ask for the theory not the code. Sorry. What is there to tell you? It's O(N) for obvious reasons and it works on the data in place so memory usage is minimal. It's a very straightforward algorithm.
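    A minimal usage sketch (editorial, not from the FAQ entry itself): the routine shuffles in place through the array reference, so nothing needs to be returned.

```perl
use strict;
use warnings;

# The perlfaq Fisher-Yates shuffle, shuffling a deck in place.
sub fisher_yates_shuffle {
    my $deck = shift;    # $deck is a reference to an array
    my $i = @$deck;
    while ($i--) {
        my $j = int rand($i + 1);
        @$deck[$i, $j] = @$deck[$j, $i];
    }
}

my @cards = (1 .. 52);
fisher_yates_shuffle(\@cards);    # @cards is now shuffled in place
print "top of the deck: @cards[0 .. 4]\n";
```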

    -sauoq
    "My two cents aren't worth a dime.";
    
      Well, there's more to an algorithm than its run-time complexity. You also need to prove the algorithm is correct. In this case, that means showing that the resulting array is a permutation of the original one (fairly trivial), and that each permutation has an equal chance of being selected (a bit more work, but still not too hard, if you can assume your random generator produces truly random numbers).

      For details, see the Knuth reference I made in another post in this thread.

      Abigail

Re: Fisher-Yates theory
by Skeeve (Parson) on Jul 24, 2003 at 05:58 UTC
    If you look at this, you will find that there is a term in statistics called the "Fisher-Yates Test". So maybe our "homeworker" might want to look at that.

    BTW: This is not a perl question.

      I consider algorithm-related questions that can be applied in Perl to be all right on this forum, if not in SOPW.

      artist

        No, not really. You see, the title says "perlquestion" (update: in my title it does, because I have a custom node title definition), which it most certainly is not. Had it been approved in the meditations category, it would still be homework, but it might be appropriate. Had he included his attempts at implementing said algorithm, then it would be OK in SOPW (perlquestion).

        On a slight sidenote, PerlMonks is a community. Say it with me, not a forum ;)

        MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
        I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
        ** The third rule of perl club is a statement of fact: pod is sexy.

Re: Fisher-Yates theory
by bart (Canon) on Aug 06, 2003 at 12:47 UTC
    The Fisher-Yates shuffle is an algorithm I came up with myself, independent of any previous work... so I'll give you my view on it, may I?

    As an example, take a deck of cards, 52 of them. Pick any card. You'll agree that the card chosen is completely random, no? No preference towards the old order? OK. Now the most important step: take it out of the deck. That's your first card of the new, shuffled deck.

    Now you have a deck of 51 cards left. Pick one, any one... That's your second card, completely random. Now you have a deck of 50 cards left...

    Repeat, taking one random card out of the deck each time, and adding it to the new deck, until you're left with an empty deck, with no cards left. Your new deck now contains all cards and is shuffled completely randomly.

    As an optimisation for the implementation, you can see that the total number of cards in both decks is always the same. So instead of physically creating two decks, we take a paper bookmark and insert it between the old and the new deck. We move the chosen card out of the old deck and into the new deck, simply by swapping the chosen card and the former first card of the old deck, and then moving the bookmark one position up so that it's now the last card of the new deck.

    Since the order of the new deck is completely independent of the order in the old deck, doing it this way, changing the order of the old deck while you're at it, doesn't influence the statistical properties of the end result, at all.
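    The literal two-deck procedure described above can be sketched as follows (an editorial illustration; the splice stands in for physically removing a card from the old deck).

```perl
use strict;
use warnings;

# The two-deck version, taken literally: draw a random card out of the
# old deck and append it to the new one until the old deck is empty.
sub two_deck_shuffle {
    my @old = @_;
    my @new;
    while (@old) {
        my $pick = int rand(scalar @old);     # any remaining card, uniformly
        push @new, splice(@old, $pick, 1);    # remove it from the old deck
    }
    return @new;
}

my @shuffled = two_deck_shuffle(1 .. 52);
print "top of the deck: @shuffled[0 .. 4]\n";
```

    The splice makes this O(n**2); the bookmark trick in the post replaces the splice with a swap, which is exactly the in-place Fisher-Yates loop and brings it back to O(n).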

Node Type: perlmeditation [id://277423]
Approved by ybiC
Front-paged by ybiC