PerlMonks  

Re: Shuffling CODONS

by bliako (Abbot)
on Jun 08, 2018 at 12:44 UTC ( [id://1216195] )


in reply to Shuffling CODONS

Apart from time benchmarks, the suitability of a shuffle algorithm should be assessed with respect to the quality of the randomness of the shuffled array. One way to do this is to calculate the auto-correlation of the shuffled sequence at lag 1 (i.e. looking at consecutive elements). The absolute value of the auto-correlation coefficient approaches 1 when the sequence is highly auto-correlated (for example the test array 1..1000) and approaches zero when it is not. So, a good-quality shuffle should produce auto-correlations approaching zero.
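For reference, the lag-1 coefficient can be computed without any modules. A minimal sketch (my own, using the plain covariance-over-variance estimator, so its normalisation may differ slightly from what Statistics::Autocorrelation reports):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(sum shuffle);

# Plain lag-1 autocorrelation: covariance of the (x[i], x[i+1]) pairs
# divided by the variance of the whole sequence.
sub acf1 {
    my @x    = @{ $_[0] };
    my $n    = scalar @x;
    my $mean = sum(@x) / $n;
    my $var  = sum( map { ( $_ - $mean )**2 } @x ) / $n;
    my $cov  = sum( map { ( $x[$_] - $mean ) * ( $x[$_ + 1] - $mean ) } 0 .. $n - 2 ) / $n;
    return $cov / $var;
}

printf "sorted   : %.3f\n", acf1( [ 1 .. 1000 ] );          # close to 1
printf "shuffled : %.3f\n", acf1( [ shuffle 1 .. 1000 ] );  # close to 0
```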

Edit: suggested test scenario: start with a highly correlated array (e.g. 1..1000: perl -MStatistics::Autocorrelation -e 'print Statistics::Autocorrelation->new()->coefficient(data=>[1..1000],lag=>1)."\n"' yields 0.997) and see how the shuffling algorithm de-auto-correlates it by lowering its auto-correlation coefficient towards zero.

Edit 2: the auto-correlation coefficient is in the range -1 to 1. Both extremes indicate highly auto-correlated sequences; zero indicates no auto-correlation. In this test I take the absolute value of the coefficient.

The following script compares the three methods mentioned here (BrowserUK's, tybalt89's, and List::Util::shuffle) with respect to auto-correlation and also, for each trial, plots a histogram of the differences between consecutive elements of the shuffled array, just for fun.

The best shuffle is the one that produces the lowest mean auto-correlation with the lowest variance and the most successes (a "success" means it had the minimum auto-correlation in a given trial).

./fisher_yates.pl : after 5000 trials shuffling arrays of size 1000:
List::Util::shuffle : 1693 successes, mean:0.0105896962736892, stdev:0.00900688731621982
BUK : 1685 successes, mean:0.010799062825769, stdev:0.00921403469412604
tybalt89 : 1622 successes, mean:0.0102906705829024, stdev:0.00843760632828801

once more:

./fisher_yates.pl : after 5000 trials shuffling arrays of size 1000:
BUK : 1696 successes, mean:0.0104235933728858, stdev:0.00897497055761236
List::Util::shuffle : 1690 successes, mean:0.0106133000677379, stdev:0.00908235156157047
tybalt89 : 1614 successes, mean:0.0100835174626996, stdev:0.00897955319759652

once more:

./fisher_yates.pl : after 5000 trials shuffling arrays of size 1000:
List::Util::shuffle : 1690 successes, mean:0.0104611128054915, stdev:0.00886345338184372
BUK : 1658 successes, mean:0.0102429744950854, stdev:0.00848038138137249
tybalt89 : 1652 successes, mean:0.0105683142305418, stdev:0.00899061563593633

My opinion: all algorithms work well with respect to randomness (as assessed by auto-correlation) and now we can move to time benchmarks.
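For when we do move on to timing, a minimal sketch with the core Benchmark module (my own; only two of the three shuffles shown, and the labels are mine):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use List::Util ();

my @base = 1 .. 1000;

# Schwartzian-transform shuffle (tybalt89's sort-on-rand approach)
sub shuffle_sort {
    return map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [ $_, rand ] } @_;
}

# Compare iterations/second over ~1 CPU-second each
cmpthese( -1, {
    'List::Util::shuffle' => sub { my @s = List::Util::shuffle(@base) },
    'sort-on-rand'        => sub { my @s = shuffle_sort(@base) },
} );
```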

TODO: try with a different random number generator (e.g. one that is more reliably uniform).

The test program:

#!/usr/bin/env perl
use strict;
use warnings;

use Statistics::Histogram;
use Statistics::Autocorrelation;
use Statistics::Descriptive;
use List::Util qw/shuffle/;

my $N = 1000;
my $trials = 50;
my %mins = ();
for(1..$trials){
    my $res = assess_once();
    if( exists $mins{$res->[0]} ){
        push(@{$mins{$res->[0]}}, $res->[1]->[1]);
    } else {
        $mins{$res->[0]} = [$res->[1]->[1]];
    }
}
print "$0 : after $trials trials shuffling arrays of size $N:\n";
foreach (keys %mins){
    my @re = @{$mins{$_}};
    my $stats = Statistics::Descriptive::Full->new();
    $stats->add_data(@re);
    print $_." : ".scalar(@re)." successes, mean:".$stats->mean()
        .", stdev:".$stats->standard_deviation()."\n";
}
exit(0);

sub assess_once {
    my %results = ();
    my @array = 1..$N;
    shuffleAry_1( \@array );
    $results{'BUK'} = [
        histo_of_differences(\@array),
        corello_abs(\@array)
    ];
    @array = List::Util::shuffle(1..$N);
    $results{'List::Util::shuffle'} = [
        histo_of_differences(\@array),
        corello_abs(\@array)
    ];
    @array = 1..$N;
    @array = @{shuffleAry_2( \@array )};
    $results{'tybalt89'} = [
        histo_of_differences(\@array),
        corello_abs(\@array)
    ];
    my @keys_sorted_autocor_desc =
        sort { $results{$a}->[1] <=> $results{$b}->[1] } keys %results;
    foreach (@keys_sorted_autocor_desc){
        my $hist = $results{$_}->[0];
        my $autocor = $results{$_}->[1];
        print $_.") Autocorrelation coefficient: ".$autocor."\n";
        print $_.") Histogram of the differences of consecutive elements:\n".$hist."\n";
        print "--------------------------\n\n\n";
    }
    foreach (@keys_sorted_autocor_desc){
        my $hist = $results{$_}->[0];
        my $autocor = $results{$_}->[1];
        print $_.") Autocorrelation coefficient: ".$autocor."\n";
    }
    print "assess() : minimum autocorrelation coeff is "
        .$results{$keys_sorted_autocor_desc[0]}->[1]
        ." for ".$keys_sorted_autocor_desc[0]
        ."\n";
    print "assess() : done\n";
    return [$keys_sorted_autocor_desc[0], $results{$keys_sorted_autocor_desc[0]}]
}

sub shuffleAry_2 {
    my $arr = $_[0];
    return [
        map $_->[0],
        sort { $a->[1] <=> $b->[1] }
        map [ $_, rand ],
        @{$arr}
    ]
}

sub shuffleAry_1 {
    die 'Need array reference' unless ref( $_[0] ) eq 'ARRAY';
    our( @aliased, $a, $b );
    local( *aliased, $a, $b ) = $_[0];
    $a = $_ + rand @aliased - $_,
    $b = $aliased[ $_ ],
    $aliased[ $_ ] = $aliased[ $a ],
    $aliased[ $a ] = $b
        for 0 .. $#aliased;
    return;
}

sub corello_abs {
    my $arr = $_[0];
    my $acorr = Statistics::Autocorrelation->new();
    return abs( $acorr->coefficient( data => $arr, lag => 1 ) )
}

sub histo_of_differences {
    my $arr = $_[0];
    my $N = $#$arr;
    my @diffs = (0) x $N;
    for(1..$N){
        $diffs[$_-1] = abs($arr->[$_] - $arr->[$_-1]);
    }
    return Statistics::Histogram::get_histogram(\@diffs);
}

Replies are listed 'Best First'.
Re^2: Shuffling CODONS
by BrowserUk (Patriarch) on Jun 08, 2018 at 16:05 UTC
    TODO: try with a different random number generator (i.e. more reliably uniform)

    Unless the size of the array is approaching the period length of the PRNG, then the quality of the PRNG has little or no effect upon the quality of the shuffling.

    Neither the quality of any given shuffle; nor the quality of successive shuffles; nor the quality of any group of shuffles.

    I'm not going to try to explain that beyond saying it is the nature of modular arithmetic; but by way of evidence I offer this: the standard PRNG used in perl 5.10 on Windows is the notorious MSC rand(), which has only 15 bits: 0-32767.

    However, if you use F-Y to shuffle an array of any size with fewer than 32767 elements, then no matter how many times you do it (within the bounds of reasonable values: say, a human lifetime), you will not detect any bias using simple standard deviations or other simple correlation tools.

    Eg. Run this code with any $L < 32767, and any $N, and try to find some bias using your test and you will fail:

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];

    our $N //= 1e6;
    our $L //= 50;

    my %counts;
    ++$counts{ int( rand $L ) } for 1 .. $N;
    pp \%counts;
    C:\test>junk77 -L=5 -N=1e7
    { "0" => 1999935, 1 => 2000440, 2 => 1999682, 3 => 2001882, 4 => 1998061 }
    C:\test>junk77 -L=5 -N=1e7
    { "0" => 1999465, 1 => 2000290, 2 => 1999884, 3 => 1999629, 4 => 2000732 }
    C:\test>junk77 -L=5 -N=1e7
    { "0" => 1999025, 1 => 1999024, 2 => 2000250, 3 => 1999085, 4 => 2002616 }
    C:\test>junk77 -L=5 -N=1e7
    { "0" => 1999941, 1 => 2001174, 2 => 1998446, 3 => 1999105, 4 => 2001334 }
    C:\test>junk77 -L=5 -N=1e7
    { "0" => 1998594, 1 => 1999564, 2 => 2002043, 3 => 2000208, 4 => 1999591 }
    C:\test>junk77 -L=5 -N=1e8
    { "0" => 19998201, 1 => 20012390, 2 => 19994928, 3 => 19998284, 4 => 19996197 }
    C:\test>junk77 -L=50 -N=1e7
    {
      "0" => 199673, 1 => 200117, 2 => 199619, 3 => 200785, 4 => 199947,
      5 => 201072, 6 => 200075, 7 => 200212, 8 => 200173, 9 => 200781,
      10 => 199524, 11 => 200163, 12 => 199981, 13 => 200973, 14 => 200483,
      15 => 199633, 16 => 199506, 17 => 200081, 18 => 199572, 19 => 200733,
      20 => 198890, 21 => 200602, 22 => 199665, 23 => 199819, 24 => 199935,
      25 => 199939, 26 => 199868, 27 => 199960, 28 => 199116, 29 => 199926,
      30 => 200444, 31 => 200205, 32 => 199426, 33 => 199787, 34 => 199578,
      35 => 199312, 36 => 200249, 37 => 199743, 38 => 201357, 39 => 200411,
      40 => 200164, 41 => 200179, 42 => 199436, 43 => 199302, 44 => 200279,
      45 => 199640, 46 => 199267, 47 => 199733, 48 => 199664, 49 => 201001,
    }

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
Re^2: Shuffling CODONS
by BrowserUk (Patriarch) on Jun 08, 2018 at 16:25 UTC

    Maybe this explains it better?

    The following program uses this heavily biased PRNG:

    sub badRand($) { rand() < 0.5 ? 0 : rand( $_[0] ) }

    which is designed to produce 0, 50% of the time. That means that when asked for numbers 0-9, it produces '0', 10 times more often than any of the other values:

    [549637, 50008, 50261, 50195, 50365, 49871, 50000, 49692, 49768, 50203]

    And when asked for 0..23, produces '0' 25 times more often than any other number:

    [ 521654, 20747, 20716, 20614, 20914, 20740, 20906, 20580, 20625, 21077, 20736, 20912,
      20835, 20820, 20958, 20818, 20866, 20818, 20969, 20646, 20749, 20953, 20454, 20893, ]

    But when that biased PRNG is used within a standard Fisher-Yates shuffle, the shuffles are fair, no matter how many times you run it:

    {
      ABCD => 67436, ABDC => 67742, ACBD => 67990, ACDB => 68277,
      ADBC => 68018, ADCB => 67846, BACD => 67766, BADC => 67859,
      BCAD => 68598, BCDA => 68919, BDAC => 69172, BDCA => 67870,
      CABD => 67755, CADB => 67192, CBAD => 67657, CBDA => 67731,
      CDAB => 67793, CDBA => 67680, DABC => 67755, DACB => 68016,
      DBAC => 68098, DBCA => 67988, DCAB => 67666, DCBA => 68408,
    }

    The test code:

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];

    sub badRand($) { rand() < 0.5 ? 0 : rand( $_[0] ) }

    sub shuffle {
        $a = $_ + badRand( @_ - $_ ),
        $b = $_[$_],
        $_[$_] = $_[$a],
        $_[$a] = $b
            for 0 .. $#_;
        return @_;
    }

    my @rands;
    ++$rands[ badRand( 10 ) ] for 1 .. 1e6;
    pp \@rands;
    <STDIN>;

    my @vals = ( 'A' .. 'D' );
    my %tests;
    my $c = 0;
    while( 1 ) {
        ++$tests{ join '', shuffle( @vals ) };
        unless( ++$c % 576 ) {
            system 'cls';
            pp \%tests;
        }
    }
    __END__
    c:\test> junk77
    [549637, 50008, 50261, 50195, 50365, 49871, 50000, 49692, 49768, 50203]
    {
      ABCD => 67436, ABDC => 67742, ACBD => 67990, ACDB => 68277,
      ADBC => 68018, ADCB => 67846, BACD => 67766, BADC => 67859,
      BCAD => 68598, BCDA => 68919, BDAC => 69172, BDCA => 67870,
      CABD => 67755, CADB => 67192, CBAD => 67657, CBDA => 67731,
      CDAB => 67793, CDBA => 67680, DABC => 67755, DACB => 68016,
      DBAC => 68098, DBCA => 67988, DCAB => 67666, DCBA => 68408,
    }

      But when that biased PRNG is used within a standard Fisher-Yates shuffle, the shuffles are fair, no matter how many times you run it

      I am not convinced.

      A statistical test for assessing fairness is chi-square: throw a die a million times and feed chi-square with the counts for each outcome: '1' => ..., '2' => ..., ... '6' => .... It will tell you whether it thinks the die is fair or not.
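      The statistic itself is simple: for k equally likely outcomes, sum (observed - expected)^2 / expected over the k counts. A minimal sketch for the die example (my own, and the counts are made up for illustration):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(sum);

# Pearson's chi-square statistic for equally likely outcomes:
# X2 = sum( (observed - expected)^2 / expected )
sub chi_square {
    my @obs = @{ $_[0] };
    my $exp = sum(@obs) / scalar(@obs);
    return sum( map { ( $_ - $exp )**2 / $exp } @obs );
}

# Made-up counts for 6e6 throws; with 5 degrees of freedom,
# X2 > 11.07 rejects fairness at the 5% level.
my @fair   = ( 999_340, 1_000_812, 999_511, 1_000_205, 1_000_633, 999_499 );
my @loaded = ( 1_500_000, 900_000, 900_000, 900_000, 900_000, 900_000 );
printf "fair die   : X2 = %.2f\n", chi_square( \@fair );    # well under 11.07
printf "loaded die : X2 = %.2f\n", chi_square( \@loaded );  # enormous
```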

      So I ran your test code with 3 rand-subs:

      use Math::Random::MT;

      my $mt = Math::Random::MT->new();

      sub badRand($)  { rand() < 0.5 ? 0 : rand( $_[0] ) }
      sub goodRand($) { rand($_[0]) }
      sub bestRand($) { $mt->rand($_[0]) }
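      The harness around them follows BrowserUk's test code above, with the rand-sub made pluggable. A minimal core-only sketch of it (my own reconstruction, printing a count summary rather than the chi-square verdict; the counts are what get fed to the test):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(min max sum);

sub badRand($) { rand() < 0.5 ? 0 : rand( $_[0] ) }

# Fisher-Yates driven by a pluggable rand-sub
sub shuffle_with {
    my ( $rnd, @a ) = @_;
    for my $i ( 0 .. $#a ) {
        my $j = $i + int( $rnd->( scalar(@a) - $i ) );  # pick from the not-yet-fixed tail
        @a[ $i, $j ] = @a[ $j, $i ];
    }
    return @a;
}

# Count each of the 24 permutations of A..D over many shuffles;
# these counts are then handed to Statistics::ChiSquare.
my %counts;
++$counts{ join '', shuffle_with( \&badRand, 'A' .. 'D' ) } for 1 .. 100_000;
printf "%d permutations seen, counts %d .. %d\n",
    scalar( keys %counts ), min( values %counts ), max( values %counts );
```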

      Then I ran the chi-squared test (Statistics::ChiSquare) on the counts of the shuffles (i.e. the output of your test code).

      This is the result:

      bad_shuffle : There's a <1% chance that this data is random. 
      good_shuffle : There's a >25% chance, and a <50% chance, that this data is random.
      best_shuffle : There's a >25% chance, and a <50% chance, that this data is random.

      or

      bad_shuffle : There's a <1% chance that this data is random.
      best_shuffle : There's a >50% chance, and a <75% chance, that this data is random.
      good_shuffle : There's a >10% chance, and a <25% chance, that this data is random.
      

      or

      bad_shuffle : There's a <1% chance that this data is random.
      best_shuffle : There's a >25% chance, and a <50% chance, that this data is random.
      good_shuffle : There's a >50% chance, and a <75% chance, that this data is random.
      

      The above test consistently rates "bad_shuffle", based on badRand(), as producing data with less than 1% chance of being random. It is undecided, let's say, about which PRNG is best for this particular shuffle: Mersenne or Perl's.

      The results your test code produced may look fair but they ain't, according to this particular test; it very much depends on the use-case (see end of post). You may be right with your hunch that Mersenne offers nothing more than perl's standard rand() for a F-Y shuffle, but hey, let's not exaggerate with badRand()! :)

      This is my test code:

      sidenote:

      A particular shuffle may be appropriate for a particular situation.

      Such as?

      1. poker and co., where patterns in the shuffled cards may influence the game. The shuffled cards must be tested for the number of patterns they contain.
      2. medical trials: a list of "same-category" (e.g. all healthy) patients must be shuffled in order to then separate them into different groups for trial treatment.
      3. related to #2: I have a group with certain average properties. Then I group them randomly, after a shuffle. Does each group have, on average, the same properties? Or have I grouped together the tall boys and put the short boys in another group, because my shuffle algorithm was not suitable?
        Okay. I ran your code again and got:
        C:\test>junk123
        best_shuffle : There's a <1% chance that this data is random.
        good_shuffle : There's a >50% chance, and a <75% chance, that this data is random.
        bad_shuffle : There's a >5% chance, and a <10% chance, that this data is random.
        That's not right!

        So, I thought about my example code and looked at what it was intended to demonstrate.

        That despite the use of a completely bogus rand() function, a Fisher-Yates shuffle would still operate; and produce results:

        1. That all possible shuffles of the data were being produced. I chose to shuffle 4 values because the 24 possible results fit on a screen and are simple to verify manually.
        2. That they were produced with (approximately) the same frequency. Ie. The number of times each possible shuffle was produced were approximately equal and approximately 1/24th of the total runs.

        In that respect, it served its purpose.

        But, if you are going to formally test a shuffle, using only 4 value arrays and 1e6 iterations probably isn't the ideal scenario to test.

        To that end I tweaked your code to allow me to adjust both parameters from the command line:

        our $NUMTESTS //= 1e6;
        our $ASIZE    //= 4;
        ...
        my @vals = ( 1 .. $ASIZE );

        And, given that Chi2 is generally used to determine whether a (smallish) sample is representative of a large and thus unknown population, I tried using a (moderately) larger array:

        C:\test>junk123 -ASIZE=40 -N=1e5
        best_shuffle : I can't handle 100000 choices without a better table.
        good_shuffle : I can't handle 100000 choices without a better table.
        bad_shuffle : I can't handle 100000 choices without a better table.

        C:\test>junk123 -ASIZE=40 -N=1e4
        best_shuffle : I can't handle 10000 choices without a better table.
        good_shuffle : I can't handle 10000 choices without a better table.

        C:\test>junk123 -ASIZE=40 -N=1e3
        best_shuffle : I can't handle 1000 choices without a better table.
        good_shuffle : I can't handle 1000 choices without a better table.
        bad_shuffle : I can't handle 1000 choices without a better table.

        C:\test>junk123 -ASIZE=40 -N=1e2
        best_shuffle : There's a >99.5% chance, and a <100% chance, that this data is random.
        good_shuffle : There's a >99.5% chance, and a <100% chance, that this data is random.
        bad_shuffle : There's a >99.5% chance, and a <100% chance, that this data is random.

        And once I found a sample size of that larger array that the module could handle, did a few "identical" runs:

        C:\test>junk123 -ASIZE=40 -N=1e2
        best_shuffle : There's a <1% chance that this data is random.
        good_shuffle : There's a >50% chance, and a <75% chance, that this data is random.
        bad_shuffle : There's a >5% chance, and a <10% chance, that this data is random.

        C:\test>junk123 -ASIZE=40 -N=1e2
        best_shuffle : There's a >50% chance, and a <75% chance, that this data is random.
        good_shuffle : There's a >50% chance, and a <75% chance, that this data is random.
        bad_shuffle : There's a <1% chance that this data is random.

        C:\test>junk123 -ASIZE=40 -N=1e2
        best_shuffle : There's a >50% chance, and a <75% chance, that this data is random.
        good_shuffle : There's a >50% chance, and a <75% chance, that this data is random.
        bad_shuffle : There's a >10% chance, and a <25% chance, that this data is random.

        Hm. Not exactly confidence inspiring.

        Let's try some middle ground:

        C:\test>junk123 -ASIZE=11 -N=1e5
        best_shuffle : There's a >75% chance, and a <90% chance, that this data is random.
        good_shuffle : There's a >75% chance, and a <90% chance, that this data is random.
        bad_shuffle : There's a <1% chance that this data is random.

        C:\test>junk123 -ASIZE=11 -N=1e5
        best_shuffle : There's a >25% chance, and a <50% chance, that this data is random.
        good_shuffle : There's a >1% chance, and a <5% chance, that this data is random.
        bad_shuffle : There's a <1% chance that this data is random.

        C:\test>junk123 -ASIZE=11 -N=1e4
        best_shuffle : There's a >75% chance, and a <90% chance, that this data is random.
        good_shuffle : There's a >1% chance, and a <5% chance, that this data is random.
        bad_shuffle : There's a >75% chance, and a <90% chance, that this data is random.

        C:\test>junk123 -ASIZE=11 -N=1e4
        best_shuffle : There's a >50% chance, and a <75% chance, that this data is random.
        good_shuffle : There's a >50% chance, and a <75% chance, that this data is random.
        bad_shuffle : There's a <1% chance that this data is random.

        C:\test>junk123 -ASIZE=11 -N=1e4
        best_shuffle : There's a >10% chance, and a <25% chance, that this data is random.
        good_shuffle : There's a >5% chance, and a <10% chance, that this data is random.
        bad_shuffle : There's a <1% chance that this data is random.

        C:\test>junk123 -ASIZE=11 -N=1e2
        best_shuffle : There's a >10% chance, and a <25% chance, that this data is random.
        good_shuffle : There's a >75% chance, and a <90% chance, that this data is random.
        bad_shuffle : There's a <1% chance that this data is random.

        C:\test>junk123 -ASIZE=11 -N=1e2
        best_shuffle : There's a >1% chance, and a <5% chance, that this data is random.
        good_shuffle : There's a >90% chance, and a <95% chance, that this data is random.
        bad_shuffle : There's a >1% chance, and a <5% chance, that this data is random.

        Sure, it's guessing that the known, deliberately bad rand is producing poor results most of the time, but it's also making the same guess about the known good rand with surprisingly high frequency.

        Two possibilities:

        1. the test is implemented badly;
        2. it is the wrong test for this kind of data.

        I suspect a little of both is at work here.

        I'm far from an expert on stats, but I don't believe that Chi2 is the right test for the kind of samples this produces; and I cannot see any reference to Yates correction in the module.
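        For the record, Yates' correction just shrinks each absolute deviation by 0.5 before squaring. A minimal sketch (my own; note the correction is normally only applied at one degree of freedom, e.g. 2x2 tables):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(sum);

# Pearson's chi-square with Yates' continuity correction:
# X2 = sum( (|O - E| - 0.5)^2 / E ) over the observed counts O,
# with E the expected count under the uniform hypothesis.
sub chi_square_yates {
    my @obs = @{ $_[0] };
    my $exp = sum(@obs) / scalar(@obs);
    return sum( map { ( abs( $_ - $exp ) - 0.5 )**2 / $exp } @obs );
}

# 52 heads / 48 tails out of 100 flips: E = 50 each,
# corrected X2 = 2 * (1.5**2 / 50) = 0.09
printf "X2 = %.2f\n", chi_square_yates( [ 52, 48 ] );
```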

        More later.


        I am not convinced. A statistical test for assessing statistical fairness is chi-square:

        Well yes. Chi2 is the "standard test" in certain circles; but it needs

        1. to be implemented properly;
        2. to include Yates' (of Fisher-Yates fame) correction;
        3. to be used correctly;

        I haven't investigated the particular implementation you used, but I did install it so I could run your code. A few times...

        C:\test>junk123
        best_shuffle : There's a >25% chance, and a <50% chance, that this data is random.
        good_shuffle : There's a >1% chance, and a <5% chance, that this data is random.
        bad_shuffle : There's a <1% chance that this data is random.

        C:\test>junk123
        best_shuffle : There's a >75% chance, and a <90% chance, that this data is random.
        good_shuffle : There's a >25% chance, and a <50% chance, that this data is random.
        bad_shuffle : There's a >50% chance, and a <75% chance, that this data is random.

        C:\test>junk123
        best_shuffle : There's a >10% chance, and a <25% chance, that this data is random.
        good_shuffle : There's a >5% chance, and a <10% chance, that this data is random.
        bad_shuffle : There's a >50% chance, and a <75% chance, that this data is random.

        In 3 runs, the probability it assigns to the results of (one of) the best PRNGs available being actually random varies between 10% and 90%, whilst that of a deliberately buggered PRNG varies between 1% and 75%.

        Honestly, those 3 sets of results look as good an approximation of random as I can think of.

        I'll investigate whether the problem is the way you are using it, the implementation, or the potted interpretations it produces later, once I've woken up properly; but at first glance, I'd have to say "something's not right".

        (Maybe we could feed S::CS its own results and see if it judges itself random :) )


