Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Re^5: Shuffling CODONS

by bliako (Abbot)
on Jun 10, 2018 at 11:41 UTC ( [id://1216316]=note: print w/replies, xml ) Need Help??


in reply to Re^4: Shuffling CODONS
in thread Shuffling CODONS

I'm far from an expert on stats, but I don't believe that Chi2 is the right test for the kind of samples this produces; and I cannot see any reference to Yates correction in the module.

Me neither, far from expert. Beyond chi-squared there are tests for finding a pattern in the shuffled data (I once tried to zip a file and count its compression ration achieved as proportional to how random the sequence is), and monte-carlo-splitting the shuffled array in groups and trying to see if the group's average approaches the total array's average.

I will report back chi-squared results using R

That despite the use of a completely bogus rand() function, a Fisher-Yates shuffle would still operate; and produce results:

    That all possible shuffles of the data were being produced. I chose to shuffle 4 values because the 24 possible results fit on a screen and are simple to verify manually.
    That they were produced with (approximately) the same frequency. Ie. The number of times each possible shuffle was produced were approximately equal and approximately 1/24th of the total runs. 

In that respect, it served its purpose.

But, if you are going to formally test a shuffle, using only 4 value arrays and 1e6 iterations probably isn't the ideal scenario to test. 

ok

Replies are listed 'Best First'.
Re^6: Last comment (and nail in the coffin of) of S::CS :)
by BrowserUk (Patriarch) on Jun 10, 2018 at 13:54 UTC

    Just for fun I made a copy of the chisquare() function in S::CS, that returns the number it calculates, rather than its string assessment of what that value means:

    sub chisquaredVal { my @data = @_; @data = @{$data[0]} if(@data == 1 and ref($data[0])); return "There's no data!" if(!@data); return "Not enough data!" if(@data == 1); # return "Malformed data!" if(grep { /\D/ } @data); my $degrees_of_freedom = scalar(@data) - 1; my ($chisquare, $num_samples, $expected, $i) = (0, 0, 0, 0); if (! ref($chitable[$degrees_of_freedom])) { return "I can't handle ".scalar(@data)." choices without a bet +ter table."; } foreach (@data) { $num_samples += $_ } $expected = $num_samples / scalar(@data); # return "There's no data!" unless $expected; foreach (@data) { $chisquare += (($_ - $expected) ** 2) / $expected; } return $chisquare; }

    I also had to disable the "Malformed data" tests, as they reject floating point numbers!)

    Then I wrote a script that repeated the test a hundred times on the F-Y shuffle using MT rand(), and gathered the numerical values it produces into an array:

    #! perl -slw use strict; use Statistics::ChiSquare qw[ chisquare chisquaredVal ]; use Math::Random::MT; use Data::Dump qw[ pp ]; my $mt = Math::Random::MT->new(); our $N //= 1e6; our $ASIZE //= 4; our $T //= 4; sub shuffle { $a = $_ + $mt->rand( @_ - $_ ), $b = $_[$_], $_[$_] = $_[$a], $_[$a] = $b for 0 .. $#_; return @_; } my @data = ( 1 .. $ASIZE ); my @chi; for( 1 .. $T ) { my %tests; ++$tests{ join '', shuffle( @data ) } for 1 .. $N; push @chi, chisquaredVal( values %tests ); }

    And finally, ran the original chisquare() function on its own output for assessment:

    print chisquare( @chi ); __END__ C:\test>chiSquareChiSquare -ASIZE=7 -N=2e2 -T=1e2 There's a >99.5% chance, and a <100% chance, that this data is random. C:\test>chiSquareChiSquare -ASIZE=7 -N=2e2 -T=1e2 There's a >90% chance, and a <95% chance, that this data is random. C:\test>chiSquareChiSquare -ASIZE=7 -N=2e2 -T=1e2 There's a >99.5% chance, and a <100% chance, that this data is random.

    Finally, I think it got something right!

    Its own output is truly random :)

    (Sorry. I was bored :) )


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit

      Fine. See also calculate Chi-Square test

      Back to business. I tried delegating the chi-squared test to R using:

      use Statistics::R; sub do_chi_squared_test_in_R { my $freqs = $_[0]; # arrayref of counts/frequencies for each c +hoice my $R = Statistics::R->new(); # should really re-use R obj but + let's not complicated things now. $R->send('library(MASS)'); $R->set('freqs', $freqs); $R->run('res <- chisq.test(as.table(freqs))'); my $chisquare = $R->get('res$statistic'); # single quotes plea +se my $pvalue = $R->get('res$p.value'); $R->stop(); print "do_chi_squared_test_in_R : ".join(",",@$freqs)." : x2=$ +chisquare, pv=$pvalue\n"; # returns [chi-squared-statistic, p-value] return [$chisquare, $pvalue] }

      I have also modified your chisquaredVal() to return the statistic as it does, along with an indicative p-value.

      my $pvalue = -1; foreach (@{$Statistics::ChiSquare::chitable[$degrees_of_freedo +m]}) { if ($chisquare < $_) { $pvalue = $Statistics::ChiSquare::chilevels[$degre +es_of_freedom]->[$i+1]; last } $i++; } $pvalue = (@{$Statistics::ChiSquare::chilevels[$degrees_of_fre +edom]})[-1] unless $pvalue > -1; $pvalue /= 100; # 0-1 e.g. 5% -> 0.05

      Then, I modified your script for doing a chi-square test using R as well. These are some results:

      ./browser_uk_test_script.pl : number of cases with bias detected ./browser_uk_test_script.pl : standard-rand (using R) -> number of bia +s detected: 1 / 20 (pvalues=,0.1063036,0.4628186,0.01704211,0.45147,0.6672964,0.1 +163614,0.4996122,0.06030822,0.764618,0.08935117,0.1812506,0.1615832,0 +.3479303,0.9551496,0.1814904,0.1601532,0.3921217,0.9140581,0.621188,0 +.6104723) ./browser_uk_test_script.pl : standard-rand (using Perl) -> number of +bias detected: 1 / 20 (pvalues=,0.1,0.25,0.01,0.25,0.5,0.1,0.25,0.05,0.75,0.05,0.1,0 +.1,0.25,0.95,0.1,0.1,0.25,0.9,0.5,0.5) -------------------- ./browser_uk_test_script.pl : MT-rand (using R) -> number of bias dete +cted: 2 / 20 (pvalues=,0.8734286,0.4596782,0.0376436,0.483486,0.3135795,0.4 +501071,0.3534776,0.6427224,0.1530366,0.1850991,0.7265003,0.6379416,0. +9933123,0.01561197,0.7942019,0.2008258,0.8592597,0.2874155,0.211755,0 +.6454129) ./browser_uk_test_script.pl : MT-rand (using Perl) -> number of bias d +etected: 2 / 20 (pvalues=,0.75,0.25,0.01,0.25,0.25,0.25,0.25,0.5,0.1,0.1,0.5,0 +.5,0.99,0.01,0.75,0.1,0.75,0.25,0.1,0.5) -------------------- ./browser_uk_test_script.pl : bad-rand (using R) -> number of bias det +ected: 16 / 20 (pvalues=,0.00362283,6.262224e-05,4.627376e-06,1.718484e-06,0. +07608554,0.002237529,0.01988714,0.01380236,0.07770095,0.04196283,0.01 +394789,6.257283e-05,0.608835,0.00027414,0.09803577,0.01257798,0.00543 +3616,0.02350288,0.00299678,0.004870376) ./browser_uk_test_script.pl : bad-rand (using Perl) -> number of bias +detected: 16 / 20 (pvalues=,0.01,0.01,0.01,0.01,0.05,0.01,0.01,0.01,0.05,0.01,0. +01,0.01,0.5,0.01,0.05,0.01,0.01,0.01,0.01,0.01) --------------------

      once more:

      ./browser_uk_test_script.pl : number of cases with bias detected ./browser_uk_test_script.pl : standard-rand (using R) -> number of bia +s detected: 0 / 20 (pvalues=,0.5337195,0.5921474,0.1788502,0.7560645,0.1524058,0. +9634717,0.3517389,0.1133675,0.8631978,0.1689414,0.1829178,0.9446986,0 +.7018559,0.3831006,0.422912,0.2526592,0.6182125,0.7644635,0.4913882,0 +.1258621) ./browser_uk_test_script.pl : standard-rand (using Perl) -> number of +bias detected: 0 / 20 (pvalues=,0.5,0.5,0.1,0.75,0.1,0.95,0.25,0.1,0.75,0.1,0.1,0.9, +0.5,0.25,0.25,0.25,0.5,0.75,0.25,0.1) -------------------- ./browser_uk_test_script.pl : bad-rand (using R) -> number of bias det +ected: 12 / 20 (pvalues=,0.06574112,0.00283819,6.942476e-06,5.380104e-05,1.69 +7578e-05,0.001277564,0.01886317,0.0003957457,8.826589e-06,0.07862541, +0.0001284624,0.2741576,0.1136995,0.1550829,0.007600639,0.00122368,0.0 +001251429,0.1258741,0.234761,0.1011256) ./browser_uk_test_script.pl : bad-rand (using Perl) -> number of bias +detected: 12 / 20 (pvalues=,0.05,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.05,0. +01,0.25,0.1,0.1,0.01,0.01,0.01,0.1,0.1,0.1) -------------------- ./browser_uk_test_script.pl : MT-rand (using R) -> number of bias dete +cted: 1 / 20 (pvalues=,0.5276704,0.06869204,0.3610061,0.9570774,0.0468298,0 +.514135,0.3893473,0.9621195,0.8655001,0.4279424,0.2365762,0.6020179,0 +.2011017,0.7721139,0.310637,0.3746697,0.2099162,0.1343461,0.7025518,0 +.2929772) ./browser_uk_test_script.pl : MT-rand (using Perl) -> number of bias d +etected: 1 / 20 (pvalues=,0.5,0.05,0.25,0.95,0.01,0.5,0.25,0.95,0.75,0.25,0.1, +0.5,0.1,0.75,0.25,0.25,0.1,0.1,0.5,0.25) --------------------

      once more:

      ./browser_uk_test_script.pl : number of cases with bias detected over +20 tests ./browser_uk_test_script.pl : MT-rand (using R) -> number of bias dete +cted: 0 / 20 (pvalues=,0.4869334,0.4961692,0.8251134,0.2657584,0.84692,0.12 +96479,0.504212,0.9028209,0.8082847,0.2999607,0.154672,0.1660518,0.514 +3663,0.8120685,0.4452244,0.6561128,0.6123136,0.6994308,0.9302561,0.47 +57345) ./browser_uk_test_script.pl : MT-rand (using Perl) -> number of bias d +etected: 0 / 20 (pvalues=,0.25,0.25,0.75,0.25,0.75,0.1,0.5,0.9,0.75,0.25,0.1,0 +.1,0.5,0.75,0.25,0.5,0.5,0.5,0.9,0.25) -------------------- ./browser_uk_test_script.pl : standard-rand (using R) -> number of bia +s detected: 2 / 20 (pvalues=,0.3855934,0.4056266,0.9905815,0.9134483,0.007583821, +0.374439,0.3795054,0.4661071,0.08190948,0.617658,0.9640541,0.9402085, +0.883329,0.517288,0.04139933,0.3008889,0.4068793,0.963351,0.8351011,0 +.7187029) ./browser_uk_test_script.pl : standard-rand (using Perl) -> number of +bias detected: 2 / 20 (pvalues=,0.25,0.25,0.99,0.9,0.01,0.25,0.25,0.25,0.05,0.5,0.95 +,0.9,0.75,0.5,0.01,0.25,0.25,0.95,0.75,0.5) -------------------- ./browser_uk_test_script.pl : bad-rand (using R) -> number of bias det +ected: 12 / 20 (pvalues=,0.0001945661,0.5148868,4.572803e-05,0.08857331,0.000 +2981317,0.1119708,0.09084117,0.005959205,0.2835337,5.089226e-08,0.000 +9614042,0.2044863,0.004434155,0.002847202,0.01542375,0.0105589,0.0026 +59498,0.1273001,0.07932075,4.602725e-05) ./browser_uk_test_script.pl : bad-rand (using Perl) -> number of bias +detected: 12 / 20 (pvalues=,0.01,0.5,0.01,0.05,0.01,0.1,0.05,0.01,0.25,0.01,0.01 +,0.1,0.01,0.01,0.01,0.01,0.01,0.1,0.05,0.01) --------------------

      My verdict

      Statistics::ChiSquared can be improved by also printing out a p-value (we accept bias only if the chi-square statistic is large and the p-value is below our confidence interval, say 0.05). However, its calculated statistic is more or less the same as with R's. And after my modification to return a p-value, its p-value is comparable with R's. See:

      do_chi_squared_test_in_R() : counts: 4317,4188,4099,4231,4258,4195,407 +6,4330,4135,4143,4152,4191,3995,4176,4113,4088,4171,4132,4115,4154,41 +65,4140,4246,4190 : x2=30.8192, pv=0.1273001 do_chi_squared_test_in_perl() : counts: 4317,4188,4099,4231,4258,4195, +4076,4330,4135,4143,4152,4191,3995,4176,4113,4088,4171,4132,4115,4154 +,4165,4140,4246,4190 : x2=30.8192, pv=0.1 do_chi_squared_test_in_R() : counts: 4032,4106,4154,4136,4234,4065,416 +6,4039,4279,4210,4031,4171,4279,4291,4132,4190,4281,4129,4179,4128,41 +88,4172,4175,4233 : x2=33.10112, pv=0.07932075 do_chi_squared_test_in_perl() : counts: 4032,4106,4154,4136,4234,4065, +4166,4039,4279,4210,4031,4171,4279,4291,4132,4190,4281,4129,4179,4128 +,4188,4172,4175,4233 : x2=33.10112, pv=0.05</pre>

      Exact same statistic (x2), comparable p-values.

      So, that tells me that Fisher-Yates shuffle, for this particular problem, is susceptible to RNG's sequence: different runs (same array, different seed) produce sometimes sane and sometimes bogus(biased) results. Anything but bad-rand (e.g. sub { rand() < 0.5 ? 0 : rand( $_[0] ) } should give a 0,5,10% chance of biased output. Bad-rand gives 50% chance more-or-less. However, there is not much difference between perl's rand() and MersenneT (for this particular use-case).

      That concludes my part in this long tangent and also settles that TODO of mine which so much bugged you :) and for good reason and good results and good fun I say.

      Final word of warning: this investigation is for only this particular problem BrowserUK came up with. Other use cases may be different.

      The final script to do this (NOTE: you need to change first 2 'my' to 'our' in Statistics/ChiSquare.pm):

        However, its calculated statistic is more or less the same as with R's. And after my modification to return a p-value, its p-value is comparable with R's.

        Hm. If that is the case(*) then given the complete instability of S::CS' verdict on F-Y with MT over successive runs on the same dataset and the same number of iterations, then I'd have to conclude that Chi2 isn't suitable for testing this algorithm and/or data.

        It would be interesting to see how consistent R's results when running the same data/iterations. If that also produced unstable results, that would be the proof it's the wrong test.

        I might look into trying to apply Fisher's Exact test to the problem tomorrow and see what that produces for F-Y with MT.

        I also have an idea about generating artificial datasets and feeding them to S::CS to determine the how big/small the differences are that cause it to switch from <1%, to 5% to 50% to >75% etc.; and then try to reason what affect those changes would have upon the 3 scenarios you postulated in your first post.

        Oh, and thank you for providing some interesting diversion for a boring Sunday:)

        *I'm not at all convinced that these two set of values are "more or less the same":

        (pvalues=,0.4869334,0.4961692,0.8251134,0.2657584,0.84692,0.1296479,0. +504212,0.9028209,0.8082847,0.2999607,0.154672,0.1660518,0.5143663,0.8 +120685,0.4452244,0.6561128,0.6123136,0.6994308,0.9302561,0.4757345) pvalues=,0.25,0.25,0.75,0.25,0.75,0.1,0.5,0.9,0.75,0.25,0.1,0.1,0.5,0. +75,0.25,0.5,0.5,0.5,0.9,0.25)
        1. The values from R have a much finer granularity; which if they come from the same dataset suggests some clumsy rounding or similar is going on in S:CS.
        2. In a rang of 0-1, I don't find 0.25 to be "more or less the same" as 0.4869 or 0.,4961.

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
        In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
Re^6: Shuffling CODONS
by BrowserUk (Patriarch) on Jun 10, 2018 at 12:46 UTC
    I will report back chi-squared results using R

    It will be interesting to see the results from a known good source.

    Because I think that S::CS is (fatally) flawed. To get some feel for the accuracy of the test it performs, I decided to run it on the shuffle using the known good MT PRNG and a small dataset (1..4) a good number of times to see how consistent the results S::CS were; and the answer is not just "not very", but actually just "not":

    #! perl -slw use strict; use Statistics::ChiSquare qw[ chisquare ]; use Math::Random::MT; use Data::Dump qw[ pp ]; my $mt = Math::Random::MT->new(); our $N //= 1e6; our $ASIZE //= 4; our $T //= 4; sub shuffle { $a = $_ + $mt->rand( @_ - $_ ), $b = $_[$_], $_[$_] = $_[$a], $_[$a] = $b for 0 .. $#_; return @_; } my @data = ( 1 .. $ASIZE ); my @chi; for( 1 .. $T ) { my %tests; ++$tests{ join '', shuffle( @data ) } for 1 .. $N; print chisquare( values %tests ); } __END__ C:\test>chiSquareChiSquare -ASIZE=4 -N=1e4 -T=100 There's a >25% chance, and a <50% chance, that this data is random. There's a >50% chance, and a <75% chance, that this data is random. There's a >10% chance, and a <25% chance, that this data is random. There's a >10% chance, and a <25% chance, that this data is random. There's a >50% chance, and a <75% chance, that this data is random. There's a >10% chance, and a <25% chance, that this data is random. There's a >25% chance, and a <50% chance, that this data is random. There's a >75% chance, and a <90% chance, that this data is random. There's a >25% chance, and a <50% chance, that this data is random. There's a >50% chance, and a <75% chance, that this data is random. There's a >5% chance, and a <10% chance, that this data is random. There's a >75% chance, and a <90% chance, that this data is random. There's a >10% chance, and a <25% chance, that this data is random. There's a >50% chance, and a <75% chance, that this data is random. There's a >75% chance, and a <90% chance, that this data is random. There's a >75% chance, and a <90% chance, that this data is random. There's a >50% chance, and a <75% chance, that this data is random. There's a >50% chance, and a <75% chance, that this data is random. There's a >5% chance, and a <10% chance, that this data is random. There's a >95% chance, and a <99% chance, that this data is random. There's a >1% chance, and a <5% chance, that this data is random.

    79 more utterly inconsistent results:

    Given this is a known good algorithm using a known good PRNG, all in all, and as I said earlier, I think that is as good a definition of random as I've seen a module produce as its results.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1216316]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (5)
As of 2025-11-10 22:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    What's your view on AI coding assistants?





    Results (67 votes). Check out past polls.

    Notices?
    hippoepoptai's answer Re: how do I set a cookie and redirect was blessed by hippo!
    erzuuliAnonymous Monks are no longer allowed to use Super Search, due to an excessive use of this resource by robots.