PerlMonks  

Re^5: About List::Util's pure Perl shuffle()

by BrowserUk (Pope)
on Jul 12, 2007 at 16:08 UTC (#626269)


in reply to Re^4: About List::Util's pure Perl shuffle()
in thread About List::Util's pure Perl shuffle()

More and more interesting. But then, since all the other subs are (@), why not simply the following?

blazar++. Just checking if anyone is still paying attention :)

Since bukNew() doesn't need to copy the input list--because it shuffles, and therefore mutates, a list of indices generated internally, rather than mutating a copy of the input--it seems silly to pass arrays in by value, forcing their duplication. Most uses of shuffle() are applied to pre-existing arrays. That's why we make a copy of them, or create a list of aliases to them, in most versions--simply to prevent the shuffle from modifying the external arrays. So, instead of

    my @array = ...;
    ...
    my @shuffled = shuffle @array;

we can avoid some copying by using

    my @array = ...;
    ...
    my @shuffled = shuffle \@array;

And when we want to shuffle a list, instead of

my @shuffled = shuffle ... some expression generating a list ...;

we use

my @shuffled = shuffle [ ... some expression generating a list ... ];
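To make the two calling styles above concrete, here is a minimal sketch of a shuffle that takes an array reference. The name shuffle_ref is hypothetical and the body is a plain Fisher-Yates over an index list, not any of the routines benchmarked in this thread; it only illustrates that the caller's array is read but never modified or copied.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): shuffle via an array reference.
sub shuffle_ref {
    my ($ref) = @_;
    my @idx = 0 .. $#$ref;                 # only the indices are duplicated
    for my $i ( reverse 1 .. $#idx ) {     # Fisher-Yates on the index list
        my $j = int rand( $i + 1 );
        @idx[ $i, $j ] = @idx[ $j, $i ];
    }
    return @{$ref}[@idx];                  # slice the original in shuffled order
}

my @array     = map { "item$_" } 1 .. 5;
my @before    = @array;
my @shuffled  = shuffle_ref( \@array );                      # pre-existing array
my @from_list = shuffle_ref( [ map { $_ * 2 } 1 .. 5 ] );    # list expression

print "@shuffled\n@from_list\n";
```

The anonymous array in the second call still builds a list once, but that is the only copy made.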

Inspired by blokhead's version, I went back to basics and tried benchmarking a simple version that took a reference rather than a list: bukNew(), with quite surprising results.

Of course, the same lesson can now be retrofitted to those other routines that don't need to replicate the input list for their operation. In particular, blokhead's version!

And here are the headline results of doing that (as blokhead_ref):

Notice that the length of the strings is not a factor for blokhead(_ref) or bukNew!

10 strings length 10
                 Rate  naive listutil blokhead_ref  buk blokhead bukNew
naive         20559/s     --     -44%         -69% -70%     -70%   -74%
listutil      36857/s    79%       --         -44% -47%     -47%   -53%
blokhead_ref  66375/s   223%      80%           --  -4%      -5%   -16%
buk           69370/s   237%      88%           5%   --      -0%   -12%
blokhead      69635/s   239%      89%           5%   0%       --   -12%
bukNew        79050/s   285%     114%          19%  14%      14%     --

10 strings length 1000
                 Rate  naive listutil blokhead_ref blokhead  buk bukNew
naive          1266/s     --     -60%         -98%     -98% -98%   -98%
listutil       3164/s   150%       --         -95%     -95% -95%   -96%
blokhead_ref  67220/s  5209%    2024%           --      -1%  -3%   -16%
blokhead      67878/s  5261%    2045%           1%       --  -2%   -15%
buk           69084/s  5356%    2083%           3%       2%   --   -14%
bukNew        80288/s  6241%    2437%          19%      18%  16%     --

100 strings length 1000
                 Rate  naive listutil  buk blokhead blokhead_ref bukNew
naive          93.7/s     --     -58% -99%     -99%         -99%   -99%
listutil        225/s   141%       -- -97%     -97%         -97%   -98%
buk            7540/s  7946%    3244%   --      -5%          -7%   -23%
blokhead       7919/s  8349%    3412%   5%       --          -2%   -19%
blokhead_ref   8105/s  8547%    3494%   7%       2%           --   -18%
bukNew         9831/s 10390%    4260%  30%      24%          21%     --

1000 strings length 1000
                 Rate  naive listutil  buk blokhead blokhead_ref bukNew
naive          8.16/s     --     -63% -99%     -99%         -99%   -99%
listutil       22.1/s   170%       -- -97%     -97%         -97%   -97%
buk             737/s  8932%    3240%   --      -8%         -10%   -11%
blokhead        804/s  9750%    3542%   9%       --          -2%    -3%
blokhead_ref    819/s  9927%    3608%  11%       2%           --    -1%
bukNew          826/s 10023%    3643%  12%       3%           1%     --

The significant thing is that blokhead_ref moves ahead of blokhead very quickly as the number of elements increases. Avoiding copying the array pays dividends very quickly.

However some random tests with various string and list lengths seem to show that it could be considerably slower than buk(), by even as much as 40% or so. Thus... are you sure your benchmark is not flawed?

If you look back over the thread, you'll see that there have been various instances where different people have seen different results from benchmarking apparently the same code.

Some of the differences are explained by whether the data being shuffled consists of all integer data (as in your original benchmark: 0..1000 etc.) or string data, as used by most people after ikegami noted the difference it makes. When an SV points to an IV, copying that SV is a faster operation than when it points to a PV. With the latter, a second memory allocation and a memcpy operation have to be performed to copy the string data pointed at by the PV. If the SV contains an integer (and probably a float?) and has never been used in a string context, then the number will never have been converted to its ASCII form and the PV will be null. That makes for significantly less work to copy it.
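The integer-versus-string point can be observed from Perl itself. The following is only a sketch: it uses the core B module to read a scalar's flags, the helper name has_string_form is made up for illustration, and checking the private SVp_POK flag is my assumption about the most portable way to detect a cached string form across perl versions.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: true if the scalar currently carries a string (PV) form.
sub has_string_form {
    my ($sv_ref) = @_;
    my $flags = B::svref_2object($sv_ref)->FLAGS;
    return ( $flags & B::SVp_POK() ) ? 1 : 0;
}

my $n = 42;
my $before  = has_string_form( \$n );    # so far only used as an integer
my $as_text = "$n";                      # first use in string context
my $after   = has_string_form( \$n );    # perl has now cached the string form

print "string form before: $before, after: $after\n";
```

Once the string form exists, a copy of the scalar has more to carry around, which fits the difference between the integer and string benchmarks discussed above.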

That doesn't explain all the anomalies seen above though, I think. Anyway, it's quite possible my benchmark is flawed--it certainly wouldn't be the first time :)--so here is the code. Tell me what you think.

The benchmark code:

    #!/usr/bin/perl -slw
    use strict;
    use List::Util qw[ shuffle ];
    use Benchmark qw/:all/;

    # Fisher-Yates on a full copy of the input list.
    sub naive (@) {
        my @l = @_;
        for ( reverse 1 .. $#l ) {
            my $r = int rand( $_ + 1 );
            @l[ $_, $r ] = @l[ $r, $_ ];
        }
        @l;
    }

    # List::Util's pure Perl shuffle: works on references to the arguments.
    sub listutil (@) {
        my @a = \(@_);
        my $n;
        my $i = @_;
        map {
            $n = rand( $i-- );
            ( ${ $a[$n] }, $a[$n] = $a[$i] )[0];
        } @_;
    }

    sub buk (@) {
        my @a = \(@_);
        my $n;
        my $i = @_;
        map +( $n = rand( $i-- ), ${ $a[$n] }, $a[$n] = $a[$i] )[1], @_;
    }

    # Takes an array reference; shuffles a list of indices, then slices.
    sub bukNew ($) {
        my ($ref) = @_;
        my @x = 0 .. $#$ref;
        @{$ref}[ map splice( @x, rand @x, 1 ), @x ];
    }

    #my %stats;
    #++$stats{ join' ', bukNew( ['A'..'D'] ) } for 1 .. 1e4;
    #print "$_ => $stats{ $_ }" for sort keys %stats;

    sub blokhead (@) {
        my @a = ( 0 .. $#_ );
        my $i = @_;
        my $n;
        map +( $n = rand( $i-- ), $_[ $a[$n] ], $a[$n] = $a[$i] )[1], @_;
    }

    # blokhead's algorithm, but taking an array reference.
    sub blokhead_ref ($) {
        my ($ref) = @_;
        my @a = ( 0 .. $#$ref );
        my $i = @$ref;
        my $n;
        map +( $n = rand( $i-- ), $ref->[ $a[$n] ], $a[$n] = $a[$i] )[1], @a;
    }

    #%stats = ();
    #++$stats{ join' ', blokhead_ref( ['A'..'D'] ) } for 1 .. 1e4;
    #print "$_ => $stats{ $_ }" for sort keys %stats;

    for my $c ( map { 10**$_ } 1 .. 3 ) {
        for my $l ( map { 10**$_ } 0.1, 1, 3 ) {
            print "\n$c strings length $l";
            our @test = map $_ x $l, 1 .. $c;
            cmpthese -3, {
                bukNew       => q[ bukNew( \@test ) ],
                # keyed as blokhead_ref so the map below does not overwrite it
                blokhead_ref => q[ blokhead_ref( \@test ) ],
                map { $_ => "$_ \@test" } qw/naive listutil blokhead buk/
            };
        }
    }
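The commented-out %stats lines in the benchmark are a quick fairness check: shuffle a four-element list many times and count how often each ordering appears. A self-contained sketch of the same idea follows. The helper name tally_permutations is hypothetical, and List::Util's shuffle stands in for the routine under test; any of the reference-taking subs above could be passed instead.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: count how often each ordering is produced.
sub tally_permutations {
    my ( $shuffler, $items, $trials ) = @_;
    my %stats;
    ++$stats{ join ' ', $shuffler->($items) } for 1 .. $trials;
    return \%stats;
}

# Wrap List::Util::shuffle so it takes an array reference, like bukNew().
my $stats = tally_permutations( sub { shuffle @{ $_[0] } }, [ 'A' .. 'D' ], 10_000 );

# Four items have 4! = 24 orderings, so a fair shuffle should give each
# ordering roughly 10_000 / 24 hits.
printf "%s => %d\n", $_, $stats->{$_} for sort keys %$stats;
```

A count that is far from the average, or an ordering that never shows up, would suggest a biased shuffle rather than a slow one.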

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^6: About List::Util's pure Perl shuffle()
by blazar (Canon) on Jul 13, 2007 at 13:29 UTC
    blazar++. Just checking if anyone is still paying attention :)

    Still paying attention and also advertising in clpmisc (link @ GG). So BrowserUk++ for the additional insight and the many details provided in this reply of yours.

    Since bukNew() doesn't need to copy the input list--because it shuffles, and therefore mutates, a list of indices generated internally, rather than mutating a copy of the input--it seems silly to pass arrays in by value, forcing their duplication. Most uses of shuffle() are applied to pre-existing arrays. That's why we make a copy of them, or create a list of aliases to them, in most versions--simply to prevent the shuffle from modifying the external arrays.

    Well, but then that is somewhat obvious, and even the naive implementation would be better suited to shuffling an array in place, and would thus better be rewritten like this:

        sub naive (@) {
            for ( reverse 1 .. $#_ ) {
                my $r = int rand( $_ + 1 );
                @_[ $_, $r ] = @_[ $r, $_ ];
            }
        }

    (Not that this would make it really faster - I checked and it's still considerably slower than the alternatives presented here.)
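The in-place variant relies on @_ aliasing the caller's elements, so assigning through @_ reorders the original array. A small usage sketch, with the same loop body under the hypothetical name shuffle_in_place:

```perl
use strict;
use warnings;

# Same loop as the in-place naive() above, under a hypothetical name.
sub shuffle_in_place {
    for ( reverse 1 .. $#_ ) {
        my $r = int rand( $_ + 1 );
        @_[ $_, $r ] = @_[ $r, $_ ];    # @_ aliases the caller's elements
    }
    return;
}

my @deck   = 1 .. 10;
my @sorted = @deck;

shuffle_in_place(@deck);    # @deck itself is reordered; nothing is returned
print "@deck\n";
```

This only makes sense when the argument is a real array; passing a literal list should fail, since its elements are read-only constants.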

    But then I think that in any case you would be comparing apples and bananas: it was somewhat clear that the interface was in terms of "accept a list, return a (shuffled) list"... unless the interface itself was also part of the choice to make. OTOH your tests themselves show that blokhead's solution performs best on large datasets, which is exactly where speed is more likely to matter (well, not strictly, since one may have 10^6 shuffles of 10-element lists to do...) - and it has a more intuitive interface, so I would go for it. More precisely, I have "decided" that my favourite version is the following modification of blokhead() that can be considered a "hybrid" with another version of yours:

        sub bzr (@) {
            my @a = 0 .. $#_;
            my $i = @_;
            my $n;
            map +( $_[ $a[ $n = rand( $i-- ) ] ], $a[$n] = $a[$i] )[0], @_;
        }

    I must say that I've included it in the benchmarks too and while it doesn't perform well compared to blokhead(), buk() and bukNew() in the short list tests, it is comparable with the best on the 1000 strings one, and in two of them it even performed best of all, although I'd rather consider this to be a fluctuation. Why this is so, anyway, is beyond my comprehension.

    In the meantime, I received a followup from Peter J. Holzer in clpmisc, which addresses my own post (a copy of the root node here), and I'm reporting it hereafter in its entirety for completeness.


    Peter J. Holzer's reply
