PerlMonks  

Re: Randomize lines with limited memory (Roll your own...)

by BrowserUk (Pope)
on Nov 02, 2003 at 02:30 UTC (#303866)


in reply to Randomize lines with limited memory

...or use mine:)

This is one of those cases where rolling your own has benefits. The FAQ, the Cookbook and the List::Util version of the Fisher-Yates shuffle all use copy semantics. This means that you need over double the space required to store the data in order to shuffle it.
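A minimal sketch of what copy semantics means in practice, assuming only that List::Util is installed (the variable names are illustrative). The original array is untouched and the shuffled result lives in a second array, so both are held in memory at once:

```perl
use strict;
use warnings;
use List::Util qw( shuffle );

my @original = ( 1 .. 10 );

# shuffle() takes a list and returns a new, shuffled list;
# the caller's array is left exactly as it was.
my @shuffled = shuffle @original;

# Both arrays are alive here, so the data is stored twice.
print "original still in order: @original\n";
print "shuffled copy has ", scalar(@shuffled), " elements\n";
```

With ten elements the cost is invisible; with a couple of million it becomes the dominant factor.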

My version does an in-place shuffle, the benefits of which really show up when you start shuffling huge arrays. The following results are a comparison between my pure-Perl, in-place shuffle and the List::Util XS version.

    P:\test>test -N=2000000
    Pre-allocation           2696 kb   (Memory use noted manually)
    Pre-inplace shuffle     43068 kb
    Post-inplace shuffle    43076 kb
    Post-XS_copy shuffle    91192 kb
    1 trial of Inplace (41.760s total)
    1 trial of Copied (48.400s total)

The results show that my in-place version consumes just 8 kb of extra RAM to perform the shuffle, and takes about 14% less time to do it than the XS version (41.76s against 48.40s). The XS version only takes around 15 seconds to actually perform the shuffle, but its copy semantics mean it loses this performance advantage to the need to allocate double the space, and it ends up considerably slower.
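For anyone who wants to check the arithmetic, here is a small sketch that works out the deltas from the figures in the run above. The hash keys are my own labels, not anything the benchmark prints:

```perl
use strict;
use warnings;

# Figures copied from the run above (memory in kb, time in seconds).
my %kb = (
    pre_alloc    => 2696,
    pre_inplace  => 43068,
    post_inplace => 43076,
    post_copy    => 91192,
);
my %secs = ( inplace => 41.760, copied => 48.400 );

my $alloc_kb   = $kb{pre_inplace}  - $kb{pre_alloc};       # building the array
my $inplace_kb = $kb{post_inplace} - $kb{pre_inplace};     # in-place shuffle
my $copy_kb    = $kb{post_copy}    - $kb{post_inplace};    # copying shuffle
my $saved_pct  = 100 * ( $secs{copied} - $secs{inplace} ) / $secs{copied};

printf "array allocation: %d kb\n",       $alloc_kb;      # 40372 kb
printf "in-place shuffle: %d kb extra\n", $inplace_kb;    # 8 kb
printf "copying shuffle:  %d kb extra\n", $copy_kb;       # 48116 kb
printf "time saved:       %.1f%%\n",      $saved_pct;     # 13.7%
```

The copying shuffle adds more memory than the original array took to build, which is where the "over double" comes from.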

It wouldn't be that hard (if you are an accomplished XS programmer) to re-cast the List::Util version to detect that it was being given an array reference and was being called in a void context, and switch to an in-place algorithm. Some crude tests seem to show that this would not only halve the memory usage but, as a result, would cut the overall shuffle time to less than a third.
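To show the dispatch idea only, here is a pure-Perl sketch. shuffle_smart is a hypothetical name, not part of List::Util, and a pure-Perl wrapper will not give the speed an XS rewrite would. It assumes the caller passes exactly one array reference and discards the return value when the in-place behaviour is wanted:

```perl
use strict;
use warnings;
use List::Util ();

# Hypothetical wrapper: in place for (void context + one array ref),
# ordinary copy semantics for everything else.
sub shuffle_smart {
    if ( !defined(wantarray) and @_ == 1 and ref( $_[0] ) eq 'ARRAY' ) {
        my $aref = $_[0];
        for my $y ( 0 .. $#{$aref} ) {
            # Pick a random index between $y and the end of the array.
            my $x = $y + int rand( @{$aref} - $y );
            @{$aref}[ $y, $x ] = @{$aref}[ $x, $y ];
        }
        return;
    }
    return List::Util::shuffle(@_);
}

my @data = ( 1 .. 20 );
shuffle_smart( \@data );            # void context: @data is reordered in place
my @copy = shuffle_smart(@data);    # list context: @data untouched, copy returned
```

The check on wantarray is what makes this safe for existing callers: anyone who uses the return value still gets the copying behaviour.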

The benchmark (you'll need to use an external tool to measure the memory usage):

    #! perl -slw
    use strict;
    use List::Util qw[ shuffle ];
    use Benchmark::Timer;

    # -s on the #! line lets -N=... on the command line set $N
    our $N ||= 1_000_000;

    # In-place Fisher-Yates shuffle; the (\@) prototype passes the array by reference
    sub my_shuffle (\@) {
        my( $aref, $x ) = shift;
        for my $y ( 0 .. $#{ $aref } ) {
            # Random index from $y to the end; truncated to an integer when used as a subscript
            $x = $y + rand( @{ $aref } - $y );
            @$aref[ $y, $x ] = @$aref[ $x, $y ];
        }
    }

    my $timer = new Benchmark::Timer;

    my @array;

    # Each <STDIN> pauses the script so memory use can be read off externally
    print 'Pre-allocation'; <STDIN>;

    push @array, $_ for 1 .. $N;

    print 'Pre-inplace shuffle'; <STDIN>;

    $timer->start('Inplace');
    my_shuffle @array;
    $timer->stop('Inplace');

    print 'Post-inplace shuffle'; <STDIN>;

    $timer->start('Copied');
    my @shuffled = shuffle @array;
    $timer->stop('Copied');

    print 'Post-XS_copy shuffle'; <STDIN>;

    $timer->report;

    __END__
    P:\test>test -N=2000000
    Pre-allocation
    Pre-inplace shuffle
    Post-inplace shuffle
    Post-XS_copy shuffle
    1 trial of Inplace (41.760s total)
    1 trial of Copied (48.400s total)

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail