http://www.perlmonks.org?node_id=177092

These two snippets implement Algorithm S (3.4.2) and Algorithm R (3.4.2) from Knuth's The Art of Computer Programming.

The first randomly selects N items from an array of elements and returns a reference to an array containing the selected elements. Note that it will not necessarily consider all of the elements in the list.

The second randomly selects N items from a file of indeterminate size and returns a reference to an array containing the selected records. Records in the file are assumed to be one per line, and the lines are chomped while reading. This requires only one pass through the file. A slight modification can be made to use the snippet in situations where N records would exceed memory limitations; however, this requires slightly more than one pass (/msg me if you need this explained).

The usual caveats about randomness in computers in general, and about the randomness of Perl's rand() function in particular, apply to these routines.

NOTE: The routines are not as efficient as they could be. Instead of considering each record and deciding whether it should be included in the result set, they could be modified to determine how many elements to skip before including a record, as sketched just below. The full treatment of that modification is left to the reader.
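
For the curious, here is roughly what that skip-based variant might look like. This is only a sketch along the lines of Vitter's "Method A" rather than Knuth's own formulation, and the name selection_sample_skip is mine. It draws one random number per selected record (instead of one per record considered) and uses it to decide how many records to skip:

sub selection_sample_skip {
    my ($array, $n) = @_;
    my $N = @$array;
    die "Too few elements ($N) to select $n from\n" unless $n <= $N;
    my @result;
    my $pos = 0;
    while ($n > 0) {
        # Decide how many records to skip before the next selection:
        # P(skip >= s+1) = product of (N-n-i)/(N-i) for i = 0..s.
        my $V    = rand();
        my $S    = 0;
        my $top  = $N - $n;
        my $quot = $top / $N;
        while ($quot > $V) {
            $S++;
            $top--;
            $N--;
            $quot *= $top / $N;
        }
        $pos += $S;                      # skip $S records outright
        push @result, $array->[$pos++];  # select the next one
        $N--;                            # the selected record is consumed too
        $n--;                            # one fewer record still wanted
    }
    return \@result;
}

The sample this produces has the same distribution as Algorithm S; only the number of calls to rand() changes.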

Any errors are almost certainly in my interpretation of Knuth's algorithms, and not in the actual technique.

UPDATE: And I'm so glad I put in the above caveat. ;-) Anonymonk is right, I posted an incorrect version the first time. Fixed.

Demerphq

# From Knuth, The Art of Computer Programming
# Algorithm S (3.4.2)
# Select n records at random from a set of N records where
# 0 < n <= N
#
# This algorithm is only useful when N is known in advance.
# If n=2 then the average number of elements considered is
# roughly 2/3*N; the general formula is (N+1)n/(n+1).
# It's possible to optimise this even more.
#
# In this case we use $array, a reference to an array of items,
# and $num for the number of elements we want; we return a
# reference to an array of the selected elements.
#
# It should be remembered that this will be as random as
# the random number generator being used.
sub selection_sample {
    my ($array, $num) = @_;
    die "Too few elements (" . scalar(@$array) . ") to select $num from\n"
        unless $num <= @$array;
    my @result;
    my $pos = 0;
    while (@result < $num) {
        # Skip forward while the current record fails the selection test;
        # a record is taken with probability (still needed)/(still unconsidered).
        $pos++ while (rand(@$array - $pos) > ($num - @result));
        push @result, $array->[$pos++];
    }
    return \@result;
}

# From Knuth, The Art of Computer Programming
# Algorithm R (3.4.2)
# First argument is a filehandle. Second argument is the desired
# number of records in the sample.
# Will die if there are insufficient records in the file.
# Returns a reference to an array of the selected records.
sub reservoir_sample {
    my ($file, $num) = @_;
    my @buffer;
    # Fill the reservoir with the first $num records.
    while (<$file>) {
        chomp;
        push @buffer, $_;
        last if @buffer == $num;
    }
    die "Insufficient records\n" if @buffer < $num;
    my $pos = @buffer;
    # Record number $pos replaces a random reservoir slot
    # with probability $num/$pos.
    while (<$file>) {
        $pos++;
        my $rand = rand($pos);
        if ($rand < @buffer) {
            chomp;
            $buffer[$rand] = $_;    # fractional index truncates to an int
        }
    }
    return \@buffer;
}
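
To make the calling conventions concrete, here is a minimal usage sketch (the file name records.txt is made up for illustration):

my @items  = ('a' .. 'z');
my $sample = selection_sample(\@items, 5);
print "@$sample\n";          # five distinct letters, in their original order

open my $fh, '<', 'records.txt' or die "Cannot open records.txt: $!";
my $lines = reservoir_sample($fh, 10);
close $fh;
print "$_\n" for @$lines;    # ten lines drawn uniformly from the file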

Re: Random Sampling
by ferrency (Deacon) on Jun 25, 2002 at 20:18 UTC
    Those are very interesting algorithms, which I was not familiar with. Thank you very much for implementing them.

    However, no offense to Mr. Knuth or his Algorithm S, but selecting n unique elements from an array of N elements only needs to consider n elements. The solution below is based on a linear shuffle algorithm I saw on perlmonks. Instead of shuffling all N elements of the array, I shuffle only the first n elements and return those.

    The code:

    sub selection_sample {
        my ($array, $n) = @_;
        my @result;
        die "Too few elements (" . scalar @$array . ") to select $n from\n"
            unless $n <= @$array;
        my %i;
        for (0 .. $n - 1) {
            # Pick a random index from the not-yet-shuffled tail $_ .. $#$array.
            my $r = int($_ + rand(@$array - $_));
            # %i maps an index to the index whose value virtually occupies it.
            push @result, $array->[defined($i{$r}) ? $i{$r} : $r];
            $i{$r} = defined($i{$_}) ? $i{$_} : $_;
        }
        return \@result;
    }
    The basic operation of the linear shuffle is: For each element in the array, randomly choose any element from this element up to the end of the array, and swap the two elements. Please see the original linear shuffle code for a clear demonstration of how to shuffle an entire array.
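
    For reference, a minimal in-place version of the full shuffle might look like this (a sketch of the standard Fisher-Yates loop, not the exact code from the node I saw):

    sub shuffle_in_place {
        my ($array) = @_;
        for my $i (0 .. $#$array - 1) {
            # Choose a random index from $i through the end, then swap.
            my $j = $i + int rand(@$array - $i);
            @$array[$i, $j] = @$array[$j, $i];
        }
    }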

    In the code above, I use a shadow hash %i to store the indices of the shuffled elements of $array. I do this so I can choose a random set from $array without changing the contents of $array, and without copying the entire array. I use a hash %i and defined() tests instead of having to initialize an entire array @i with (0..@$array-1).
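
    A tiny worked trace may help (the random draws are assumed values; $n = 2 and @$array has 5 elements):

    # $_ = 0: suppose $r = 3. %i is empty, so push $array->[3]; set $i{3} = 0.
    # $_ = 1: suppose $r = 3 again. $i{3} is 0, so push $array->[0]; set $i{3} = 1.
    # The value at index 3 was "swapped" down to slot 0 without touching @$array.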

    I'm not sure of the original requirements of the algorithm, but even using a shadow hash instead of copying the array, my code does use more memory than Knuth's S algorithm. This may not be acceptable in all cases, but if it isn't, then you probably shouldn't be using perl :)

    Depending on your requirements, it may be faster or more efficient to chop out the optimizations, which only make sense when $array is large compared to $n (and/or when you don't want the array to actually be shuffled).

    Thanks for the good and inspiring post!

    Alan

    Update: Ah, your point is well taken: In the context of any data storage which allows only linear access to the records, Knuth's algorithm is definitely the better solution. I'm sorry I seem to have missed this point.

    I do very much like the second algorithm you presented, probably because I didn't miss the point of that one :) I think that most of the time that would be the more appropriate algorithm for my uses when I'm dealing with file access. "Select N random lines from a file" seems like a common task which this would solve efficiently. Most of the time I'm not going to know how many lines I have in the file without counting, which defeats the purpose.
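
    For what it's worth, the classic pick-one-random-line idiom (I believe it appears in perlfaq5) is just the N=1 special case of your Algorithm R code:

    my $line;
    # $fh is an open filehandle (assumed). Line number $. replaces
    # the current pick with probability 1/$.
    rand($.) < 1 && ($line = $_) while <$fh>;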

      Well, to be honest I think your criticism should be aimed more at the way I presented the algorithm (mostly as "proof of concept") than at the algorithm itself.

      For instance, it would work just fine for picking N elements from a file of known but arbitrarily large size, regardless of whether the records are fixed-length or not, and regardless of media restrictions (for instance, it would work fine for data stored on a tape).

      Whereas the algorithm you mention (and thanks :-) needs to have the full set in memory at one time, or else efficient random access to the records as stored on some form of fixed media (which AFAICT would require fixed record lengths).

      I realize this criticism applies to my implementation as well, since I used an array. Clearly I shouldn't have, as it distracts from the point I was making. :-)

      Thanks for the comments though; they added value to the thread for sure.

      Update
      Don't be sorry, I should have explained in more detail. You know the old saying about how

      Ass u me

      is a bad thing...

      ;-)

      Yves / DeMerphq
      ---
      Writing a good benchmark isn't as easy as it might look.

Re: Random Sampling
by Anonymous Monk on Jun 25, 2002 at 20:22 UTC
    Your selection_sample routine is not coded correctly. The idea is to randomly select a sample of n elements as you go; there should not be duplicates in the sample. Change your while loop to the following for a more faithful representation of Knuth's algorithm.
    while ( @results < $num ) {
        if ( (@$array - $pos) * rand() < $num - @results ) {
            push @results, $array->[$pos];
        }
        $pos++;
    }
      Hmm, while you are correct that my code was incorrect, I went with a different fix. My fix was simply to change the line
      push @results, $array->[$pos];
      to
      push @results, $array->[$pos++];
      (The post-increment advances $pos past a record once it has been selected, which is what prevents the duplicates.) But thanks anyway. /me should have tested once or twice more. *sigh*

      Yves / DeMerphq
      ---
      Writing a good benchmark isn't as easy as it might look.