CUFP
demerphq
snippet
<div class="Description">These two snippets implement Algortihm S(3.4.2) and Algortihm R(3.4.2) from Knuth's Art Of programming.<p>
The first randomly selects N items from an array of elements, and returns a reference to an array containing the elements. Note that it will not necessarily consider all of the elements in the list.<p>
The second randomly selects N items from a file <b>of indeterminate size</b> and returns an array containing the selected elements. Records in the file are assumed to be per line, and the lines are chomped while reading. This requires only 1 pass through the list. A slight modification can be made to use the snippet in situations where N records would exceed memory limitations, however this requires slightly more than 1 pass (/msg if you need this explained)<p>
<b><i>The usual caveats about randomness in computers in general and the randomness of perls rand() function apply to these routines.</b></i><p>
<b>NOTE</b> The routines are not as efficient as they could be. Instead of considering a record and determining if the record should be included in the result set, they could be modified to determine how many elements to skip before including a record. That modification is left to the reader.<p>
Any errors are almost certainly in my interpretation of Knuths algortihm, and not in the actual technique.<p>
<strict>UPDATE</strict>And im so glad I put in the above caveat. ;-) Anonymonk is right, I posted an incorrect version first time. Fixed.<p>
[Demerphq]</div>
<CODE>
# From Knuth Art of Programming
# Algortihm S(3.4.2)
# Select n records at random from a set of N records where
# 0<n<=N
#
# This algortihm is only useful when N is known in advance.
# If n=2 then the average number of elements considered is
# 2/3*N. the General formula is (N+1)n/(n+1)
# Its possible to optimise this even more.
#
# In this case we use $array a reference to an array of items
# and $num for the number of elements we want, we return an
# array of elements
#
# It should be remembered that this will be as random as
# the random number generator being used.
sub selection_sample {
my ($array,$num)=@_;
die "Too few elements (".scalar(@$array).") to select $num from\n"
unless $num<@$array;
my @result;
my $pos=0;
while (@result<$num) {
$pos++ while (rand(@$array-$pos)>($num-@result));
push @result,$array->[$pos++];
}
return \@result
}
# From Knuth Art of Programming
# Algortihm R(3.4.2)
# first argument is a filehandle. Second argument is the desired
# number of records in the sample
# Will die if there are insufficeient records in the file.
# Returns a reference to an array of the selected records.
sub reservoir_sample {
my ($file,$num)=@_;
my @buffer;
while ( <$file> ) {
chomp;
push @buffer,$_;
last if @buffer==$num;
}
die "Insufficient records\n"
if @buffer<$num;
my $pos=@buffer;
while ( <$file> ) {
$pos++;
my $rand=rand($pos);
if ($rand<@buffer) {
chomp;
$buffer[$rand]=$_;
}
}
return \@buffer;
}
</CODE>