The first randomly selects N items from an array of elements, and returns a reference to an array containing the elements. Note that it will not necessarily consider all of the elements in the list.
The second randomly selects N items from a file of indeterminate size and returns an array containing the selected elements. Records in the file are assumed to be per line, and the lines are chomped while reading. This requires only 1 pass through the list. A slight modification can be made to use the snippet in situations where N records would exceed memory limitations, however this requires slightly more than 1 pass (/msg if you need this explained)
The usual caveats about randomness in computers in general and the randomness of perls rand() function apply to these routines.
NOTE The routines are not as efficient as they could be. Instead of considering a record and determining if the record should be included in the result set, they could be modified to determine how many elements to skip before including a record. That modification is left to the reader.
Any errors are almost certainly in my interpretation of Knuths algortihm, and not in the actual technique.
<strict>UPDATE</strict>And im so glad I put in the above caveat. ;-) Anonymonk is right, I posted an incorrect version first time. Fixed.
# From Knuth Art of Programming # Algortihm S(3.4.2) # Select n records at random from a set of N records where # 0<n<=N # # This algortihm is only useful when N is known in advance. # If n=2 then the average number of elements considered is # 2/3*N. the General formula is (N+1)n/(n+1) # Its possible to optimise this even more. # # In this case we use $array a reference to an array of items # and $num for the number of elements we want, we return an # array of elements # # It should be remembered that this will be as random as # the random number generator being used. sub selection_sample { my ($array,$num)=@_; die "Too few elements (".scalar(@$array).") to select $num from\n" unless $num<@$array; my @result; my $pos=0; while (@result<$num) { $pos++ while (rand(@$array-$pos)>($num-@result)); push @result,$array->[$pos++]; } return \@result } # From Knuth Art of Programming # Algortihm R(3.4.2) # first argument is a filehandle. Second argument is the desired # number of records in the sample # Will die if there are insufficeient records in the file. # Returns a reference to an array of the selected records. sub reservoir_sample { my ($file,$num)=@_; my @buffer; while ( <$file> ) { chomp; push @buffer,$_; last if @buffer==$num; } die "Insufficient records\n" if @buffer<$num; my $pos=@buffer; while ( <$file> ) { $pos++; my $rand=rand($pos); if ($rand<@buffer) { chomp; $buffer[$rand]=$_; } } return \@buffer; }
|
---|
Replies are listed 'Best First'. | |
---|---|
Re: Random Sampling
by ferrency (Deacon) on Jun 25, 2002 at 20:18 UTC | |
by demerphq (Chancellor) on Jun 26, 2002 at 11:15 UTC | |
Re: Random Sampling
by Anonymous Monk on Jun 25, 2002 at 20:22 UTC | |
by demerphq (Chancellor) on Jun 26, 2002 at 17:02 UTC |