Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
These two snippets implement Algortihm S(3.4.2) and Algortihm R(3.4.2) from Knuth's Art Of programming.

The first randomly selects N items from an array of elements, and returns a reference to an array containing the elements. Note that it will not necessarily consider all of the elements in the list.

The second randomly selects N items from a file of indeterminate size and returns an array containing the selected elements. Records in the file are assumed to be per line, and the lines are chomped while reading. This requires only 1 pass through the list. A slight modification can be made to use the snippet in situations where N records would exceed memory limitations, however this requires slightly more than 1 pass (/msg if you need this explained)

The usual caveats about randomness in computers in general and the randomness of perls rand() function apply to these routines.

NOTE The routines are not as efficient as they could be. Instead of considering a record and determining if the record should be included in the result set, they could be modified to determine how many elements to skip before including a record. That modification is left to the reader.

Any errors are almost certainly in my interpretation of Knuths algortihm, and not in the actual technique.

<strict>UPDATE</strict>And im so glad I put in the above caveat. ;-) Anonymonk is right, I posted an incorrect version first time. Fixed.

Demerphq

# From Knuth Art of Programming # Algortihm S(3.4.2) # Select n records at random from a set of N records where # 0<n<=N # # This algortihm is only useful when N is known in advance. # If n=2 then the average number of elements considered is # 2/3*N. the General formula is (N+1)n/(n+1) # Its possible to optimise this even more. # # In this case we use $array a reference to an array of items # and $num for the number of elements we want, we return an # array of elements # # It should be remembered that this will be as random as # the random number generator being used. sub selection_sample { my ($array,$num)=@_; die "Too few elements (".scalar(@$array).") to select $num from\n" unless $num<@$array; my @result; my $pos=0; while (@result<$num) { $pos++ while (rand(@$array-$pos)>($num-@result)); push @result,$array->[$pos++]; } return \@result } # From Knuth Art of Programming # Algortihm R(3.4.2) # first argument is a filehandle. Second argument is the desired # number of records in the sample # Will die if there are insufficeient records in the file. # Returns a reference to an array of the selected records. sub reservoir_sample { my ($file,$num)=@_; my @buffer; while ( <$file> ) { chomp; push @buffer,$_; last if @buffer==$num; } die "Insufficient records\n" if @buffer<$num; my $pos=@buffer; while ( <$file> ) { $pos++; my $rand=rand($pos); if ($rand<@buffer) { chomp; $buffer[$rand]=$_; } } return \@buffer; }

In reply to Random Sampling by demerphq

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (5)
As of 2024-03-19 09:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found