Re: Random Picks

[Edited to explain the theory more completely.]

Lifted from the Perl Cookbook recipe 8.6 'Picking a Ranom Line from a File' generalized out to N lines rather than 1. Most deep thought comes from Messrs. Christiansen and Torkington. All errors come from me.

# Call srand if you are on an old perl (< 5.004)

my $statement = 'SELECT ssn FROM ...';
my $sth = $dbh->prepare($statement);
$sth->execute();

my $target_lines = 1000;
my @result = ();
my $item_num = 1;
while (my ($ssn) = $sth->fetchrow_array()) {
    if (@result < $target_lines) {
        # Just add it if we have not hit the limit
        push @result, $ssn;
    }
    elsif (rand($item_num) < $target_lines) {
        # Otherwise pick something to replace if the
        # the odds are right
        $result[rand($target_lines)] = $ssn;
    }
} continue {
    $count++;
}
[download]

You can look in the Cookbook for the full description of why this works for a single target_line. The short form is that it will first fill the list with $target_lines elements, then each subsequent line will have a slightly lessened chance of replacing a predecessor. Basically every element in the array should have the same probability of being there at a given point in time. So until the 1001st element each item has a 100% chance of being there. Then the 1001st element has a 1001/1000 chance of replacing something, then there is a 1/1000 chance for each element. Adding that all up it means each element will have a 1001/1000 chance of staying. Then for the 1002nd element, it has a 1002/1000 chance of replacing an element. Again that all adds up to 1002/1000 (1/1000 + 1001/1000) for each element.

The standard caveats about the randomness of the random numbers returned from random apply.

This solution is good if you don't want to slurp all elements in from the DB and if you have to use standard SQL (i.e. no limit clause).

-ben

In Section Seekers of Perl Wisdom