PerlMonks  

Randomly select N lines from a file, on the fly

by blokhead (Monsignor)
on Oct 09, 2008 at 01:00 UTC (#716109, snippet)

Description: Many people know the trick from perlfaq about how to choose a line uniformly at random from a file (or pipe), without knowing a priori how many lines are coming.
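For reference, the single-line trick can be sketched roughly as follows. The helper name pick_random_line and the in-memory filehandle are assumptions made for this illustration; they are not taken from perlfaq.

```perl
use strict;
use warnings;

# Pick one line uniformly at random from a filehandle in a single pass.
# pick_random_line is a hypothetical helper name used only for this sketch.
sub pick_random_line {
    my ($fh) = @_;
    my ($chosen, $seen) = (undef, 0);
    while (my $line = <$fh>) {
        $seen++;
        # Replace the current choice with probability 1/$seen.
        $chosen = $line if rand($seen) < 1;
    }
    return $chosen;    # undef if the input was empty
}

# Usage: read from an in-memory filehandle.
my $text = "a\nb\nc\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
print pick_random_line($fh);
```

Only the current candidate is held in memory, which is the property the generalization below extends to N lines.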

Here is a generalization of the method that chooses a uniformly random subset of N lines (without repetition) from a file. The method only needs to keep N lines of the file in memory at any time. It also preserves the ordering of lines: the selected lines are printed in the order they appear in the input. If the little script is named sample, you use it like this:

$ sample 10 somelongfile.txt   ## to get 10 random lines
$ some long command | sample 50 > mysample.txt

Proof of correctness is fairly straight-forward by induction.
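
One way the induction might go, as a sketch for the per-line inclusion probability. The symbols $n$ (lines read so far), $N$ (sample size) and $\ell_i$ (the $i$-th line) are notation introduced here, not taken from the original post.

```latex
\textbf{Claim.} After $n \ge N$ lines have been read, each line $\ell_i$
($1 \le i \le n$) is in the sample with probability $N/n$.

\textbf{Base case} ($n = N$): all $N$ lines are kept, so the probability is $N/N = 1$.

\textbf{Step} ($n \to n+1$): line $\ell_{n+1}$ is admitted with probability
$\frac{N}{n+1}$. A line $\ell_i$ already in the sample is evicted only if
$\ell_{n+1}$ is admitted and $\ell_i$ is the one chosen among the $N$ held lines, so
\[
  \Pr[\ell_i \text{ survives the step}] = 1 - \frac{N}{n+1}\cdot\frac{1}{N} = \frac{n}{n+1},
\]
\[
  \Pr[\ell_i \text{ in sample after } n+1 \text{ lines}]
    = \frac{N}{n}\cdot\frac{n}{n+1} = \frac{N}{n+1}.
\]
```

This covers only the marginal probability of each line; a similar argument over whole subsets shows that every N-line subset is equally likely.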

Note: perlfaq recommends File::Random for the case of choosing 1 random line. And indeed, the random_line function in that module has an option to choose more than 1 line. However, it selects with repetition.

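#!/usr/bin/perl
use strict;
use warnings;

# Number of lines to sample; defaults to 10 when no argument is given.
# (Testing @ARGV instead of "shift || 10" lets an explicit 0 be rejected.)
my $wanted = @ARGV ? shift : 10;
my @got;

die "Invalid number of lines!\n" unless $wanted =~ /^\d+$/ && $wanted >= 1;

while (<>) {
  if (@got < $wanted) {
      # Fill the reservoir with the first $wanted lines.
      push @got, $_;
  } elsif (rand($.) < $wanted) {
      # Keep line number $. with probability $wanted/$. by evicting a
      # uniformly chosen held line; appending preserves file order.
      splice @got, rand(@got), 1;
      push @got, $_;
  }
}

die "Not enough lines!\n" if @got < $wanted;
print @got;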
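
To see the selection rule in isolation, here is a rough sketch that wraps the same loop in a sub and runs an empirical uniformity check. The name sample_lines, the explicit $seen counter (used instead of $.), and the trial count are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Same selection rule as the script, wrapped in a sub so it can be exercised.
# sample_lines is a hypothetical name chosen for this sketch.
sub sample_lines {
    my ($wanted, $fh) = @_;
    my @got;
    my $seen = 0;
    while (my $line = <$fh>) {
        $seen++;
        if (@got < $wanted) {
            push @got, $line;                 # fill the reservoir first
        } elsif (rand($seen) < $wanted) {     # admit with prob. wanted/seen
            splice @got, rand(@got), 1;       # evict a uniformly chosen line
            push @got, $line;                 # append, so file order is kept
        }
    }
    return @got;
}

# Rough uniformity check: sample 3 of 10 lines many times and count hits.
my $text   = join '', map { "$_\n" } 1 .. 10;
my $trials = 20_000;
my %hits;
for (1 .. $trials) {
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    $hits{$_}++ for sample_lines(3, $fh);
}

# Each line should show up in roughly 3/10 of the trials.
printf "%2d: %.3f\n", $_, ($hits{"$_\n"} // 0) / $trials for 1 .. 10;
```

If the rule is uniform, every printed frequency should sit close to 0.300; the exact figures vary from run to run.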
Re: Randomly select N lines from a file, on the fly
by ambrus (Abbot) on Oct 09, 2008 at 09:56 UTC
