radnorr:

Some time ago, someone here at PM showed a pretty cool way to select a random line from a long file. Basically, as you read the file, you replace the stored line with the current one whenever rand gives you a value lower than 1 / (line number). I tried to generalize it here:

#!/usr/bin/perl                             
#                                           
# sample_random_lines_from_file.pl  <FName> <NumSamples>     
use 5.10.1;                                 
use strict;                                 
use warnings;                               
use autodie;                                
                                            
my @samples;                                
                                            
my $FName = shift // die "missing: <filename> <numsamples>"; 
my $num = shift // die "missing: <numsamples>";              
                                            
open my $FH, '<', $FName;                   
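while (<$FH>) {
    # Keep line $. with probability $num/$. (always true for the first $num lines).
    if ($num/$. > rand) {
        my $i = @samples;
        # Once @samples holds $num lines, overwrite a random slot.
        # The test must be >=; with > the array grows to $num+1 entries.
        if ($i >= $num) { $i = int rand @samples; }
        #print "slot $i, size=" . scalar(@samples) . ", line $.\n";
        $samples[$i]=[ $., $_ ];
    }
}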
                                            
print "random samples:\n";                  
print $$_[1] for sort { $$a[0] <=> $$b[0] }  @samples;

I haven't tested the script extensively: it works, but I haven't yet convinced myself that it is free of bias. The little testing I did was to generate a file with a million lines in it and run the script against it a few times:

$ perl -e 'print "$_\n" for 1 .. 1000000' >a_million_lines

marco@Boink:/Work/Tools/SQL/parser
$ perl pm_sample_lines_from_file.pl a_million_lines 10
random samples:
29748
135818
143918
164669
216447
245165
267754
404776
419876
487740
893947

marco@Boink:/Work/Tools/SQL/parser
$ perl pm_sample_lines_from_file.pl a_million_lines 10
random samples:
163918
434324
435340
534748
596221
611074
677311
682939
719979
842687
998139
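A handful of runs says little about bias either way. One rough way to check is to sample a short list many times and count how often each item is kept. The sketch below does that under stated assumptions: `sample_items` is a hypothetical helper that mirrors the script's loop on an in-memory list, with the fullness check written as `>=` so the array stays at exactly `$k` entries, and the sizes (20 items, 5 samples, 20,000 trials) are arbitrary.

```perl
use strict;
use warnings;

# Hypothetical helper: the script's sampling loop applied to an array
# reference instead of a file, so it can be run many times cheaply.
sub sample_items {
    my ($k, $items) = @_;
    my @samples;
    my $n = 0;                       # plays the role of $. in the script
    for my $item (@$items) {
        $n++;
        next unless $k / $n > rand;  # keep item $n with probability $k/$n
        my $i = @samples;
        # Array full: overwrite a random slot instead of appending.
        $i = int rand @samples if $i >= $k;
        $samples[$i] = $item;
    }
    return @samples;
}

# Arbitrary sizes chosen for illustration only.
my ($n_items, $k, $trials) = (20, 5, 20_000);
my @items = (1 .. $n_items);
my %count = map { $_ => 0 } @items;

for (1 .. $trials) {
    $count{$_}++ for sample_items($k, \@items);
}

# With no bias, every item should be kept about trials * k / n_items times.
my $expected = $trials * $k / $n_items;
printf "%3d: %6d (expected about %d)\n", $_, $count{$_}, $expected for @items;
```

If the counts drift steadily upward or downward across the list, that would point to a preference for one end. Both sample runs above print 11 lines for a request of 10, which is what a `>` in the fullness check produces, so swapping `>=` for `>` in the helper is an easy way to see how much that one comparison matters.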

The script may have a "bias", in the sense of preferring lines from one end of the file or the other. I haven't played with it enough to determine whether it does, nor figured out a way to correct it if so. Anyway, the changes I made to adapt the algorithm are rather simple: instead of keeping a line with probability 1/(line number), I keep it with probability (desired num samples)/(line number). Once @samples holds enough lines, each kept line goes into a randomly chosen slot of @samples, replacing whatever was there.
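For comparison, the single-line version of the trick (keep line N with probability 1/N) might look like the sketch below. It is not the code from the earlier post: `pick_random_line` is a hypothetical name, and the demo reads from an in-memory string rather than a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: return one line chosen uniformly at random
# from a filehandle, reading it only once.
sub pick_random_line {
    my ($fh) = @_;
    my $choice;
    my $n = 0;
    while (my $line = <$fh>) {
        $n++;
        # Replace the current choice with probability 1/$n.
        # The first line is always taken, since rand is below 1.
        $choice = $line if rand() < 1 / $n;
    }
    return $choice;
}

# Demo on an in-memory "file" holding the numbers 1 .. 20, one per line.
my $text = join '', map { "$_\n" } 1 .. 20;
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my $picked = pick_random_line($fh);
close $fh;
print "picked: $picked";
```

The generalized script differs from this in only two places: the probability becomes (desired num samples)/(line number), and a kept line lands in a random slot of an array instead of a single scalar.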

I hope you find it useful.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

In reply to Re: Get random unique lines from file by roboticus
in thread Get random unique lines from file by radnorr
