radnorr:
Some time ago, someone here at PM showed a pretty cool way to select a random line from a long file. Basically, as you read the file, you store the current line (replacing whatever you stored before) if rand gives you a value lower than 1 / (line number). I tried to generalize it here:
#!/usr/bin/perl
#
# sample_random_lines_from_file.pl <FName> <NumSamples>
use 5.10.1;
use strict;
use warnings;
use autodie;
my @samples;
my $FName = shift // die "missing: <filename> <numsamples>";
my $num = shift // die "missing: <numsamples>";
open my $FH, '<', $FName;
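while (<$FH>) {
    # Keep line $. with probability $num/$. (always true for the first $num lines)
    if ($num/$. > rand) {
        my $i = @samples;
        # Once all $num slots are full, overwrite a random one. This must be >=:
        # with > the array grows to $num+1 entries.
        if ($i >= $num) { $i = int rand @samples; }
        #print "slot $i, size=" . scalar(@samples) . ", line $.\n";
        $samples[$i]=[ $., $_ ];
    }
}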
print "random samples:\n";
print $$_[1] for sort { $$a[0] <=> $$b[0] } @samples;
I haven't tested it extensively. It works, but I haven't yet convinced myself that it's free of bias. The little testing I did was to generate a file with a million lines in it and run the script a few times:
$ perl -e 'print "$_\n" for 1 .. 1000000' >a_million_lines
marco@Boink:/Work/Tools/SQL/parser
$ perl pm_sample_lines_from_file.pl a_million_lines 10
random samples:
29748
135818
143918
164669
216447
245165
267754
404776
419876
487740
893947
marco@Boink:/Work/Tools/SQL/parser
$ perl pm_sample_lines_from_file.pl a_million_lines 10
random samples:
163918
434324
435340
534748
596221
611074
677311
682939
719979
842687
998139
There may be a "bias" in it, in the sense of a preference for one end of the file or the other. I haven't played with it enough to tell, nor figured out a way to correct one if it's there. Anyway, the changes I made to adapt the algorithm are rather simple: instead of keeping a line with probability 1/(line number), I keep it with probability (desired num samples)/(line number). Once enough samples have been gathered to fill @samples, each newly kept line goes into a randomly chosen slot of the array, replacing what was there.
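One way to poke at the bias question is to run the same loop many times over a small range of line numbers and count how often each line is picked. The sketch below does that. It is not part of the script above: sample_line_numbers and the values 3, 20 and 100_000 are made up for illustration, and it assumes the slot test is >= so that @samples never holds more than $num entries. Under that assumption the usual argument goes: line n is kept with probability num/n, and a line already in the array survives line n with probability 1 - (num/n)(1/num) = (n-1)/n, so by induction every line should end up with probability num/n and the counts should come out roughly flat.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run the sampling loop over line numbers 1..$n
# instead of a file, and return the line numbers that were kept.
sub sample_line_numbers {
    my ($num, $n) = @_;
    my @samples;
    for my $line (1 .. $n) {
        # Keep this line with probability $num/$line
        next unless $num / $line > rand;
        my $i = @samples;
        # Assumption: >= here, so @samples never exceeds $num entries
        $i = int rand @samples if $i >= $num;
        $samples[$i] = $line;
    }
    return @samples;
}

my ($num, $n, $trials) = (3, 20, 100_000);

# $hits[$line] counts how many trials picked that line
my @hits = (0) x ($n + 1);
for (1 .. $trials) {
    $hits[$_]++ for sample_line_numbers($num, $n);
}

# If there is no bias, each line is picked in about $num/$n of the trials
my $expected = $trials * $num / $n;
printf "%2d  %6d  (expected ~%d)\n", $_, $hits[$_], $expected for 1 .. $n;
```

A count that drifts steadily up or down with the line number would point to a preference for one end of the file. Random scatter around the expected value is what an unbiased sampler should produce.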
I hope you find it useful.
...roboticus
When your only tool is a hammer, all problems look like your thumb.