radnorr:
Some time ago, someone here at PM showed a pretty cool way to select a random line from a long file. Basically, as you read the file, you store the line if rand gives you a value lower than 1 / (line number). I tried to generalize it here:
#!/usr/bin/perl
#
# sample_random_lines_from_file.pl <FName> <NumSamples>
use 5.10.1;
use strict;
use warnings;
use autodie;
my @samples;
my $FName = shift // die "missing: <filename> <numsamples>";
my $num = shift // die "missing: <numsamples>";
open my $FH, '<', $FName;
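while (<$FH>) {
    # Keep line $. with probability $num/$. (always true for the first $num lines)
    if ($num/$. > rand) {
        my $i = @samples;
        # Once $num samples are held, overwrite a random slot instead of growing
        if ($i >= $num) { $i = int rand @samples; }
        #print "slot $i, size=" . scalar(@samples) . ", line $.\n";
        $samples[$i]=[ $., $_ ];
    }
}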
print "random samples:\n";
print $$_[1] for sort { $$a[0] <=> $$b[0] } @samples;
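For comparison, the single-line trick that this script generalizes can be sketched on its own. This is a minimal sketch, not the code from the earlier post: pick_one is a made-up name, and it works on an in-memory list rather than a file handle so that it runs as is.

```perl
use strict;
use warnings;

# pick_one is a hypothetical helper name, not from the original post.
# It walks the items once and lets item n replace the current choice
# with probability 1/n, so the item count is not needed up front.
sub pick_one {
    my @items = @_;
    my ($chosen, $n);
    for my $item (@items) {
        $n++;
        $chosen = $item if rand() < 1 / $n;   # first item is always taken
    }
    return $chosen;
}

print pick_one(qw(alpha beta gamma delta)), "\n";
```

The first item is always taken, and each later item gets a smaller chance of displacing the current choice, which is what leaves every item equally likely at the end.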
I haven't tested the script extensively. It runs, but I haven't yet convinced myself that it is free of bias. The little testing I did was to generate a file with a million lines and run the script on it a few times:
$ perl -e 'print "$_\n" for 1 .. 1000000' >a_million_lines
marco@Boink:/Work/Tools/SQL/parser
$ perl pm_sample_lines_from_file.pl a_million_lines 10
random samples:
29748
135818
143918
164669
216447
245165
267754
404776
419876
487740
893947
marco@Boink:/Work/Tools/SQL/parser
$ perl pm_sample_lines_from_file.pl a_million_lines 10
random samples:
163918
434324
435340
534748
596221
611074
677311
682939
719979
842687
998139
There may be a bias in it, in the form of a preference for one end of the file or the other. I haven't played with it enough to tell, nor worked out how to correct it if there is one. One thing the runs above do show: each printed 11 lines for a request of 10, because a slot test of $i > $num lets @samples grow to $num + 1 entries before replacement starts, so the comparison needs to be >=.
The changes I made to adapt the algorithm are rather simple. Instead of keeping a line with probability 1/(line number), I keep it with probability (desired number of samples)/(line number). Once @samples holds enough lines, each kept line goes into a randomly chosen slot of @samples, replacing what was there.
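One way to look for a bias empirically is to sample a small list many times and count how often each line is picked. The sketch below assumes a hypothetical helper, sample_lines, that applies the same keep/replace rule to a list instead of a file handle, with the slot test written as >= so the array never holds more than the requested number of lines. The sizes (3 of 10, 20,000 trials) are arbitrary choices for illustration.

```perl
use strict;
use warnings;

# sample_lines is a hypothetical helper: the same keep/replace rule as the
# script, applied to a list instead of a file handle.
sub sample_lines {
    my ($num, @lines) = @_;
    my @samples;
    my $n = 0;                                   # plays the role of $.
    for my $line (@lines) {
        $n++;
        next unless $num / $n > rand;            # keep with probability num/n
        my $i = @samples;                        # next free slot
        $i = int rand @samples if $i >= $num;    # full: overwrite a random slot
        $samples[$i] = $line;
    }
    return @samples;
}

# Sample 3 of 10 "lines" many times and count how often each one is picked.
my ($num, $total, $trials) = (3, 10, 20_000);
my %count;
for (1 .. $trials) {
    $count{$_}++ for sample_lines($num, 1 .. $total);
}

# With no bias every line should land near trials * num / total.
printf "expected about %d per line\n", $trials * $num / $total;
printf "line %2d: %d\n", $_, $count{$_} // 0 for 1 .. $total;
```

If early or late lines were favored, their counts would drift away from the expected value as the number of trials grows. Written this way the rule matches what is usually called reservoir sampling, for which each line is expected to be kept with the same probability, num/total.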
I hope you find it useful.
...roboticus
When your only tool is a hammer, all problems look like your thumb.