radnorr:

Some time ago, someone here at PM showed a pretty cool way to select a random line from a long file. Basically, as you read the file, you store the line if rand gives you a value lower than 1 / (line number). I tried to generalize it here:

#!/usr/bin/perl
#
# sample_random_lines_from_file.pl <FName> <NumSamples>

use 5.10.1;
use strict;
use warnings;
use autodie;

my @samples;
my $FName = shift // die "missing: <filename> <numsamples>";
my $num   = shift // die "missing: <numsamples>";

open my $FH, '<', $FName;
while (<$FH>) {
    # Keep line $. with probability $num/$. (always, for the first $num lines)
    if ($num/$. > rand) {
        my $i = @samples;
        # Once all $num slots are filled, overwrite a random one instead of
        # appending (>= rather than >, so the array never grows past $num)
        if ($i >= $num) {
            $i = int rand @samples;
        }
        #print "slot $i, size=" . scalar(@samples) . ", line $.\n";
        $samples[$i] = [ $., $_ ];
    }
}

print "random samples:\n";
print $$_[1] for sort { $$a[0] <=> $$b[0] } @samples;
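For comparison, the single-line trick mentioned at the top can be sketched roughly like this. It is only a minimal illustration: the helper name pick_random_line is made up for this example, and it reads from an in-memory string so it runs without an input file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# pick_random_line is a hypothetical helper name, not from the original node.
# It keeps line N with probability 1/N, so only one line is held in memory.
sub pick_random_line {
    my ($fh) = @_;
    my $pick;
    my $n = 0;
    while (my $line = <$fh>) {
        $n++;
        # Line 1 is always kept (rand is below 1); line N replaces the
        # current pick with probability 1/N.
        $pick = $line if rand() < 1 / $n;
    }
    return $pick;    # undef if the file was empty
}

# Demo on an in-memory "file" holding the numbers 1 .. 20, one per line.
my $data = join '', map { "$_\n" } 1 .. 20;
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
print "picked: ", pick_random_line($fh);
close $fh;
```

The file is read exactly once and its length does not need to be known in advance, which is what makes the approach attractive for long files.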

I haven't tested it extensively. It works, but I haven't yet convinced myself that it is free of bias. The little testing I did was to generate a file with a million lines and run the script against it a few times:

$ perl -e 'print "$_\n" for 1 .. 1000000' >a_million_lines
marco@Boink:/Work/Tools/SQL/parser
$ perl pm_sample_lines_from_file.pl a_million_lines 10
random samples:
29748
135818
143918
164669
216447
245165
267754
404776
419876
487740
893947
marco@Boink:/Work/Tools/SQL/parser
$ perl pm_sample_lines_from_file.pl a_million_lines 10
random samples:
163918
434324
435340
534748
596221
611074
677311
682939
719979
842687
998139

There may be a "bias" in it, that is, a preference for one end of the file or the other. I haven't played with it enough to tell, nor figured out how to correct it if there is one.

The changes I made to adapt the algorithm are rather simple: instead of keeping a line with probability 1/(line number), I keep it with probability (desired num samples)/(line number). For the first (desired num samples) lines that ratio is at least 1, so they are always kept and fill @samples in order. Once @samples is full, each kept line overwrites a randomly chosen slot.
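One way to look for a bias empirically is to run the sampler many times over a small input and count how often each line is picked. The sketch below does that under a few assumptions of mine: sample_lines is a hypothetical helper that applies the selection rule described above to a list instead of a file, with the sample array capped at exactly $k entries, and the sizes (20 lines, 5 samples, 20,000 trials) are arbitrary. If the selection is fair, every line should be picked roughly trials * k / lines times.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# sample_lines is a hypothetical helper for this check: it keeps item N with
# probability $k/N and, once $k slots are filled, overwrites a random slot.
sub sample_lines {
    my ($k, @lines) = @_;
    my @samples;
    my $n = 0;
    for my $line (@lines) {
        $n++;
        next unless $k / $n > rand;
        my $i = @samples;
        $i = int rand @samples if $i >= $k;
        $samples[$i] = $line;
    }
    return @samples;
}

# Arbitrary sizes chosen for the demo, not taken from the original post.
my ($lines, $k, $trials) = (20, 5, 20_000);

# Count how often each line number ends up in the sample.
my %count;
for (1 .. $trials) {
    $count{$_}++ for sample_lines($k, 1 .. $lines);
}

my $expected = $trials * $k / $lines;
printf "line %2d: %5d picks (expected about %d)\n",
    $_, $count{$_} || 0, $expected
    for 1 .. $lines;
```

Counts that drift steadily upward or downward with the line number would point to a preference for one end of the file; counts that just scatter around the expected value would not. This is only a rough check, not a proof.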

I hope you find it useful.

...roboticus

When your only tool is a hammer, all problems look like your thumb.


In reply to Re: Get random unique lines from file by roboticus
in thread Get random unique lines from file by radnorr
