Beefy Boxes and Bandwidth Generously Provided by pair Networks DiBona
Perl Monk, Perl Meditation
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

radnorr:

I had a little more time on my hands, so I tested it a bit. It turns out that there's a slight bias for lines from the start of the file. I haven't done any of the math for it, so I don't know if it'll be easy to fix or not. (The bias is small enough that I'm thinking it might be something as simple as never overwriting the last entry in @sample or something. So it might get an early line and have it stick around until the end.)

I don't really have the time or inclination to run it down at the moment, but if I were going to do so, I'd do a couple things:

  • Run my test a couple more times with different numbers of samples, to see if I could easily vary the slope of the bias.
  • Make the sample array have *3* slots per entry instead of 2. I'd use the third slot to hold the number of times the slot was overwritten and histogram that as well. Just in case it's something like the fencepost error mentioned earlier.

Anyway, to test it, I did 1000 runs of 25 samples against the million-line file generated in my previous post:

$ for J in {0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7, +8,9}; do ./pm_sample_lines_from_file.pl a_million_lines 25 >t.$J; don +e

While it was cooking, I quickly put together something to crunch the files and check:

#!/usr/bin/perl + # # pm_sample_histo_test.pl <FNames> # # Make a brief histogram to see if there's any obvious bias in the sam +pler. # Assumes you're sampling from a file containing a million lines, each + of # which contains just the line number. # use 5.10.1; use strict; use warnings; my %H; for my $FName (@ARGV) { print "-- $FName --\n"; open my $FH, '<', $FName or die "can't open $FName\n"; while (<$FH>) { next if /^random/; my $I = int $_/10000; $H{$I}++; } } my $max=0; for (keys %H) { $max = $H{$_} if $H{$_} > $max; } for my $I (0 .. 99) { my $bar = substr('*' x int(50 * ($H{$I}/$max)) . ' 'x50, 0, 50); if ($I > 1 and $I < 98) { my $avg; my $sum = 0; $sum += $H{$_} for $I-2 .. $I+2; $avg = $sum/5; substr($bar,int(50 * ($avg/$max)),1)='+'; } printf "% 3u: (% 3u) %s\n", $I, $H{$I} // 0, $bar; }

It accumulates the samples from the million-line file into 100 buckets (0..10000 for the first one, 10001..20000 for the second one, etc.). It then plots a histogram, and overlays it with a 5-sample moving average:

$ ./pm_sample_histo_test.pl t.* 0: (321) ************************************************** 1: (290) ********************************************* 2: (290) *********************************************+ 3: (291) *********************************************+ 4: (281) ******************************************* + 5: (307) *********************************************+* 6: (290) ********************************************+ 7: (283) ********************************************+ 8: (280) ******************************************* + 9: (283) ********************************************+ 10: (309) ********************************************+*** 11: (287) *******************************************+ 12: (267) ***************************************** + 13: (253) *************************************** + 14: (292) *****************************************+*** 15: (257) **************************************** + 16: (278) ******************************************+ 17: (254) *************************************** + 18: (286) ******************************************+* 19: (250) ************************************** + 20: (291) ******************************************+** 21: (277) ******************************************+ 22: (270) ******************************************+ 23: (262) **************************************** + 24: (261) ****************************************+ 25: (263) ***************************************+ 26: (246) ************************************** + 27: (237) ************************************ + 28: (261) ***************************************+ 29: (245) ************************************** + 30: (265) ***************************************+* 31: (267) ***************************************+* 32: (234) ************************************ + 33: (269) ****************************************+ 34: (252) *************************************** + 35: (272) ****************************************+* 36: (259) ****************************************+ 37: (243) ************************************* + 38: (266) ***************************************+* 39: (237) ************************************ + 40: (259) ***************************************+ 41: (256) ***************************************+ 42: (249) ************************************** + 43: (269) ****************************************+ 44: (274) *****************************************+ 45: (266) ****************************************+ 46: (267) ***************************************+* 47: (229) *********************************** + 48: (233) ************************************ + 49: (265) ***************************************+* 50: (247) ************************************** + 51: (281) ****************************************+** 52: (243) ************************************* + 53: (249) ************************************** + 54: (249) ************************************** + 55: (251) **************************************+ 56: (262) *************************************+** 57: (228) *********************************** + 58: (217) ********************************* + 59: (262) *************************************+** 60: (260) **************************************+* 61: (225) *********************************** + 62: (263) ****************************************+ 63: (297) ****************************************+***** 64: (271) ******************************************+ 65: (250) ************************************** + 66: (275) *****************************************+ 67: (267) ***************************************+* 68: (260) ***************************************+ 69: (224) ********************************** + 70: (256) **************************************+ 71: (241) ************************************* + 72: (257) ***************************************+ 73: (256) ***************************************+ 74: (257) ***************************************+ 75: (246) ************************************** + 76: (267) ***************************************+* 77: (267) **************************************+** 78: (232) ************************************ + 79: (239) ************************************* + 80: (252) **************************************+ 81: (269) **************************************+** 82: (234) ************************************ + 83: (239) ************************************* + 84: (240) ************************************* + 85: (259) ***************************************+ 86: (290) ****************************************+**** 87: (249) ************************************** + 88: (256) ***************************************+ 89: (235) ************************************ + 90: (242) ************************************* + 91: (261) **************************************+* 92: (245) **************************************+ 93: (245) ************************************** + 94: (248) **************************************+ 95: (274) ***************************************+** 96: (230) *********************************** + 97: (263) **************************************+* 98: (241) ************************************* 99: (236) ************************************

As you can see, at the bottom side there are 290-ish lines selected, and at the top end, there are 250-ish lines selected.

...roboticus

When your only tool is a hammer, all problems look like your thumb.


In reply to Re^2: Get random unique lines from file by roboticus
in thread Get random unique lines from file by radnorr

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others surveying the Monastery: (14)
    As of 2014-04-24 11:38 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      April first is:







      Results (565 votes), past polls