Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re: Approach to efficiently choose a random line from a large file

by periapt (Hermit)
on Dec 13, 2004 at 13:47 UTC ( #414381=note: print w/ replies, xml ) Need Help??


in reply to Approach to efficiently choose a random line from a large file

Kuth, in The Art of Computer Programming (Vol II p144 3rd ed.) gives a simple algorithm for selection lines from a file when the total number of lines isn't known ahead of time. How do I pick a random line from a file? seems to be a variation of that idea. It might be worth looking at again.

If you know the number of lines in the file ahead of time, you could try something like this

my $t = 0; # nr of lines already considered my $N = 10; # nr of lines in file do{ my $line = <>; }until(($N-$t++)*rand < 1); print $line;
This assumes you are selecting only one line (there is a simple mod if you want more than one, say m, lines) . You step through the file one line at a time and decide whether to select the current line. If not, you move on to the next one and so on. On average, you will only have to read (N+1)n/(n+1) lines before selecting one. If your files contain 10 million records, you would, on average only read about 5 million lines before selecting one. How do I pick a random line from a file? requires a complete pass through all lines each time. Depending on your setup or requirements, the savings might be worth it.

Update: included the incrementor for $t

PJ
use strict; use warnings; use diagnostics;


Comment on Re: Approach to efficiently choose a random line from a large file
Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://414381]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (7)
As of 2014-07-25 03:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (167 votes), past polls