
Re: Reading HUGE file multiple times

by Laurent_R (Parson)
on Apr 28, 2013 at 16:03 UTC ( #1031092 )

in reply to Reading HUGE file multiple times

OK, these are the assumptions and steps I made for my test, based on my understanding of your requirements. I started with a file containing official transcripts in French of a session of the European Parliament, which I used to construct a file containing just one text line about 182,700 characters long:

$ wc file182730char.txt
      1  28947 182729 file182730char.txt

From there, I built a 5 GB file as follows: 28,000 times, I wrote one identifier line containing two random integers between 0 and 28,887, followed by one data line containing a copy of the 182,729-character line above:

$ perl -e '$t = <>; for (1..28000) { $c = int rand 28888; $d = int rand 28888; print "> $c $d \n"; print $t}' file182730char.txt > file5gb.txt
This command took about 6 minutes to execute on my relatively old laptop. The resulting file is about 5.1 billion bytes:
$ time wc file5gb.txt
    56000  810600000 5116810585 file5gb.txt

real    7m54.609s
user    4m3.436s
sys     0m10.530s

As you can see, a simple word count (wc command) on the file took almost 8 minutes to run.
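For readability, the generation one-liner can be expanded into a standalone sketch. The sub name and its parameters are my own naming, not from the one-liner; calling it with the values used above reproduces the same file layout:

```perl
use strict;
use warnings;

# Write $reps identifier/data line pairs to $out: each identifier line
# holds two random integers in 0 .. $max - 1, each data line is a copy
# of $text (which should already end in a newline).
sub build_file {
    my ($text, $out, $reps, $max) = @_;
    open my $ofh, '>', $out or die "Cannot open $out: $!";
    for (1 .. $reps) {
        my $c = int rand $max;
        my $d = int rand $max;
        print {$ofh} "> $c $d \n";   # identifier line
        print {$ofh} $text;          # data line
    }
    close $ofh or die "Cannot close $out: $!";
}

# Equivalent to the one-liner above (with $line read from file182730char.txt):
# build_file($line, 'file5gb.txt', 28_000, 28_888);
```

A quick size check on the result: 28,000 data lines of 182,730 bytes (including the newline) come to 5,116,440,000 bytes, and the remaining ~370,000 bytes are the 28,000 short identifier lines, which matches the 5,116,810,585 bytes reported by wc.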

The structure of the big file is something like this:

> 12048 6179
reprise de la session [...] commission ne peut tout faire
> 1024 7912
reprise de la session [...] commission ne peut tout faire
> 3926 17512
reprise de la session [...] commission ne peut tout faire
> 15268 6071

(each data line above is in fact 182,729 characters long.)

The idea now is to read this big file, get the two random numbers c and d from each identifier line, and print the c-th and d-th whitespace-separated fields (indexed from 0, as Perl's split numbers them) of the next data line into an output file. This can be done in just one pass over the file like this:

$ perl -ne 'if (/^> /) { ($c, $d) = (split)[1,2];} else { print join " ", (split)[$c,$d]; print "\n"};' file5gb.txt > foo.txt

Extracting the data from the big file took about 16 minutes, so about twice the duration of the simple wc command on the same file (which I think is quite good performance).
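The extraction one-liner can likewise be written out as a commented sketch. The sub name and the filehandle parameters are my own additions for testability; the field logic is the same as in the one-liner:

```perl
use strict;
use warnings;

# Read identifier/data line pairs from $in_fh. For each identifier line
# "> c d", print fields c and d (0-based, as split counts them) of the
# following data line to $out_fh, one pair of words per output line.
sub extract_fields {
    my ($in_fh, $out_fh) = @_;
    my ($c, $d);
    while (my $line = <$in_fh>) {
        if ($line =~ /^> /) {
            # Fields 1 and 2 of the identifier line are c and d
            # (field 0 is the ">" marker itself).
            ($c, $d) = (split ' ', $line)[1, 2];
        }
        else {
            print {$out_fh} join(' ', (split ' ', $line)[$c, $d]), "\n";
        }
    }
}

# Equivalent to:
# perl -ne '...' file5gb.txt > foo.txt
```

Taking the filehandles as arguments makes the same logic easy to exercise on a small in-memory sample before pointing it at a 5 GB file.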

The resulting foo.txt file looks like this:

$ head foo.txt
ceux cen
la incapables
les que
grand la
une en
invitant que
niveau d
au ces
consequences que
un le

I do not know if my scenario is anywhere close to what you are trying to do, but this is more or less what I understood from your requirements, together with some assumptions about what the identifiers on the data lines might be used for.

Your needs might be quite different, but I still hope this helps show how you can do this type of thing in just one pass through the file.

Re^2: Reading HUGE file multiple times
by Anonymous Monk on Apr 28, 2013 at 16:30 UTC
    Thanks Laurent, great example. Will give it a try and see how much faster it will be.
