Re: Reading HUGE file multiple times

by Laurent_R (Prior)
on Apr 28, 2013 at 16:03 UTC


in reply to Reading HUGE file multiple times

OK, these are the assumptions and steps I made for my test, based on my understanding of your requirements. I started from a file containing the official French transcript of a session of the European Parliament, and used it to construct a file containing just one text line, about 182,700 characters long:

$ wc file182730char.txt
1 28947 182729 file182730char.txt
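For reference, one simple way to flatten a multi-line transcript into a single line (assuming the transcript is in a file called, say, europarl_fr.txt; the exact name does not matter) is something like this:

$ perl -0777 -pe 'tr/\n/ /; s/\s+$/\n/' europarl_fr.txt > file182730char.txt

This slurps the whole file at once, turns every newline into a space, and restores a single trailing newline so that wc reports exactly one line.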

From there, I built a 5 GB file this way: each iteration writes one identifier line with two random integers between 0 and 28,887, followed by one data line containing a copy of the 182,729-character line above; doing this 28,000 times gives my 5 GB file:

$ perl -e '$t = <>; for (1..28000) { $c = int rand 28888; $d = int rand 28888; print "> $c $d \n"; print $t}' file182730char.txt > file5gb.txt
This command took about 6 minutes to execute on my relatively old laptop. The resulting file is about 5.1 billion bytes:
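If you prefer a full script to a one-liner, here is essentially the same generation step written out with comments (same file names, same logic):

#!/usr/bin/perl
# Same logic as the one-liner above, written out for readability
use strict;
use warnings;

open my $in,  '<', 'file182730char.txt' or die "Cannot open input: $!";
open my $out, '>', 'file5gb.txt'        or die "Cannot open output: $!";

my $data_line = <$in>;            # the single long text line

for (1 .. 28_000) {
    my $c = int rand 28_888;      # first random number
    my $d = int rand 28_888;      # second random number
    print {$out} "> $c $d \n";    # identifier line
    print {$out} $data_line;      # copy of the long data line
}

close $out or die "Cannot close output: $!";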
$ time wc file5gb.txt
56000 810600000 5116810585 file5gb.txt

real    7m54.609s
user    4m3.436s
sys     0m10.530s

As you can see, a simple word count (wc command) on the file took almost 8 minutes to run.

The structure of the big file is something like this:

> 12048 6179
reprise de la session [...] commission ne peut tout faire
> 1024 7912
reprise de la session [...] commission ne peut tout faire
> 3926 17512
reprise de la session [...] commission ne peut tout faire
> 15268 6071

(with each data line above actually being 182,729 characters long).

The idea now is to read this big file, pick up the two random numbers c and d from each identifier line, and print the c-th and d-th fields of the following data line to an output file. This can be done in a single pass over the file, like this:

$ perl -ne 'if (/^> /) { ($c, $d) = (split)[1,2];} else { print join " ", (split)[$c,$d]; print "\n"};' file5gb.txt > foo.txt

Extracting the data from the big file took about 16 minutes, i.e. roughly twice as long as the simple wc command on the same file, which I think is quite good performance.
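For those who find the one-liner hard to parse, here is the same single-pass logic as a commented script (it is just a restatement of the command above):

#!/usr/bin/perl
# Single pass: remember c and d from each identifier line,
# then print the c-th and d-th fields of the following data line
use strict;
use warnings;

open my $in,  '<', 'file5gb.txt' or die "Cannot open input: $!";
open my $out, '>', 'foo.txt'     or die "Cannot open output: $!";

my ($c, $d);
while (my $line = <$in>) {
    if ($line =~ /^> /) {
        ($c, $d) = (split ' ', $line)[1, 2];   # field indices for the next data line
    }
    else {
        print {$out} join(" ", (split ' ', $line)[$c, $d]), "\n";
    }
}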

The resulting foo.txt file looks like this:

$ head foo.txt
ceux cen
la incapables
les que
grand la
une en
invitant que
niveau d
au ces
consequences que
un le

I do not know if my scenario is anywhere close to what you are trying to do, but that is more or less what I understood from your requirements, together with some assumptions on what the identifiers on the data lines might be used for.

Your needs might be quite different, but I still hope this helps show how you can do this type of thing in just one pass through the file.


Re^2: Reading HUGE file multiple times
by Anonymous Monk on Apr 28, 2013 at 16:30 UTC
    Thanks Laurent, great example. Will give it a try and see how much faster it will be.
