Reading HUGE file multiple times

by Anonymous Monk
on Apr 28, 2013 at 00:53 UTC
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have a 5GB file that has identifier lines followed by very long data lines (single lines in both cases). In a loop I get coordinates which tell me which identifier I need and what part of the corresponding data I need to extract and modify. The problem is that this loop goes through >1000 repetitions, and reading the file each time is a dumb idea. I was thinking about putting it into a hash but I'm not sure about memory limitations. Any idea on how to tackle it? Speed is really an important factor. Maybe do a system call with qx and a Linux grep command? I have to get away from the computer for a couple of hours, so thanks in advance!

Re: Reading HUGE file multiple times
by BrowserUk (Pope) on Apr 28, 2013 at 01:06 UTC

    Index the file in one pass; then use the index to seek the id/data directly:

    #! perl -slw
    use strict;

    my %idx;

    ## Index the file
    $idx{ <> } = tell( ARGV ), scalar <> until eof();

    for ( 1 .. 1000 ) {
        my $id = getNextId( ... );
        seek ARGV, $idx{ $id }, 0;    # 0 = SEEK_SET
        scalar <>;                    # discard id line (or verify)
        print scalar <>;              ## access data
    }

    Untested code for flavour only.


      thanks, will try it right away

        On my system, the code above indexed a 6.4 million record, 5GB file in 57 seconds.

        1367141700 1367141757 6348909   (start time, end time in epoch seconds, and record count)

        Once indexed, accessing the records randomly runs at 1 second per thousand.


Re: Reading HUGE file multiple times
by dsheroh (Parson) on Apr 28, 2013 at 10:46 UTC
    Depending on what you're doing with the data, another option might be to do one pass over it to pre-process the data (extracting individual fields, etc.) and insert it into a database, then use the database for later operations.
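    A minimal sketch of that approach, assuming DBD::SQLite is available and that the file alternates ">ID" lines with single data lines; the database, table, and file names here are only placeholders for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # One pass over the big file, loading id/data pairs into SQLite.
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=library.db', '', '',
                            { RaiseError => 1, AutoCommit => 0 } );
    $dbh->do( 'CREATE TABLE IF NOT EXISTS records ( id TEXT PRIMARY KEY, data TEXT )' );
    my $sth = $dbh->prepare( 'INSERT OR REPLACE INTO records ( id, data ) VALUES ( ?, ? )' );

    open my $fh, '<', 'bigfile.txt' or die "bigfile.txt: $!";
    while ( my $id = <$fh> ) {
        chomp $id;
        my $data = <$fh>;           # the single long data line that follows
        chomp $data;
        $sth->execute( $id, $data );
    }
    close $fh;
    $dbh->commit;

    # Later, each of the >1000 lookups is a single indexed query:
    my ( $record ) = $dbh->selectrow_array(
        'SELECT data FROM records WHERE id = ?', undef, '>ID_1' );

    Whether this beats a plain byte-offset index depends mostly on how often the same file is reused; the one-time cost of loading the database pays off when many runs query the same data.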
Re: Reading HUGE file multiple times
by Laurent_R (Parson) on Apr 28, 2013 at 10:46 UTC

    Hi,

    I am dealing just about every day with somewhat similar problems on huge data files, and I am fairly confident that it should be possible to read the file only once (or at most twice), but you don't give enough information about the structure of the file.

    Is my understanding correct that you first have a bunch of identifier lines (1000+), and then your data lines? And the identifier lines somehow give the rules as to what to do with the data lines? Or do you have one identifier line giving information about what to do on the next data line (or lines)?

    Please tell us more about the identifiers: do they say on which data line numbers to do something? Or which field to extract in the data line?

    In any case, I believe it should be possible to read your file sequentially only once, recording what is in each identifier line and using that to process the data lines that follow. But I can't say more about how to do it without a better idea of your data format or, even better, a simplified sample of your file content together with some explanation of how to use the identifiers to analyze the data lines.

      Hi there,

      Thanks for the tips. My data looks something like

      >ID

      Data (a verrry long string of varying length in a single line)

      >ID again

      Data again

      Indexing might be a good idea. Maybe I could index only the IDs (skipping the next line) and then, when accessing, just add +1 to the index? I need to extract the data twice, in different subroutines, and each time the subroutine specifies what to do with it. I don't know if it is a good idea to store it all in a hash. I only need to extract a fragment of the data in the first read and the whole data entry in the other. I don't have the IDs in advance; the subroutine specifies which one I need and what to do with it. I've tried

      $Library_Index{<$Library>} = tell($Library), scalar <$Library> until eof($Library);

      but it takes a very long time. I wonder if there is a better way to do it, since this would be a bottleneck.

        I've tried $Library_Index{<$Library>} = tell($Library), scalar <$Library> until eof($Library); but it takes a very long time.

        How long?

        I wonder if there is a better way to do it, since this would be a bottleneck.

        If you are re-using the file many times you could construct the index and write it to a separate file. It should be substantially quicker to read the index from a file than to construct it from scratch each time.
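        A sketch of that, assuming the index maps each ID line to the byte offset of its data line, and using Storable for the on-disk copy (file names and the '>ID_1' key are placeholders):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Storable qw( store retrieve );

        my $library_file = 'library.txt';    # the 5GB data file
        my $index_file   = 'library.idx';    # cached index

        my %Library_Index;

        if ( -e $index_file and -M $index_file < -M $library_file ) {
            # Re-use the index written by a previous run.
            %Library_Index = %{ retrieve( $index_file ) };
        }
        else {
            # Build the index once: offset of each data line, keyed by its ID line.
            open my $Library, '<', $library_file or die "$library_file: $!";
            until ( eof $Library ) {
                my $id = <$Library>;
                chomp $id;
                $Library_Index{ $id } = tell $Library;   # start of the data line
                scalar <$Library>;                       # skip the data line
            }
            close $Library;
            store( \%Library_Index, $index_file );
        }

        # Each later lookup is then a single seek plus one read:
        open my $Library, '<', $library_file or die "$library_file: $!";
        seek $Library, $Library_Index{ '>ID_1' }, 0;     # 0 = SEEK_SET
        my $data = <$Library>;

        Loading the saved index is one sequential read of a much smaller file, so it should be considerably quicker than re-scanning the 5GB file to rebuild it.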


      Forgot to mention - the ID is only used to identify the data. So basically I know I need ID_1 and I extract the corresponding data set. Afterwards the code uses different variables to decide how to modify it, and those are not related to the ID.
Re: Reading HUGE file multiple times
by Laurent_R (Parson) on Apr 28, 2013 at 16:03 UTC

    OK, these are the assumptions and steps I made for my test, based on my understanding of your requirements. I started from a file containing official transcripts (in French) of a session of the European Parliament, from which I built a file containing just one text line about 182,700 characters long:

    $ wc file182730char.txt
          1   28947  182729 file182730char.txt

    From there, I built a 5 GB file this way: each time, one identifier line with two random integers between 0 and 28888, and one data line containing a copy of the 182,700-character line above, repeating this 28,000 times to get my 5 GB file:

    $ perl -e '$t = <>; for (1..28000) { $c = int rand 28888; $d = int rand 28888; print "> $c $d \n"; print $t}' file182730char.txt > file5gb.txt
    This command took about 6 minutes to execute on my relatively old laptop. The resulting file is about 5.1 billion bytes:
    $ time wc file5gb.txt
        56000  810600000 5116810585 file5gb.txt

    real    7m54.609s
    user    4m3.436s
    sys     0m10.530s

    As you can see, a simple word count (wc command) on the file took almost 8 minutes to run.

    The structure of the big file is something like this:

    > 12048 6179
    reprise de la session [...] commission ne peut tout faire
    > 1024 7912
    reprise de la session [...] commission ne peut tout faire
    > 3926 17512
    reprise de la session [...] commission ne peut tout faire
    > 15268 6071

    (with each data line above being in fact 182,729 characters long)

    The idea now is to read this big file, get the two c and d random numbers on each of the identifier lines and print the c-th and d-th fields of the next data line into an output file. This can be done in just one pass over the file like this:

    $ perl -ne 'if (/^> /) { ($c, $d) = (split)[1,2];} else { print join " ", (split)[$c,$d]; print "\n"};' file5gb.txt > foo.txt

    Extracting the data from the big file took about 16 minutes, so about twice the duration of the simple wc command on the same file (which I think is quite good performance).

    The resulting foo.txt file looks like this:

    $ head foo.txt
    ceux cen
    la incapables
    les que
    grand la
    une en
    invitant que
    niveau d
    au ces
    consequences que
    un le

    I do not know if my scenario is anywhere close to what you are trying to do, but that is more or less what I understood from your requirement, together with some assumptions on what the identifier might be used for on the data lines.

    Your needs might be quite different, but I still hope this helps show how you can do this type of thing in just one pass through the file.
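    For readability, here is the same one-pass extraction written as a small script rather than a one-liner (a sketch only; file names as in the example above):

    #!/usr/bin/perl
    # Same logic as the one-liner: remember the two field numbers from each
    # "> c d" identifier line, then print those two fields of the data line
    # that follows it.
    use strict;
    use warnings;

    my ( $c, $d );

    open my $in,  '<', 'file5gb.txt' or die "file5gb.txt: $!";
    open my $out, '>', 'foo.txt'     or die "foo.txt: $!";

    while ( my $line = <$in> ) {
        if ( $line =~ /^> / ) {
            ( $c, $d ) = ( split ' ', $line )[ 1, 2 ];
        }
        else {
            print {$out} join( ' ', ( split ' ', $line )[ $c, $d ] ), "\n";
        }
    }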

      Thanks Laurent, great example. Will give it a try and see how much faster it will be.
