http://www.perlmonks.org?node_id=1013768

Chuma has asked for the wisdom of the Perl Monks concerning the following question:

I've got a 58 GB XML file that I need to go through and do various regex-related things with.

First I tried the obvious

open in, shift @ARGV; for(<in>){
I assumed that it would read one line at a time, but apparently it tries to read the whole file into memory. I don't have that much memory, in fact I don't have that much hard drive space, so that's not a good idea, and I don't see why Perl would think so.

Then I found a module called Tie::File. If I do

tie @in, 'Tie::File', shift @ARGV; for(@in)
there is no improvement, but if instead I do
tie @in, 'Tie::File', shift @ARGV; while(1){ $_=$in[$i++];
it seems to work. Unfortunately it's still a bit slow - it initially processes about 10 MB per minute, which means it would take upwards of four days to process the whole thing. And that's assuming that it actually continues at constant speed.

I tried changing the program so that I can pause it and continue later, so I can run it at night, or something. I have the program report which line it's on, and take that as input the next time it starts. But looking up the right line number on restart takes a certain amount of time, which appears to be more than linear in the number of lines, so that's not going to work.
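A constant-time alternative would be to checkpoint the byte offset with tell and restore it with seek; here is a minimal sketch of that idea (the checkpoint filename is invented):

    use strict;
    use warnings;

    my $file = shift @ARGV;
    my $ckpt = "$file.pos";    # invented name for the checkpoint file

    open my $in, '<', $file or die "open $file: $!";

    # Resume from the saved byte offset, if there is one.
    if (open my $c, '<', $ckpt) {
        my $pos = <$c>;
        seek $in, $pos, 0 if defined $pos;    # 0 = SEEK_SET
    }

    while (my $line = <$in>) {
        # ... regex work on $line ...

        # Every 100_000 lines, record how far we have read; seek()ing
        # back here later is constant-time, unlike re-counting lines.
        if ($. % 100_000 == 0) {
            open my $c, '>', $ckpt or die "open $ckpt: $!";
            print {$c} tell($in);
            close $c;
        }
    }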

Is there some way to convince Perl to read one line at a time? Or some other clever workaround?

Replies are listed 'Best First'.
Re: Going through a big file
by LanX (Saint) on Jan 17, 2013 at 12:37 UTC
    > First I tried the obvious

    open in, shift @ARGV; for(<in>){

    Obvious?

    You will hardly find any example in the perldocs ever trying this.

    Rather

    open in, shift @ARGV; while (<in>){

    or better

    open my $in, '<', shift @ARGV; while (my $line = <$in>){

    Explanation

    Your code is semantically equivalent to

    my @temp_list=<in>; for(@temp_list){
    slurping the whole file as a first step.

    for expects a list and evaluates its expression in list context¹, so it greedily swallows the whole file at once.

    But while iterates in scalar context, that is, line by line (as defined by $/).
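    A small demonstration of the two contexts, plus the $/ record separator (big.xml is just a stand-in name):

        use strict;
        use warnings;

        open my $in, '<', 'big.xml' or die "open: $!";

        # my @all = <$in>;   # list context: slurps EVERY line into memory

        # Scalar context: one record per call, so memory use stays flat.
        while (defined(my $line = <$in>)) {
            print "line $. is ", length($line), " bytes\n";
        }

        # $/ decides what a record is; the empty string means paragraph mode.
        seek $in, 0, 0;
        {
            local $/ = '';
            while (defined(my $para = <$in>)) {
                print "paragraph of ", length($para), " bytes\n";
            }
        }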

    Cheers Rolf

    ¹) with a little magical exception in recent Perl versions (>= 5.8 ?) regarding ranges, which isn't relevant here

      Oh! Well, that explains it. Thanks very much!
Re: Going through a big file
by RMGir (Prior) on Jan 17, 2013 at 13:03 UTC
    Aside from the excellent advice above about using while rather than for, you may also want to consider looking at the File::SortedSeek module.

    It's not directly applicable to what you're doing, but you could use it for inspiration - if your XML is in any kind of sorted order, you can save ENORMOUS amounts of time by doing binary searches if you only need to process a small subset of the file.
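    If the file did happen to be sorted on some leading key, a lookup could skip almost all of it. A sketch of how File::SortedSeek might be used (interface recalled from memory, file name and key invented, so do check the module's docs):

        use strict;
        use warnings;
        use File::SortedSeek qw(alphabetic);

        open my $fh, '<', 'sorted_records.txt' or die "open: $!";

        # Binary-searches the sorted file and positions the handle at the
        # first line ge the key: O(log n) seeks instead of a full scan.
        my $tell = alphabetic($fh, 'needle');
        die "key not found" unless defined $tell;

        while (my $line = <$fh>) {
            last unless $line =~ /^needle/;   # stop once past the matches
            print $line;
        }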


    Mike
      Thanks, I do actually need to process the whole thing, but I'll keep that in mind for later anyway.
Re: Going through a big file [solved]
by space_monk (Chaplain) on Jan 17, 2013 at 16:15 UTC
    LanX's comment at Re: Going through a big file gives an excellent explanation of why for slurps the whole lot into an array while while reads it line by line. The only comment I would add is: why bother opening the files manually when it can be done implicitly?
    while (<>) { }
    does the job. Arguments other than the file(s) you want to process should be flags. In fact, you don't even need to write this code as it is automatically assumed when you run perl with flags such as -n or -p. See perlrun for more information.
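    For instance (the pattern and filenames are placeholders):

        # -n wraps the code in:  while (<>) { ... }
        perl -ne 'print if /<interesting>/' huge.xml

        # -p does the same but also prints $_ after every pass
        perl -pe 's/foo/bar/g' huge.xml > fixed.xml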
    A Monk aims to give answers to those who have none, and to learn from those who know more.
Re: Going through a big file [solved]
by topher (Scribe) on Jan 18, 2013 at 00:18 UTC

    I'd recommend taking up the suggestion implicit in Anonymous Monk's question and investigating a CPAN module to assist you with the XML processing. For example, I know XML::Twig can handle huge XML files, and I'm certain there are others available, too.

    I love regular expressions, and they've helped me solve all sorts of nightmare formats, but XML is already a structured data source; using a tool that lets you take advantage of that structure will usually make things easier on you and more robust.
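    A minimal XML::Twig sketch of that streaming style (the <record> element name and the pattern are made up; the purge call is what keeps memory flat on a 58 GB file):

        use strict;
        use warnings;
        use XML::Twig;

        XML::Twig->new(
            twig_handlers => {
                # fires once per <record> element, as soon as it is complete
                record => sub {
                    my ($twig, $elt) = @_;
                    my $text = $elt->text;
                    print "matched: $text\n" if $text =~ /pattern/;
                    $twig->purge;    # discard everything parsed so far
                },
            },
        )->parsefile(shift @ARGV);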

Re: Going through a big file
by Anonymous Monk on Jan 17, 2013 at 14:26 UTC
    Is this a thing that can be done using an XML library? XPath? XSLT?
Re: Going through a big file [solved]
by pvaldes (Chaplain) on Jan 22, 2013 at 16:15 UTC
    And, just for the purpose of being annoying, you can also try something in between: read your huge file in chunks of n characters or n lines. See "perldoc -f read" and "perldoc -v $.".
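    Something along these lines, where the 8 MB chunk size and the 1000-line batch are arbitrary picks:

        use strict;
        use warnings;

        open my $in, '<:raw', shift @ARGV or die "open: $!";

        # Fixed-size chunks via read().
        my $chunk;
        while (read $in, $chunk, 8 * 1024 * 1024) {
            # ... run the regexes over $chunk; note that a match can
            # straddle a chunk boundary, which needs extra care ...
        }

        # Or batches of n lines, counting with $. (the line number).
        seek $in, 0, 0;
        my @batch;
        while (my $line = <$in>) {
            push @batch, $line;
            next if $. % 1000;
            # ... process the 1000 lines in @batch ...
            @batch = ();
        }
        # ... process whatever remains in @batch ...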