sdyates2001 has asked for the wisdom of the Perl Monks concerning the following question:

How do I process large files line by line? My system cannot handle loading 200MB+ files into memory and then processing them.

Replies are listed 'Best First'.
Re: Large file processed line by line
by btrott (Parson) on Jun 19, 2001 at 05:02 UTC
    The general idiom is to use a while loop to iterate over the lines in the file, reading in one line at a time and processing it, then moving on to the next.

    Something like this:

    open FH, "foo" or die "Can't open foo: $!";
    while (<FH>) {
        ## current line is in $_, process it
    }
    close FH or warn "Error closing foo: $!";
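    The same idiom in more modern style uses a lexical filehandle and the three-argument form of open. This sketch is self-contained (it writes a tiny sample file named foo first, which is an assumption for illustration only) and just counts lines as its "processing":

    ```perl
    use strict;
    use warnings;

    # Create a small sample file so the sketch is self-contained.
    open my $out, '>', 'foo' or die "Can't write foo: $!";
    print $out "first line\nsecond line\n";
    close $out;

    # The line-by-line idiom with a lexical filehandle and
    # three-argument open: only one line is in memory at a time.
    open my $fh, '<', 'foo' or die "Can't open foo: $!";
    my $count = 0;
    while ( my $line = <$fh> ) {
        chomp $line;    # strip the trailing newline
        $count++;       # "process" the line: here we just count it
    }
    close $fh or warn "Error closing foo: $!";
    print "read $count lines\n";    # prints "read 2 lines"
    unlink 'foo';
    ```

    The lexical filehandle closes automatically when $fh goes out of scope, and the three-argument open avoids surprises with filenames that start with characters like ">".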
    Depending on your situation, you might also want to check out the -p and -n command line flags to perl (perlrun).

    If these are files specified on the command line, you can use the special construct:

    while (<>) {
        ## line is in $_
    }
    This might be useful in a situation like
    $ process.pl foo.txt bar.txt baz.txt
    to process each of the files on the command line.
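      Inside that loop, the special variable $ARGV holds the name of the file currently being read, which is handy when you need per-file handling. A self-contained sketch (the file names a.txt and b.txt are made up for the demonstration; normally @ARGV would come from the command line):

      ```perl
      use strict;
      use warnings;

      # Self-contained sketch: write two tiny files, then read them via <>.
      for my $name ('a.txt', 'b.txt') {
          open my $out, '>', $name or die "Can't write $name: $!";
          print $out "foo in $name\nno match here\n";
          close $out;
      }

      # Simulate "perl script.pl a.txt b.txt": <> reads each file named
      # in @ARGV line by line, and $ARGV names the file being read.
      @ARGV = ('a.txt', 'b.txt');
      my @hits;
      while (<>) {
          push @hits, "$ARGV: $_" if /foo/;
      }
      print @hits;    # "a.txt: foo in a.txt" then "b.txt: foo in b.txt"
      unlink 'a.txt', 'b.txt';
      ```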
      Don't forget about the ever-faithful -i.bak command line flag. It's one of my favourites. It edits a file "in-place" one line at a time. This should delete everything that doesn't contain the string "foo" (but I haven't tested it, sorry):
      perl -epi.bak "print if(m/foo/);" foo.txt bar.txt baz.txt
      Update:
      Read on for the correct answer. I really should have tested it. Thanks Mirod and Btrott. ++ to both of you. -- for me.

        You really should have tested it:

        • -e should be immediately followed by the script to run,
        • -p prints the current line, so you don't have to do it yourself, -n is what you want in this case.

        This (tested!) script would work as:

        perl -i.bak -n -e"print if(m/foo/);" foo.txt bar.txt baz.txt

        From perldoc perlrun:

        -n causes Perl to assume the following loop around your program, which makes it iterate over filename arguments somewhat like sed -n or awk:

            LINE: while (<>) {
                ...     # your program goes here
            }

        "BEGIN" and "END" blocks may be used to capture control before or after the implicit program loop, just as in awk.

        -p causes Perl to assume the following loop around your program, which makes it iterate over filename arguments somewhat like sed:

            LINE: while (<>) {
                ...     # your program goes here
            } continue {
                print or die "-p destination: $!\n";
            }

        If a file named by an argument cannot be opened for some reason, Perl warns you about it, and moves on to the next file. Note that the lines are printed automatically. An error occurring during printing is treated as fatal. To suppress printing use the -n switch. A -p overrides a -n switch. "BEGIN" and "END" blocks may be used to capture control before or after the implicit loop, just as in awk.
        Right, -i is quite cool. But all -i does is open the file for in-place editing; it doesn't "edit the file one line at a time". Notice that your command line above also includes -p; that's the switch actually doing the line-by-line processing.