How to read in large files

by Only1KW (Sexton)
on Jan 28, 2016 at 15:34 UTC (#1153883)
Only1KW has asked for the wisdom of the Perl Monks concerning the following question:

I am attempting to parse a large file (~10 GB), but I keep running out of system memory. I start out with about 3.5 GB used and end up using all 16 GB before Linux kills the process.

Here is the code that is failing for me:

#!/usr/bin/perl
use strict;
use warnings;

$| = 1;

open(my $readHandle, '<', "File.txt") or die "Failed\n";
print "Start Read\n";
foreach my $line (<$readHandle>) {
    print "Read Line\n";
    print "Found!\n" if ($line =~ /MatchingText/);
}
close $readHandle;

When I run this program, "Start Read" is printed to the screen, but I never see "Read Line".

I've done a bunch of Googling on this, and found a bunch of hits, but everything I read says to just read the file one line at a time to get around the size. But isn't that what I'm already doing??? All the specific examples I see for fixing the issue focus on other things the user is also doing which are also taking up memory.

Replies are listed 'Best First'.
Re: How to read in large files
by Corion (Pope) on Jan 28, 2016 at 15:35 UTC

    You're reading the complete file into memory before doing anything. If your program logic allows for this, it's easy to rewrite it by processing the file line-by-line. Change:

    foreach my $line (<$readHandle>) {

    to

    while (defined( my $line= <$readHandle>)) {
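    Put together, the line-by-line version might look like the sketch below. The count_matching_lines helper and the small temporary file are assumptions added for illustration (they stand in for the poster's loop and the 10 GB File.txt so the example runs anywhere); only the while loop itself comes from the change described above.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Hypothetical helper (not from the thread): scan a file one line at a
    # time and return how many lines match $pattern.
    sub count_matching_lines {
        my ($path, $pattern) = @_;
        open(my $readHandle, '<', $path) or die "Failed to open $path: $!\n";
        my $count = 0;
        # Scalar-context read: only the current line is held in $line.
        while (defined(my $line = <$readHandle>)) {
            $count++ if $line =~ $pattern;
        }
        close $readHandle;
        return $count;
    }

    # Small stand-in for the large input file (an assumption for the demo).
    my ($tmpHandle, $tmpName) = tempfile(UNLINK => 1);
    print {$tmpHandle} "first line\n", "has MatchingText here\n", "last line\n";
    close $tmpHandle;

    my $found = count_matching_lines($tmpName, qr/MatchingText/);
    print "Found $found matching line(s)\n";
    ```

    Because $line is declared inside the loop condition, each line can be released before the next one is read, so memory use should stay roughly constant regardless of file size.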

      Corion has correctly identified the issue, but I think a little more explanation might be helpful. When you use foreach, perl reads every line and constructs the entire list before the first iteration runs, which is why "Read Line" never appears. A while loop, on the other hand, re-evaluates its condition before every iteration, so exactly one read is attempted per pass and the body runs only after a successful read. In addition, $line goes out of scope at the end of each iteration, so you only store one line at a time instead of all of them.


      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

        And this nuance occurs because:

        • In the case of foreach ( EXPRESSION ) { BLOCK }, the EXPRESSION is evaluated in list context. The <FILEHANDLE> operator returns a list of all remaining records from the file when evaluated in list context. Records are delimited by the input record separator $/, which defaults to a newline, so they are normally lines.
        • In the case of while ( EXPRESSION ) { BLOCK }, the EXPRESSION is evaluated in scalar context for its Boolean value. The <FILEHANDLE> (diamond) operator returns a single record from the file when evaluated in scalar context. A bare while (<FILEHANDLE>) is treated as while (defined($_ = <FILEHANDLE>)), so a line such as "0" does not end the loop early.

        Dave
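
        The context difference described above can be seen in a small sketch. It reads from an in-memory filehandle (an assumption made so the example needs no file on disk); the variable names are illustrative and not from the thread.

        ```perl
        use strict;
        use warnings;

        my $data = "one\ntwo\nthree\n";

        # List context: the read returns every remaining record at once.
        open(my $listHandle, '<', \$data) or die "Cannot open in-memory file: $!\n";
        my @all = <$listHandle>;
        close $listHandle;
        print "List context returned ", scalar(@all), " records\n";

        # Scalar context: each read returns exactly one record.
        open(my $scalarHandle, '<', \$data) or die "Cannot open in-memory file: $!\n";
        my $first  = <$scalarHandle>;
        my $second = <$scalarHandle>;
        close $scalarHandle;
        chomp($first, $second);
        print "Scalar context returned '$first', then '$second'\n";
        ```

        With a 10 GB file, the list-context read is the one that would try to hold every record in @all at the same time.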
