PerlMonks  

Processing large files

by Dr Manhattan (Beadle)
on Aug 21, 2013 at 06:16 UTC (#1050286)
Dr Manhattan has asked for the wisdom of the Perl Monks concerning the following question:

Hi all

I have a text file that I want to extract some information from, but the file is too large to read all at once. So at the moment I'm trying to read in 3000 lines at a time, process them and extract the info, print it, clear the memory, and then go on to the next 3000 lines.

This is the code I am currently trying out:

my @array;
my $counter = 0;

while (<Input>) {
    my $line = $_;
    chomp $line;
    push (@array, $line)

    if ($counter = 3000) {
        my @information;

        foreach my $element (@array) {
            #extract info from $element and push into @information
        }

        for my $x (@information) {
            print Output "$x\n";
        }

        $counter = 0;
        @information = ();
    }
}

However, when I try this the output file just never stops growing, so I think I might be creating an endless loop somewhere. Any ideas/pointers?

Thanks in advance for any help

Re: Processing large files
by BrowserUk (Pope) on Aug 21, 2013 at 06:25 UTC
    if ($counter = 3000)

    If you had warnings enabled, Perl would tell you:

    if ($counter = 3000) { 1 };;
    Found = in conditional, should be == at ...
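To make the difference concrete, here is a minimal sketch. The names `$limit`, `$assign_hits` and `$compare_hits` are made up for illustration and are not from the original code. It assigns from a variable rather than a literal because, as far as I know, the "Found = in conditional" warning is only raised when a constant is assigned, so this variant stays silent even with warnings enabled.

```perl
use strict;
use warnings;

my $limit = 3000;    # hypothetical threshold, held in a variable
my ($assign_hits, $compare_hits) = (0, 0);

for my $n (1 .. 5) {
    my $counter = $n;

    # Assignment: sets $counter to 3000 and yields 3000, which is true.
    $assign_hits++ if ($counter = $limit);

    # Comparison: true only when $n really equals 3000.
    $compare_hits++ if $n == $limit;
}

print "assignment was true $assign_hits times\n";
print "comparison was true $compare_hits times\n";
```

With `=` the block runs on every single line read, which would explain an output file that keeps growing.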

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Processing large files
by Athanasius (Monsignor) on Aug 21, 2013 at 06:36 UTC

    BrowserUk has identified the main problem. In addition:

    Looks like @array’s memory is never cleared. Also, you don’t need to clear @information explicitly: it’s a lexical variable declared inside the if block, so it will be re-initialised each time the if condition is true. So, change the line @information = (); to:

    @array = ();
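A small sketch of that scoping point, using a made-up batch size of 3 and numbers instead of file lines (both are assumptions for the demo). `@information` is declared inside the block and starts empty on every entry, while `@array` lives outside the loop and has to be emptied by hand.

```perl
use strict;
use warnings;

my @array;    # declared outside the loop, so it persists between iterations
my @sizes;    # hypothetical bookkeeping, only here to show what happened

for my $line (1 .. 6) {
    push @array, $line;

    if (@array == 3) {
        my @information;    # fresh and empty each time the block is entered
        push @information, map { $_ * 10 } @array;
        push @sizes, scalar @information;

        @array = ();        # without this, @array would keep growing
    }
}

print "sizes of \@information per batch: @sizes\n";
print "lines left in \@array: ", scalar(@array), "\n";
```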

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Processing large files
by Anonymous Monk on Aug 21, 2013 at 06:45 UTC
Re: Processing large files
by kcott (Abbot) on Aug 21, 2013 at 09:19 UTC

    G'day Dr Manhattan,

    This would be an ideal situation in which to use the built-in module Tie::File. That won't suffer from memory issues due to the size of the input file and would allow you to eliminate the need for the while loop, chomp, push, if condition and $counter. Also, you don't appear to be storing data in @information for subsequent use so you can eliminate that variable and the for loop that processes it. Here's roughly what you'd need:

    use strict;
    use warnings;
    use autodie;

    use Tie::File;

    tie my @input_data, 'Tie::File', 'your_input_filename';
    open my $output_fh, '>', 'your_output_filename';

    for my $record (@input_data) {
        my $extracted_info = ...; # extract info from $record here
        print $output_fh "$extracted_info\n";
    }

    untie @input_data;
    close $output_fh;

    -- Ken
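For anyone who wants to try the Tie::File approach without a real input file, here is a self-contained sketch. The sample data, the temporary file and the "take the first word" extraction rule are all assumptions made for the demo. It opens the file read-only through the `mode` option, since Tie::File otherwise opens the file for reading and writing.

```perl
use strict;
use warnings;
use Fcntl 'O_RDONLY';
use File::Temp 'tempfile';
use Tie::File;

# Hypothetical sample input, written to a temporary file for the demo.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "alpha 1\nbeta 2\ngamma 3\n";
close $tmp_fh or die "Cannot close $tmp_name: $!";

# Each array element is one line of the file, already chomped.
tie my @input_data, 'Tie::File', $tmp_name, mode => O_RDONLY
    or die "Cannot tie $tmp_name: $!";

my @extracted;
for my $record (@input_data) {
    # Made-up extraction rule: keep the first whitespace-separated word.
    my ($first_word) = split ' ', $record;
    push @extracted, $first_word;
}

untie @input_data;

print "$_\n" for @extracted;
```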

Re: Processing large files
by derby (Abbot) on Aug 21, 2013 at 11:32 UTC

    Others have pointed out the real problem with your code but I would like to point out that with most IO architectures, you're probably not going to gain much performance by buffering your input this way -- the underlying library calls for read are probably already buffering. You may want to Benchmark your buffering approach with a standard line-by-line approach. If the differences are minimal, I would opt for the simpler code.
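One way to run such a comparison is with `cmpthese` from the core Benchmark module. This is only a sketch: the subroutine names, the generated data and the "sum the line lengths" workload are invented for illustration, and it reads from an in-memory filehandle, so it says nothing about real disk I/O. Point it at the actual file and the actual extraction code before drawing conclusions.

```perl
use strict;
use warnings;
use Benchmark 'cmpthese';

# Hypothetical input: 20,000 short lines held in memory.
my $data = join '', map { "line $_\n" } 1 .. 20_000;

# Plain line-by-line processing.
sub line_by_line {
    open my $in, '<', \$data or die "Cannot open in-memory file: $!";
    my $total = 0;
    while (my $line = <$in>) {
        chomp $line;
        $total += length $line;
    }
    return $total;
}

# Collect lines into batches, process each batch, then the leftovers.
sub batched {
    my ($batch_size) = @_;
    open my $in, '<', \$data or die "Cannot open in-memory file: $!";
    my ($total, @batch) = (0);
    while (my $line = <$in>) {
        chomp $line;
        push @batch, $line;
        next if @batch < $batch_size;
        $total += length $_ for @batch;
        @batch = ();
    }
    $total += length $_ for @batch;    # final partial batch
    return $total;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    line_by_line => \&line_by_line,
    batched_3000 => sub { batched(3000) },
});
```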

    -derby
Re: Processing large files
by Laurent_R (Vicar) on Aug 21, 2013 at 11:39 UTC

    Why don't you simply read one line at a time, process it, print out what you need to output, and then go to the next line? FH iterators are great, and input buffering is done under the surface anyway (unless you take steps to prevent it).
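A minimal sketch of that line-at-a-time pattern, with lexical filehandles. The in-memory input and output and the `name=` extraction rule are assumptions standing in for the real files and the real extraction logic. Only one line is held in memory at any time.

```perl
use strict;
use warnings;

# Hypothetical input; in real use, open the file by name instead.
my $input_text = "id=1 name=foo\nno match here\nid=2 name=bar\n";
open my $in, '<', \$input_text or die "Cannot open input: $!";

my $output_text = '';
open my $out, '>', \$output_text or die "Cannot open output: $!";

while (my $line = <$in>) {
    chomp $line;

    # Made-up extraction rule: keep the value that follows "name=".
    next unless $line =~ /name=(\w+)/;
    print {$out} "$1\n";
}

close $in;
close $out or die "Cannot close output: $!";

print $output_text;
```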

Re: Processing large files
by mtmcc (Hermit) on Aug 21, 2013 at 12:17 UTC
    As well as the above points, you don't seem to increment $counter at any point either, so it won't reach 3000...
Re: Processing large files
by Preceptor (Chaplain) on Aug 21, 2013 at 19:06 UTC

    I can't actually understand why you're trying to read line by line, and then batch process every 3000. Are the data in those 3000 lines in some way correlated? Otherwise, you're not really doing much good - a 'while' loop will do what you want without needing to buffer anything.

    while ( my $line = <Input> ) {
        # do stuff; print output
    }

    It does depend a little though, what some of your loops are doing. But a 'while' based traverse of a file won't read the whole file all at once (unless you deliberately 'make it' do that).

Re: Processing large files
by zork42 (Monk) on Aug 22, 2013 at 06:07 UTC
    When you've fixed the bugs mentioned above, you'll also need to squash this bug:

    Once the while loop exits, you need to process the remaining 1 to 2999 lines in @array.

    Whenever you process anything in "chunks" > 1 thing, always remember to process the final partial chunk (if it exists).
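Putting the fixes from this thread together, a batched version could look roughly like the sketch below. The batch size of 3, the in-memory input, the `process_batch` helper and the upper-casing "processing" are all placeholders chosen for the demo, not part of the original code. It compares with `==`, increments `$counter`, clears `@array` after each batch, and handles the final partial chunk after the loop.

```perl
use strict;
use warnings;

my $batch_size = 3;    # the original question used 3000
my $input_text = join '', map { "record $_\n" } 1 .. 8;
open my $in, '<', \$input_text or die "Cannot open input: $!";

my (@array, @batch_sizes, @output);

# Hypothetical helper: "process" a batch by upper-casing every line.
sub process_batch {
    my ($lines) = @_;
    return unless @$lines;              # nothing left over, nothing to do
    push @batch_sizes, scalar @$lines;
    push @output, map { uc $_ } @$lines;
    return;
}

my $counter = 0;
while (my $line = <$in>) {
    chomp $line;
    push @array, $line;
    $counter++;                          # the increment the original lacked

    if ($counter == $batch_size) {       # comparison, not assignment
        process_batch(\@array);
        @array   = ();                   # release the processed lines
        $counter = 0;
    }
}

process_batch(\@array);                  # the final partial chunk
close $in;

print "batch sizes: @batch_sizes\n";
print "$_\n" for @output;
```

With 8 input records and a batch size of 3, the last call handles the 2 records that would otherwise be dropped.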
