http://www.perlmonks.org?node_id=1113334

wrkrbeee has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks, the code below generates an "out of memory" error. I suspect that the code is reading the entire file before writing/saving the results to disk, instead of reading the file one line at a time. I am grateful for any insight you may have. Thank you! EDIT: I replaced the foreach loop with a while loop (as suggested by choroba), but continue to receive the out-of-memory error. FYI, the crash occurs when downloading a 450 MB text file, if that helps. The machine has 6 GB of memory installed. I apologize for my ignorance; what I know about Perl fits easily in a dollhouse thimble.

#!/usr/bin/perl
use LWP;
use HTTP::Request;

sub get_http {
    my $url = shift;
    my $request = HTTP::Request->new(GET => $url);
    my $response = $ua->request($request);
    if (!$response->is_success) {
        print STDERR "GET '%s' failed: %s\n", $url, $response->status_line;
        return undef;
    }
    return $response->content;
}

# user agent object for handling HTTP requests
my $ua = LWP::UserAgent->new;

# if you only want a portion of the filing, un-comment the next line
#$ua->max_size(50000); # 50k byte limit

######################### write dir, use "\\" and not "\", for example: "C:\\temp"
$write_dir = "C:\\Volumes\\EDGAR1\\Edgar\\Edgar2\\10K_10Q\\2014";
######################### write dir

######################### filename with urls (put in same directory as script)
open dlthis, "Data2014.txt" or die $!;
######################### filename with urls (put in same directory as script)

######################### log
open LOG, ">download_log.txt" or die $!;
######################### log

my @file = <dlthis>;
foreach $line (@file) {
    # CIK, filename, blank is not used (included because it will capture the newline)
    ($CIK, $get_file, $blank) = split (",", $line);
    $get_file = "http://www.sec.gov/Archives/" . $get_file;
    $_ = $get_file;
    if ( /([0-9|-]+).txt/ ) {
        $filename = $write_dir . "/" . $CIK . ".txt";
        open OUT, ">$filename" or die $!;
        print "file $CIK \n";
        my $request = HTTP::Request->new(GET => $get_file);
        my $response = $ua->get($get_file);
        $p = $response->content;
        if ($p) {
            print OUT $p;
            close OUT;
        }
        else {
            # error logging
            print LOG "error in $filename - $CIK \n";
        }
    }
}
close LOG;

Replies are listed 'Best First'.
Re: Out of memory
by choroba (Cardinal) on Jan 15, 2015 at 15:08 UTC
    I suspect that the code is reading the entire file
    Your suspicion is correct. The following line reads the whole file into an array of lines:
    my @file = <dlthis>;

    Remove the line and change the following for-loop into a while-loop:

    while ($line = <dlthis>) {

    It will process the file line by line (untested).
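    Put together, the suggested change looks roughly like this (an untested sketch using the OP's variable names; the DATA section below stands in for the Data2014.txt filehandle, and the loop body only prints what it would fetch):

```perl
use strict;
use warnings;

# Hypothetical sketch of the while-loop version: the file is read one
# line at a time, so only the current line is ever held in memory.
# The DATA section stands in for the OP's "Data2014.txt" filehandle.
while (my $line = <DATA>) {
    chomp $line;
    my ($CIK, $get_file) = split /,/, $line;
    $get_file = "http://www.sec.gov/Archives/" . $get_file;
    print "$CIK => $get_file\n";
}

__DATA__
1234,edgar/data/1234/0001234-14-000001.txt
5678,edgar/data/5678/0005678-14-000002.txt
```

    The rest of the original loop body (building the filename, fetching the URL, writing the output file) stays inside the same curly brackets as before.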

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Great! This will require a curly bracket to close the WHILE loop, correct? If so, I am placing the closing bracket after the last curly bracket in the code, correct? Thank you so much!!
        No, just reuse the right curly bracket that originally closed the for-loop.
        No, as already said, you have to replace the for-loop.

        Look, this is no "copy-and-paste code fixing service"; you need to try to understand what your code does.

        Cheers Rolf

        PS: Je suis Charlie!

Re: Out of memory
by Athanasius (Archbishop) on Jan 15, 2015 at 15:14 UTC

    Hello wrkrbeee,

    When I added use strict to your code, I got a screenful of error messages. The first,

    Global symbol "$ua" requires explicit package name at ...

    highlights what looks like a serious problem: namely, that the $ua accessed within sub get_http is not the lexical variable declared below the sub, but rather an unrelated package global (which is uninitialised).

    Please, add:

    use strict;
    use warnings;

    to the head of your script and fix the resultant errors before proceeding.

    Update: Ok, sub get_http isn’t actually called in the code shown. But when it is called, in the larger programme, you want it to work correctly, right? Why make things harder for yourself by working without a safety net?
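    One way to avoid the problem (a sketch only, not necessarily how the larger programme should be organised) is to pass the user agent into get_http explicitly, so the sub no longer depends on a package global:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Sketch: take the user agent as an argument instead of relying on a
# global $ua. (Note printf rather than print for the %s format — the
# original code used print, which does no format interpolation.)
sub get_http {
    my ($ua, $url) = @_;
    my $response = $ua->get($url);
    if (!$response->is_success) {
        printf STDERR "GET '%s' failed: %s\n", $url, $response->status_line;
        return undef;
    }
    return $response->content;
}

my $ua = LWP::UserAgent->new;
# my $content = get_http($ua, 'http://www.sec.gov/Archives/...');
```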

    Hope that helps,

    Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

      Look at this
      #!/usr/bin/perl --
      use strict;
      use warnings;
      use LWP::Simple qw/ $ua /;
      $ua->show_progress(1);
      $ua->get('http://example.com');
      __END__
      ** GET http://example.com ==> 200 OK (2s)
      Thank you for your help! I inherited this program and simply attempted to modify a minuscule portion to achieve my goal. Hence, the "package global" you mentioned is foreign to me. It sounds like a utility package incorporated by the user when needed. Guess the big question is "do I need it in this instance?" Sorry for the question.

        Hello

        When Athanasius talked about "package global", he was not referring to a package called "global", but was referring to variables that are global to the current package, which in your case, is probably "main".

        Lexical variables, which are declared with my, are only available to the end of the enclosing scope. Examples:

        {
            my $i;
            {
                my $k;
                ...;    # both $i and $k are available here
            }
            ...;        # only $i is available here
            for my $n (0 .. 9) {
                ...;    # $i and $n are available here
            }
            ...;        # only $i is available here
        }
Re: Out of memory
by fishmonger (Chaplain) on Jan 15, 2015 at 15:23 UTC

    How big is Data2014.txt?

    I doubt that the out of memory problem is due to slurping that file into an array instead of looping over it line-by-line.

    I suspect that the out of memory issue is related to the retrieval of the url/file. Instead of copying the content into one or more vars, I'd output it directly to the file.

    EDIT:
    The LWP::Simple module has a getstore() function that will simplify that process.
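    A sketch of that approach (the URL and output filename below are placeholders, not values from the OP's data):

```perl
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);

# getstore() streams the response straight to disk, so the document
# body is never accumulated in a Perl scalar — avoiding the
# out-of-memory problem for large filings.
my $url      = 'http://www.sec.gov/Archives/...';   # placeholder
my $filename = '1234.txt';                          # placeholder
my $status   = getstore($url, $filename);
print STDERR "download of $url failed: $status\n"
    unless is_success($status);
```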

      Size of Data2014.txt is 1.5 MB, so you are probably correct. Any tips for the output directly to the file? Thank you!