Out of Memory Error -- Possible Leak?

by Anonymous Monk
on Dec 13, 2005 at 23:26 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

G'Day Oh Wise Ones

I come to ye with some great confuzzlement. I have a loop like so:

for (my $i = 0; $i < scalar(@array); $i++) {
    for (my $j = $i; $j < scalar(@array); $j++) {
        foreach my $value (@{$hash{$array[$i]}{$array[$j]}}) {
            # do great deeds
        }
    }
}

This loop crashes. It is apparently the foreach that is causing the crash, as this still crashes:

for (my $i = 0; $i < scalar(@array); $i++) {
    for (my $j = $i; $j < scalar(@array); $j++) {
        foreach my $value (@{$hash{$array[$i]}{$array[$j]}}) {
            next();
            # do great deeds
        }
    }
}

But importantly this does not crash:

for (my $i = 0; $i < scalar(@array); $i++) {
    for (my $j = $i; $j < scalar(@array); $j++) {
        next();
        foreach my $value (@{$hash{$array[$i]}{$array[$j]}}) {
            # do great deeds
        }
    }
}

I should be clear: the data structures are exactly as presented. @array is a simple array, while %hash is a hash of hashes of arrays. This HoHoA is really a DBM::Deep object, however, as I had expected this to be more efficient on memory usage.

I should also point out: this code works properly with smaller datasets, but not with my full dataset, which includes millions of data-points in the HoHoA and tens of thousands of elements in the arrays.

In my simple-minded view this indicates a memory-leak in the foreach loop. Am I missing something? In general, should I be doing something different rather than a foreach? Or am I vastly confuzzled and missing something obvious to ye great and mighty ones?

Re: Out of Memory Error -- Possible Leak?
by graff (Chancellor) on Dec 14, 2005 at 04:05 UTC
    The scenario implies that the HoHoA is already loaded at the point that you go into this three-level loop over the data structure, and you obviously have not run out of memory at that point. So there must be something about how you are trying to access the data structure that is causing perl to consume a lot more memory than was needed to store the original data structure.

    You say the structure is "not sparse", but if loading the structure happens to take up, say, 80% of available memory, then you might end up going over the limit if you auto-vivify a relatively small percentage of previously non-existent cells in the overall structure.
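
    For what it's worth, a tiny stand-alone illustration of that effect (not the OP's data; the key names are made up): a multi-level lookup autovivifies the intermediate levels even when you are only reading, and the foreach dereference can vivify the leaf as well.

    use strict;
    use warnings;

    my %hash;

    # Looping over a slot that does not exist still creates structure:
    # the chained subscript autovivifies $hash{foo}, and the deref in
    # the foreach can autovivify the leaf array ref too.
    foreach my $value ( @{ $hash{foo}{bar} } ) { }

    print exists $hash{foo} ? "foo was autovivified\n" : "foo absent\n";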

    I wonder if it would help to try the loop like this:

    for (my $i = 0; $i < scalar(@array); $i++) {
        my $subhash = $hash{$array[$i]};   # does not autovivify new %hash element
        next unless $subhash;              # equivalent to "exists($hash{$array[$i]})"
        for (my $j = $i; $j < scalar(@array); $j++) {
            my $subarray = $$subhash{$array[$j]};   # same comments as above
            next unless $subarray;
            foreach my $value ( @$subarray ) {
                # do great deeds
            }
        }
    }
    If the OP code actually works on a given (smaller) data set, it might turn out that this different form of the nested looping could produce the same output in less time (but I haven't tested that).

    The point of this approach is that you can easily check whether a given hash element exists before trying to use its contents as an array ref, and skip it if it does not exist; this involves extra steps within the loops, but these could end up saving a lot of execution time overall. And if they save enough unnecessary memory consumption as well (making the difference between crashing and finishing), speed may be a lesser concern.

    update: I think the code as suggested above would be "good enough", if you're confident about how the data structure was being created, but I would be tempted to be more careful:

    next unless ( ref( $subhash ) eq 'HASH' );
    ...
    next unless ( ref( $subarray ) eq 'ARRAY' );
    ...

      Hello, thanks for the suggestions!

      I think I should have been more specific -- when I said "not sparse", I meant that the data structure had been fully created in advance with a loop exactly like this one. I've added in the checks, and before the program crashes (about 3 hours of CPU) it does not ever autovivify.

      One thing that goes on inside the #do great deeds is to change values in the Array part of the HoHoA, and I had expected that to be the cause of my memory leak. I'm really confuzzled as to why this:

      foreach my $value ( @$subarray ) {

      would eat up memory. I would have guessed it would use one scalar's worth (perhaps 15 bytes of data), and that this memory would get reused over & over again. Instead, the script runs on this loop for a few hours before using up all the memory.

        I'm really confuzzled as to why this:

        foreach my $value ( @$subarray ) {

        would eat up memory.

        That line expands the contents of the array into a list (i.e. onto Perl's stack). If the array pointed to by $subarray is large, it will consume a large amount of memory to do that.

        You can avoid that by indexing the array rather than aliasing its elements. E.g.

        for my $index ( 0 .. $#$subarray ) {
            my $value = $subarray->[ $index ];
            ...
        }

        Or, if you need to modify the elements, substitute $subarray->[ $index ] wherever you are currently using $value.
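
        For the modification case, a small sketch (transform() and the sample data are just stand-ins for the real code):

        my $subarray = [ 1 .. 5 ];               # stand-in for the real nested array
        sub transform { my ($v) = @_; $v * 2 }   # placeholder for the real work

        for my $index ( 0 .. $#$subarray ) {
            # assigning through the index updates the underlying array in
            # place, without flattening it onto the stack first
            $subarray->[ $index ] = transform( $subarray->[ $index ] );
        }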


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

      YES YES YES!!!

      No matter how sure I was that the data structure was being created properly, I decided to put code like this inside my loop:

      if ( exists( $data{$array[$i]}{$array[$j]} ) ) {
          next();
      }
      else {
          print OUT join("\t", $i, $j), "\n";
      }

      and... wouldn't ya know it, there *were* elements being autovivified. I've fixed that problem, and now the code appears to be running correctly (appears = it's been running for a day without crashing yet -- it still has another couple of days to run, though).

      Many thanks -- this has led me to one key problem in the code. Any monks who haven't done so yet, please ++ the parent.

Re: Out of Memory Error -- Possible Leak?
by Zaxo (Archbishop) on Dec 13, 2005 at 23:44 UTC

    How much memory do you have? Your data structures are immense. It looks like you have perhaps 100,000,000 arrays in the hash, and even if most are not there, you are autovivifying their keys (scalar(@array) different anonymous hashes) and a reference to an empty array.

    All those structures cost fifty-some bytes each plus a hefty multiple of the number of data elements.

    I don't see a leak, just prolific use of memory. Can you read and use less of the data at a time?
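
    (Side note, not something raised in the thread: if you would rather measure than estimate, Devel::Size can report what a structure actually occupies.)

    use Devel::Size qw(size total_size);

    my %hash = ( a => { b => [ 1 .. 10 ] } );

    print size( \%hash ), "\n";        # bytes for the top-level hash alone
    print total_size( \%hash ), "\n";  # bytes including everything it references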

    After Compline,
    Zaxo

      The machine itself has 32 GB, but I'm using a 32-bit compiled Perl so presumably it can only address 4 of those.

      Also, 100 million arrays is a realistic estimate (it's not sparse). I had thought that by using DBM::Deep I would be storing the majority of this data on disk, so that the actual amount stored in memory would be much smaller than 100 million x size(data-structure). Am I misguided on that?
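
      (For readers who haven't used it: a minimal sketch of the usual DBM::Deep setup, with an illustrative filename. The structure lives in the file and nodes are pulled from disk as they are accessed, which is also why per-element access is far slower than a plain in-memory hash.)

      use DBM::Deep;

      # Object interface: the nested data is stored in hash.db, not in RAM.
      my $db = DBM::Deep->new( "hash.db" );
      $db->{key1}{key2} = [ 1, 2, 3 ];

      # Equivalent tie interface:
      tie my %hash, 'DBM::Deep', 'hash.db';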

      maybe it's because the garbage collector does not have time to run ...
        That isn't possible, because Perl doesn't use a separate garbage collector. It simply reference-counts all its data (scalars, arrays, hashes etc) and frees an item when its reference count drops to zero.

        (This is why circular data structures cause memory leaks.)
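
        A minimal illustration of that last point (nothing to do with the OP's structures): two things that point at each other never reach a zero count unless one link is weakened.

        use Scalar::Util qw(weaken);

        my $parent = {};
        my $child  = { parent => $parent };
        $parent->{child} = $child;     # cycle: neither refcount can reach zero

        weaken( $child->{parent} );    # the back-reference no longer counts,
                                       # so both structures can be freed normally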

Re: Out of Memory Error -- Possible Leak?
by runrig (Abbot) on Dec 13, 2005 at 23:50 UTC
    You are close to running out of memory before you hit the foreach loop. You should probably start thinking about using some sort of database rather than keeping all that data in memory.

      Actually this originally was in a database, but the performance.... Benchmarking suggested it would take 5 months to complete, and since this was to be an annual task we're aiming to get it down to two or three weeks. Since DB access was the second biggest cost, it appeared to be an obvious place to tune, so this is an attempt to *replace* the DB look-ups and make a space/time trade-off.

        I can't say I've ever even tried to put that much data into memory at once, but I have dealt with some large collections. You might try chunking up the original data and storing it in temporary tables you can iterate through so you have less to work with at one time. If your database provides something like LOAD DATA INFILE, you can make use of it to dramatically reduce transaction times when it's time to put all the data back.
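
        A rough sketch of that bulk-load step from Perl, assuming a MySQL back end (the table, file and credentials here are made up; other databases have their own bulk loaders, e.g. COPY in PostgreSQL):

        use DBI;

        my ( $user, $password ) = ( 'user', 'secret' );   # placeholders

        my $dbh = DBI->connect(
            'dbi:mysql:database=results;mysql_local_infile=1',
            $user, $password, { RaiseError => 1 },
        );

        # Write each computed chunk out as a tab-separated file, then load it
        # in one statement instead of millions of individual INSERTs.
        $dbh->do(q{
            LOAD DATA LOCAL INFILE '/tmp/chunk.tsv'
            INTO TABLE results
            FIELDS TERMINATED BY '\t'
        });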

        That all said, if it were mine to do, the single word that would keep me awake at night is "cluster".

        Good luck and have fun with it!

Re: Out of Memory Error -- Possible Leak?
by choedebeck (Beadle) on Dec 13, 2005 at 23:42 UTC
    Saying the code crashes on large datasets doesn't tell us much. Without some kind of error message I can only speculate at the problem.

    With that said, I don't see an obvious error in your loop. This would lead me to believe that the problem is somewhere in the "# do great deeds" section, but without any more information it is hard to say.

      I thought I'd ruled that out by placing a next() prior to that section, though?

      Either way the specific error is "Out of Memory!".

        It seems I wrote without fully reading your code examples. I apologize.
Re: Out of Memory Error -- Possible Leak?
by TomDLux (Vicar) on Dec 14, 2005 at 18:33 UTC

    If you really need to access 10,000 x 10,000 arrays, each containing gosh-only-knows-how-many elements, perhaps you want to rethink your algorithm, and maybe implement it in C, at least the core components.

    As much as I love Perl, there are places where it is not the appropriate solution.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

Re: Out of Memory Error -- Possible Leak?
by ok (Beadle) on Dec 14, 2005 at 23:44 UTC
    When processing ginormous data sets, try to use while instead of for.

    for wants to load the entire dataset into memory but while wants to load each datum as you request it.
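
    A minimal illustration of the distinction this usually refers to, assuming the data is being read from a file rather than already sitting in an array (the filename is made up):

    # Reads the whole file into a list before the loop even starts:
    open my $fh, '<', 'data.tsv' or die "open data.tsv: $!";
    for my $line (<$fh>) {
        # process $line
    }
    close $fh;

    # Reads one line per iteration, so only the current line is in memory:
    open $fh, '<', 'data.tsv' or die "open data.tsv: $!";
    while ( my $line = <$fh> ) {
        # process $line
    }
    close $fh;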

Re: Out of Memory Error -- Possible Leak?
by Moron (Curate) on Dec 15, 2005 at 12:19 UTC
    Although arguably more perlish than for, foreach has the downside of duplicating the entire set in memory before iterating through it. Converting the foreach to a C-style for ( my $k = 0; ...; ) construction would avoid that.
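
    A sketch of that conversion on the innermost loop, with a made-up stand-in for the real array reference (which would be fetched as in graff's reply above):

    my $subarray = [ 'a' .. 'e' ];   # stand-in for the real array reference
    for ( my $k = 0; $k < scalar(@$subarray); $k++ ) {
        my $value = $subarray->[$k];
        # do great deeds with $value
    }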

    -M

    Free your mind
