
memory issues

by asqwerty (Acolyte)
on Jan 28, 2013 at 08:53 UTC (#1015642=perlquestion)
asqwerty has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks

I'm trying to do a simple meta-analysis on a few databases. The format of the DBs is this:

    CHR1 CHR2 SNP1      SNP2      OR_INT STAT     P
    17   18   rs9912311 rs9965425 0.9307 0.06328  0.8014
    17   18   rs9912311 rs9963148 0.9307 0.06328  0.8014
    17   18   rs9912311 rs9959874 0.9668 0.01788  0.8936
    17   18   rs9912311 rs1893506 1.091  0.07564  0.7833
    17   18   rs9912101 rs9965425 0.9003 0.1249   0.7238
    17   18   rs9912101 rs9963148 0.9003 0.1249   0.7238
    17   18   rs9912101 rs9959874 0.9507 0.0376   0.8462
    17   18   rs9912101 rs1893506 1.029  0.007849 0.9294
    17   18   rs9905581 rs9965425 0.9003 0.1249   0.7238

I have 5 DBs with around 30k lines each. So I wrote these lines:

    use strict;
    use warnings;
    use File::Slurp qw(read_file);
    use Math::CDF qw(qnorm pnorm);
    use List::MoreUtils qw(uniq);

    my $ofile = "meta1.txt";
    my @ifiles = @ARGV;
    my %ipairs;
    my @lpairs;

    foreach my $ifile (@ifiles){
        (my $fk) = $ifile =~ /^(.*)\_sets.*/;
        my %ldata = reverse
            map  {/^(.*(rs\d{1,20}\s+rs\d{1,20}).*)$/}
            grep {/.*rs\d{1,20}\s+rs\d{1,20}.*/}
            read_file $ifile;
        foreach my $dline (sort keys %ldata){
            push @lpairs, $dline;
            ($ipairs{$fk}{$dline}{'head'},
             $ipairs{$fk}{$dline}{'effect'},
             $ipairs{$fk}{$dline}{'pvalue'})
                = $ldata{$dline} =~ /^(.*)\s+(\d\.\d+)\s+\d\.\d+\s+(\d\.\d+)$/;
        }
    }
    @lpairs = uniq @lpairs;

    open OF, ">$ofile";
    my $head = "CHR1 CHR2 SNP1 SNP2 P N";
    print OF "$head\n";
    foreach my $pair (@lpairs) {
        my $n = 0;
        my $z = 0;
        my $hl;
        my $pvalue = 0;
        my $fk;
        foreach $fk (%ipairs) {
            if($ipairs{$fk}{$pair}{'pvalue'}){
                unless($hl){
                    $hl = $ipairs{$fk}{$pair}{'head'};
                }
                $n++;
                $z+= qnorm($ipairs{$fk}{$pair}{'pvalue'})
            }
        }
        if($n>2){
            $z = $z/sqrt($n);
            $pvalue = pnorm($z);
        }
        if ($pvalue) {
            #printf "$pair -> %.4f\n", $pvalue;
            printf OF "$hl %.4f $n\n", $pvalue;
        }
    }
    close OF;

Actually, the program works fine. However, my problem is that it steadily consumes more memory until it reaches 32Mb. Eventually the system kills the job, so my program never finishes.

So, I have two questions.

Why is this happening? The heavy memory consumption begins after all the info is already loaded in the hash. In other words, it happens in the loop where the calculations take place and the results are written to disk.

Is there any workaround for this problem? I was thinking of writing intermediate results to disk, but I'm not yet sure how to do it.

Replies are listed 'Best First'.
Re: memory issues
by BrowserUk (Pope) on Jan 28, 2013 at 09:13 UTC

    The problem is that this line:

        if($ipairs{$fk}{$pair}{'pvalue'}){

    Rather than just testing whether that value exists, it autovivifies (creates) the nested hashes leading to it whenever they are missing, so every failed lookup leaves a new empty entry behind.

    If you change that line to:

    if( exists $ipairs{$fk} && exists $ipairs{$fk}{$pair} && exists $ipairs{$fk}{$pair}{'pvalue'} ){

    It should prevent the runaway memory growth. As a nice side-effect, your program should run substantially faster also.

    BTW. I assume you mean 32GB not 32MB?
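
The autovivification described in this reply can be reproduced in a few lines. The following is a minimal sketch, not the poster's program: the `fileA` key and the `rs...` pair names are made up for illustration. It counts the pairs stored under one file key before a lookup, after a plain nested read of a missing pair, and after a guarded read of another missing pair.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the poster's %ipairs: one file key, one known pair.
my %ipairs = ( fileA => { 'rs1 rs2' => { pvalue => 0.8014 } } );
my $before = scalar keys %{ $ipairs{fileA} };

# Plain nested read of a missing pair: the test is false, but Perl creates
# $ipairs{fileA}{'rs3 rs4'} as an empty hash on the way down.
my $found_plain = $ipairs{fileA}{'rs3 rs4'}{pvalue} ? 1 : 0;
my $after_plain = scalar keys %{ $ipairs{fileA} };

# Guarded read: each level is checked with exists before descending,
# so a missing pair leaves the structure untouched.
my $found_guarded =
    (  exists $ipairs{fileA}
    && exists $ipairs{fileA}{'rs5 rs6'}
    && exists $ipairs{fileA}{'rs5 rs6'}{pvalue} ) ? 1 : 0;
my $after_guarded = scalar keys %{ $ipairs{fileA} };

print "pairs before any lookup:  $before\n";
print "pairs after plain read:   $after_plain\n";
print "pairs after guarded read: $after_guarded\n";
```

The plain read adds one entry even though nothing was found; the guarded read adds none. Repeated over every pair and every file key, those empty entries are what makes the structure grow during the calculation loop.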

      Thanks!!!! It works fine now and, as you said, substantially faster also. This is what I did,
          foreach my $pair (@lpairs) {
              my $n = 0;
              my $z = 0;
              my $hl;
              my $pvalue = 0;
              my $pvt = 0;
              my $fk;
              foreach $fk (%ipairs) {
                  if( exists $ipairs{$fk} && exists $ipairs{$fk}{$pair} && exists $ipairs{$fk}{$pair}{'pvalue'}){
                  #if($ipairs{$fk}{$pair}{'pvalue'}){
                      $pvt = $ipairs{$fk}{$pair}{'pvalue'};
                      if($pvt){
                          unless($hl){
                              $hl = $ipairs{$fk}{$pair}{'head'};
                          }
                          $n++;
                          $z+= qnorm($ipairs{$fk}{$pair}{'pvalue'});
                      }
                  }
              }
              if($n>2){
                  $z = $z/sqrt($n);
                  $pvalue = pnorm($z);
              }
              if ($pvalue) {
                  #printf "$pair -> %.4f\n", $pvalue;
                  printf OF "$hl %.4f $n\n", $pvalue;
              }
          }
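
A side note on the inner loop, offered as an observation rather than something raised in the thread: `foreach $fk (%ipairs)` flattens the hash into a list of keys and values, so the loop body also runs once per stored hash reference. With the `exists` guard those extra iterations find nothing, but `keys %ipairs` is probably what was intended. A minimal sketch with made-up file keys:

```perl
use strict;
use warnings;

# Hypothetical two-file structure; the inner hashes are left empty on purpose.
my %ipairs = ( fileA => {}, fileB => {} );

# Iterating the hash itself yields every key AND every value.
my @visited_flat;
foreach my $fk (%ipairs) { push @visited_flat, $fk; }

# Iterating keys %ipairs yields only the file keys.
my @visited_keys;
foreach my $fk (keys %ipairs) { push @visited_keys, $fk; }

# Count how many of the flattened items are references (the values).
my $refs_seen = grep { ref($_) } @visited_flat;

print "items visited via %ipairs:      ", scalar(@visited_flat), "\n";
print "of which are hash references:   $refs_seen\n";
print "items visited via keys %ipairs: ", scalar(@visited_keys), "\n";
```

When a hash reference is used as a key it is stringified (something like `HASH(0x...)`), which is why the unguarded version of the loop could also create brand-new top-level entries.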

        Just a small hint to reduce the verboseness of your code:

            ($ipairs{$fk}{$dline}{'head'}, $ipairs{$fk}{$dline}{'effect'}, $ipairs{$fk}{$dline}{'pvalue'})
                = $ldata{$dline} =~ /^(.*)\s+(\d\.\d+)\s+\d\.\d+\s+(\d\.\d+)$/;

        can be rewritten using a hash slice like this:

            ( @{ $ipairs{$fk}{$dline} }{qw/head effect pvalue/} )
                = $ldata{$dline} =~ /^(.*)\s+(\d\.\d+)\s+\d\.\d+\s+(\d\.\d+)$/;

        Otherwise, your code benefits from a temporary variable or two. Here I repurpose $pvt (not sure if the variable name makes sense after that):

            $pvt = $ipairs{$fk}{$pair};
            if($pvt->{'pvalue'}){
                unless($hl){
                    $hl = $pvt->{'head'};
                }
                $n++;
                $z+= qnorm($pvt->{'pvalue'});
            }

        (This only works because there already exists a hash reference at $ipairs{$fk}{$pair}. It would not work if you tried to say $pvt = {}, but %$pvt = () would. It's all reference magic and not really easy to explain unless you first understand pointers.)
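
The difference between `%$pvt = ()` and `$pvt = {}` mentioned in the parenthesis above may be easier to see in a small experiment. This is a sketch with invented data (the `fileA` key and pair name are placeholders), showing that the temporary variable holds a copy of the reference, not a copy of the hash:

```perl
use strict;
use warnings;

# Hypothetical record in the same shape as the poster's %ipairs.
my %ipairs = (
    fileA => { 'rs1 rs2' => { head => '17 18 rs1 rs2', pvalue => 0.8014 } },
);

# $pvt points at the same inner hash that %ipairs holds.
my $pvt = $ipairs{fileA}{'rs1 rs2'};
$pvt->{pvalue} = 0.5;
my $seen_through_ipairs = $ipairs{fileA}{'rs1 rs2'}{pvalue};

# Emptying the hash THROUGH the reference changes the shared hash.
%$pvt = ();
my $keys_after_clear = scalar keys %{ $ipairs{fileA}{'rs1 rs2'} };

# Assigning a new anonymous hash only repoints $pvt; %ipairs is not touched.
$pvt = { pvalue => 0.9 };
my $keys_after_repoint = scalar keys %{ $ipairs{fileA}{'rs1 rs2'} };

print "pvalue seen through %ipairs: $seen_through_ipairs\n";
print "keys after %\$pvt = ():       $keys_after_clear\n";
print "keys after \$pvt = {...}:     $keys_after_repoint\n";
```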

        (Of course, the hash slice can be rewritten using a temporary variable, too. It's always a good idea to use temporary variables if it makes your code easier to understand. Triply a good idea if it reduces repetition.)
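
As one possible reading of that last remark, the hash slice and a temporary variable can be combined. In this sketch `$rec` is a hypothetical name, the sample line is taken from the data shown in the question, and the `fileA` key is made up; the regex is the one from the original code.

```perl
use strict;
use warnings;

my %ipairs;
my ($fk, $dline) = ('fileA', 'rs9912311 rs9965425');
my $line = '17 18 rs9912311 rs9965425 0.9307 0.06328 0.8014';

# Temporary reference to the inner record; ||= creates the hash once
# and stores it in %ipairs, so later writes through $rec are shared.
my $rec = $ipairs{$fk}{$dline} ||= {};

# Hash slice through the reference: three fields filled by one match.
@{$rec}{qw/head effect pvalue/} =
    $line =~ /^(.*)\s+(\d\.\d+)\s+\d\.\d+\s+(\d\.\d+)$/;

print "head:   $rec->{head}\n";
print "effect: $rec->{effect}\n";
print "pvalue: $rec->{pvalue}\n";
```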

      And you are right. It is 32 GB. :-)
Re: memory issues
by Anonymous Monk on Jan 28, 2013 at 09:08 UTC

    Why is this happening?

    You wrote it that way, you're storing that much data in memory


    $ perl -MDevel::Size=:all -le " @F = 1 .. (5 * 30 * 1024 ); @F{@F}=@F; print total_size($_) for \@F, \%F "
    6758436
    13225503

    5 files, 30k lines each, stored in an array and in a hash: about 6.5 MiB and 13 MiB respectively (about 19 MiB combined)

    You actually store three times as much data, mostly duplicated

    The solution: store less data, store data on disk, get a better system, or lift the ulimits on your account
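
One way to read "store less data" for this particular calculation: the final statistic only needs, per SNP pair, the number of files that reported it and the running sum of the transformed p-values, so the per-file nested hash is not strictly required. The sketch below is an assumption-laden illustration, not the poster's code. `accumulate_line` is a hypothetical helper, the transform is passed in as a code reference so the example runs without Math::CDF (in the real program it would presumably be a reference to `qnorm`), and it assumes each pair appears at most once per file.

```perl
use strict;
use warnings;

# Hypothetical helper: fold one data line into per-pair running totals.
# $transform stands in for qnorm so the sketch has no non-core dependency.
sub accumulate_line {
    my ($totals, $line, $transform) = @_;
    my ($head, $pair, $p) =
        $line =~ /^\s*(.*?(rs\d+\s+rs\d+))\s+\S+\s+\S+\s+(\d\.\d+)\s*$/
        or return;                       # header or malformed line
    return unless $p > 0;                # the original loop also skips a zero p-value
    my $t = $totals->{$pair} ||= { head => $head, n => 0, zsum => 0 };
    $t->{n}++;
    $t->{zsum} += $transform->($p);
    return;
}

# Usage with made-up input; a real run would read each file line by line.
my %totals;
my @lines = (
    'CHR1 CHR2 SNP1 SNP2 OR_INT STAT P',
    '17 18 rs9912311 rs9965425 0.9307 0.06328 0.8014',
    '17 18 rs9912311 rs9965425 0.9003 0.1249 0.7238',
    '17 18 rs9912101 rs9965425 0.9003 0.1249 0.7238',
);
accumulate_line(\%totals, $_, sub { $_[0] }) for @lines;   # identity transform

for my $pair (sort keys %totals) {
    printf "%s n=%d zsum=%.4f\n", $pair, @{ $totals{$pair} }{qw/n zsum/};
}
```

With this layout memory grows with the number of distinct pairs rather than with pairs times files, and the final `zsum / sqrt(n)` step can be done in a single pass over `%totals`.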
