Strange memory growth

by spica1001 (Initiate)
on Feb 14, 2018 at 18:02 UTC (#1209158=perlquestion)
spica1001 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all. Here's an odd one, to me anyway. I'm running a script to process very large XML files with embedded JSON. When I run it, the memory increases indefinitely (the file can be 100s of GB and the RAM usage reaches 25GB+). Boiling it down, I reach the below. If I remove the line marked "## THIS LINE", the memory remains static, but leave it in and it increases again. Adding in a load of undefs seems to make no difference. It's evidently leaving the hash array around, but I can't see how to 'free' it.

Why would accessing a non-existent hash value cause that, or of course even better how do I prevent it?! Thanks for any help...

    use JSON;
    open(IN,"<:utf8","$ARGV[0]");
    while(<IN>) {
        if (m!^\s+<text.*?>({[^\{\|].+})</text>!) {
            my $jt = $+;
            $jt=~s/\&quot;/\"/g;
            my $json = new JSON;
            my $jp = $json->allow_nonref->utf8->relaxed->decode($jt);
            my $c = $jp->{'claims'};
            # "claims":{"P31":[{"mainsnak":{"snaktype":"value","property":"P31","hash":"...","datavalue":{"value":{"entity-type":"item","numeric-id":5},...}... }...}],
            if (ref($c) eq 'HASH') {
                foreach my $ch (keys %$c) {
                    if (ref($c->{$ch}) eq 'ARRAY') {
                        foreach my $cg (@{$c->{$ch}}) {
                            if (defined $cg->{'mainsnak'}->{'datavalue'}->{'value'}->{'notexist'}) {} ## THIS LINE
                        }
                    }
                }
            }
        }
    }


UPDATE: Thanks to all the replies, I've worked it out now.

Points should go to tinita as the suggestion of use strict pointed me in the right direction. It turned out that, in some of the lines that my script reads, the value $cg->{'mainsnak'}->{'datavalue'}->{'value'} is a string, not a hash. It seems treating a string as a hash causes the memory growth. I fixed it with:

    if (ref($cg->{'mainsnak'}->{'datavalue'}->{'value'}) eq 'HASH'
        && defined $cg->{'mainsnak'}->{'datavalue'}->{'value'}->{'notexist'}) {}

(Of course my code does plenty else besides this, but the principle of needing to check that a variable is indeed a hash before checking for a key is the main takeaway here.)
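One likely explanation for the growth, which the thread itself does not spell out: without `use strict`, dereferencing a plain string as a hash is a symbolic reference, so Perl looks up (and creates) a package hash named after that string. Every distinct string then leaves an entry behind in the symbol table. The sketch below illustrates this under that assumption; the sample value "Q42" and the variable names are made up for the example.

```perl
use strict;
use warnings;

my $value = "Q42";    # assumed sample: a string where a hash ref was expected

my ($before, $after);
{
    no strict 'refs';
    $before = exists $main::{Q42} ? 1 : 0;
    # Symbolic reference: this consults the package hash %main::Q42,
    # creating its symbol table entry as a side effect.
    my $found = defined $value->{notexist};
    $after = exists $main::{Q42} ? 1 : 0;
}
print "symbol table entry before: $before, after: $after\n";

# With strict refs in effect, the same expression is a fatal error instead.
my $error = '';
eval { my $found = defined $value->{notexist}; 1 } or $error = $@;
print "with strict refs: $error";
```

This is also why adding `use strict` pointed at the problem: the silent side effect becomes a "Can't use string ... as a HASH ref" error at the offending line.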

Replies are listed 'Best First'.
Re: Strange memory growth
by ikegami (Pope) on Feb 14, 2018 at 18:26 UTC

    In lvalue context,

    $var->{$key}

    is equivalent to

    ( $var //= {} )->{$key}

    so

    defined( $cg->{mainsnak}{datavalue}{value}{notexist} )

    is equivalent to

    defined( ( ( ( ( $cg //= {} )->{mainsnak} //= {} )->{datavalue} //= {} )->{value} //= {} )->{notexist} )

    That could potentially create a lot of new hashes.

    Solution 1:

    if ( $cg
      && $cg->{mainsnak}
      && $cg->{mainsnak}{datavalue}
      && $cg->{mainsnak}{datavalue}{value}
      && defined( $cg->{mainsnak}{datavalue}{value}{notexist} ) ) { ... }

    Solution 2:

    no autovivification;
    if (defined( $cg->{mainsnak}{datavalue}{value}{notexist} )) { ... }
      That could potentially create a lot of new hashes.
      But only three, and $cg (and $jp) fall out of scope after every iteration. I can't see why this would continually increase memory.

        Indeed. As such, I find it hard to believe the line you identified has anything to do with a memory leak. Is that the program you actually ran?
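The point under discussion here is easy to check on a small structure. A minimal sketch, assuming an empty hash ref stands in for one element of the claims array (the names `$plain` and `$guarded` are hypothetical): the unguarded chain vivifies the three intermediate levels, while the step-by-step guard from Solution 1 leaves the structure untouched.

```perl
use strict;
use warnings;

# Assumed stand-ins for one element of the claims array.
my $plain   = {};
my $guarded = {};

# Unguarded chain: every level except the last is dereferenced,
# so the intermediate hashes spring into existence.
my $d1 = defined $plain->{mainsnak}{datavalue}{value}{notexist};

# Guarded chain (as in Solution 1): stops at the first false level.
my $d2 = $guarded->{mainsnak}
      && $guarded->{mainsnak}{datavalue}
      && $guarded->{mainsnak}{datavalue}{value}
      && defined $guarded->{mainsnak}{datavalue}{value}{notexist};

printf "unguarded: %d top-level key(s)\n", scalar keys %$plain;
printf "guarded:   %d top-level key(s)\n", scalar keys %$guarded;
```

Either way the extra hashes hang off `$cg`, so they are released with the decoded structure at the end of each iteration, which is consistent with the doubt expressed above that autovivification alone explains unbounded growth.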

Re: Strange memory growth
by huck (Vicar) on Feb 14, 2018 at 18:28 UTC

      Thanks all for the excellent responses. To answer the various suggestions and questions:

      Yes, I posted the exact script that I'm running. Perl version 5.22.1, Ubuntu 16.04.

      Trying huck's or ikegami's suggestions didn't change things, unfortunately.

      Running on the same line repeatedly shows no memory growth, which I can't explain. It's thus possible it's only certain lines of the input file that cause the problem but I'm struggling to see how to identify which.

      Thus it's difficult for me to post data on which I see the problem. The file I'm reading in is the Wikidata data dump (33GB bzipped) (I usually read it in with IO::Uncompress::Bunzip2 but verified that the issue still occurs when unzipped by taking the first 2GB of that file unzipped)

      I'll work on trying to get Data::Diver working and identify problematic lines of the input file.

Re: Strange memory growth
by tinita (Parson) on Feb 14, 2018 at 18:42 UTC
    Which perl version are you using, which OS?
    I don't see a specific problem with the code. It should have a use strict; at the beginning, though.
    How long is a typical JSON string?
    Can you create a sample input that will show memory increase? For example, instead of reading in the file, just loop over the same line a couple of 1000 times and watch memory consumption.
    The problem the others are mentioning doesn't look like a problem to me since $jp falls out of scope after every loop iteration.
    Memory stays the same for me if I run this 500,000 times on an example line.

    edit: specifically, how many array elements does a typical structure have? my assumption was that you have many lines, but maybe your lines/JSON strings are very big?
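The "loop over the same line and watch memory" test can be scripted rather than eyeballed. This is only a sketch under stated assumptions: it reads `VmRSS` from `/proc/self/status`, so it works on Linux only, and `rss_kb`, `process_line`, and the sample line are hypothetical stand-ins for the real per-line work.

```perl
use strict;
use warnings;

# Hypothetical helper: resident set size in kB, read from /proc (Linux only).
# Returns nothing if the file or the field is unavailable.
sub rss_kb {
    open(my $fh, '<', '/proc/self/status') or return;
    while (my $line = <$fh>) {
        return $1 if $line =~ /^VmRSS:\s+(\d+)\s+kB/;
    }
    return;
}

# Hypothetical stand-in for the per-line work in the original loop.
sub process_line {
    my ($line) = @_;
    my %tmp = (length => length $line);
    return $tmp{length};
}

my $sample = '<text>{"claims":{}}</text>';    # assumed sample line
for my $i (1 .. 100_000) {
    process_line($sample);
    if ($i % 20_000 == 0) {
        my $rss = rss_kb();
        printf "%6d iterations: %s kB\n", $i, defined $rss ? $rss : 'n/a';
    }
}
```

Feeding it real input lines instead of one repeated sample would show whether the growth depends on the data, which is what the original poster later found.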
      I'm pretty sure that auto-vivification is the root-cause problem here, because a chain of hashrefs was used in a single expression, causing unwanted AV to occur from left to right. And I am also pretty sure that huck had the best answer.
Re: Strange memory growth
by Marshall (Abbot) on Feb 15, 2018 at 06:36 UTC
    I updated my post with some more test cases. defined and exists behave similarly in regards to generating intermediate keys in the process of doing their work. I show an example of how to prevent this below.

    I don't know if this helps or not, but checking with defined or exists can generate hash entries.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    my %hash = ('key' => 'value');
    print Dumper \%hash;
    #prints:
    #$VAR1 = {
    #          'key' => 'value'
    #        };

    print "defined\n" if defined $hash{abc}{xyz};  # will create new keys
    print "defined\n" if defined ($hash{x}) and defined ($hash{x}{y}); # added: won't create new keys
    print "exists\n" if exists ($hash{x}) and exists($hash{x}{y}); # added: won't create new keys
    print "exists\n" if exists $hash{x1}{x2}; # added: will create new keys
    print "defined\n" if defined $hash{x3}; # added: no new key

    print Dumper \%hash;
    __END__
    #prints:
    # The "abc" key is created due to the check for defined of 2nd dimension
    # exists will also create new intermediate keys
    $VAR1 = {
              'key' => 'value',
              'abc' => {},
              'x1' => {}
            };
    Sometimes you have to check at each level if the hash "dimension" exists. A check for "defined" or "exists" can generate automatic dimensions.
    Sometimes you have to check whether some hash key exists at all before checking what its value is (defined or not).
      So are you saying that Huck's post should have used exists, instead of defined? To be absolutely clear, can you post a quick example similar to Huck's? Auto-vivification is obviously the root cause problem here ... but are you saying that defined will auto-viv too? (That does seem counter-intuitive, but ...)

        No difference between exists and defined in this regard. In both cases, the *intermediate* levels will spring into existence. The last level isn't dereferenced as a hash.

        If you have perl installed, you can try it out yourself:

        $ perl -e 'use Data::Dump; exists $a->{foo}{bar}; dd $a;'
        { foo => {} }
        $ perl -e 'use Data::Dump; defined $a->{foo}{bar}; dd $a;'
        { foo => {} }

Re: Strange memory growth (Data::Diver)
by Anonymous Monk on Feb 17, 2018 at 20:46 UTC

    The classic solution is to use Data::Diver:

    use Data::Diver qw/ Dive /;
    ## if (defined $cg->{'mainsnak'}->{'datavalue'}->{'value'}->{'notexist'}) {} ## THIS LINE
    if ( defined Dive( $cg, qw' mainsnak datavalue value notexist ' ) ) { }


    $ perl -e "use Data::Diver qw/Dive/; $f{a}{b}{c}=666; print Dive(\%f, qw/a b c Q/); print Dive(\%f, qw/a b c/); "
    666
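If installing a module is not an option, the same idea can be sketched in a few lines of plain Perl. This is not Data::Diver's implementation, only a hypothetical `safe_dive` helper that handles hashes alone: it checks at each step that the current node really is a hash ref before looking up the next key, so it neither autovivifies nor treats a string as a hash.

```perl
use strict;
use warnings;

# Hypothetical helper: walk nested hashes one key at a time.
# Gives up (returns nothing) as soon as a level is not a hash ref
# or does not contain the requested key.
sub safe_dive {
    my ($node, @keys) = @_;
    for my $key (@keys) {
        return unless ref($node) eq 'HASH' && exists $node->{$key};
        $node = $node->{$key};
    }
    return $node;
}

# Assumed sample: 'value' holds a string rather than a hash.
my $cg  = { mainsnak => { datavalue => { value => "a plain string" } } };
my $hit = safe_dive($cg, qw(mainsnak datavalue value notexist));
print defined $hit ? "defined\n" : "not defined\n";

# Nothing is created when the path is missing entirely.
my $empty = {};
my $miss  = safe_dive($empty, qw(mainsnak datavalue value notexist));
printf "keys created: %d\n", scalar keys %$empty;
```

The `ref(...) eq 'HASH'` test is the same check the original poster ended up adding by hand, applied uniformly at every level.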
