
XML::Twig loves to eat my memory

by carcus88 (Acolyte)
on Jul 22, 2010 at 19:07 UTC ( [id://850892] )

carcus88 has asked for the wisdom of the Perl Monks concerning the following question:

I have about 85 XML files, 25-30MB each, that I am trying to process with XML::Twig. The problem is that I cannot get XML::Twig to release the memory it consumes, so my script quickly dies a bloated, memory-related death after a number of files. In a nutshell, this is what I am doing:

- parse the file using XML::Twig;
- get just the ID for each record;
- look up that ID in the database and do some stuff;
- process the next record until the end of the XML file, then repeat for the next file.

I ran it under the debugger and can watch the memory grow as each bit of XML is parsed. Can anyone see anything wrong with this code?
#!/usr/bin/perl -w
use strict;
use XML::Twig;
use DBI;
use DBD::Pg;
use SQL::Abstract;
use File::Copy;
use File::Basename;

my $inFile = 'data_100000_100500.xml';
die("No input file specified") if !$inFile;
die("file '$inFile' not found") if !-f $inFile;

my $dbname   = "test";
my $user     = "test";
my $password = "test";
my $host     = "";
my $port     = "5432";
my $dbh = DBI->connect( "dbi:Pg:dbname=$dbname;host=$host;port=$port",
                        $user, $password, { AutoCommit => 0 } );
my $sql = SQL::Abstract->new( quote_char => '"' );

my @missing;
my @localmissing;
my $trust = 0;
my $localcount;
my $count;
my $sth;
my $fileStartID;
my $fileEndID;
my %BIOG;

process($inFile);

if (@missing) {
    open( MISSINGFILE, ">>missing.txt" ) or die "cannot open missing.txt: $!";
    foreach my $missing (@missing) {
        print MISSINGFILE $missing . "\n";
    }
    close MISSINGFILE;
    print "\nUNVERIFIED see missing.txt for missing records.\n";
}
else {
    print "\nVerified 100%\n";
}
exit 0;

#
# Process the file
#
sub process {
    %BIOG = ();
    $inFile =~ /data_(\d+)_(\d+)/;
    $fileStartID = $1;
    $fileEndID   = $2;
    $localcount  = 0;
    print "Processing file " . $inFile . "\t";

    my $t = XML::Twig->new( TwigHandlers => { BIOG => \&BIOG } );
    $t->parsefile($inFile);
    $t->dispose();    # Try to free memory, but this does not work...

    if (@localmissing) {
        push( @missing, @localmissing );
        my $missing = @localmissing;
        print "Missing $missing/$localcount \n";
    }
    else {
        print "Verified 100%\n";
        my $folder = dirname($inFile);
        $folder =~ s/data_done/data_verified/;
        move( $inFile, $folder . '/' . basename($inFile) );
    }
}

#
# BIOG is the XML element we are triggering on
#
sub BIOG {
    my ( $t, $BIOG ) = @_;
    ++$localcount;
    if ( !checkBiog( $BIOG->field('BIOG_NBR') ) ) {
        push( @localmissing, $BIOG->field('BIOG_NBR') );
    }
    $t->purge();    # Tell XML::Twig to dispose of the parts of the tree we don't care about
    return 1;
}

#
# Check the database for the ID
#
sub checkBiog {
    my ($biog) = @_;
    if ( !%BIOG ) {
        my %where = (
            BIOG_NBR => { -between => [ $fileStartID, $fileEndID ] },
        );
        my ( $stmt, @bind ) = $sql->select( 'BIOG', '"BIOG_NBR"', \%where );
        if ( !$sth ) {
            $sth = $dbh->prepare($stmt);
        }
        my $result = $sth->execute(@bind);
        while ( my $data = $sth->fetchrow_hashref() ) {
            $BIOG{ $data->{BIOG_NBR} } = 1;
        }
    }
    return defined( $BIOG{$biog} ) ? 1 : 0;
}

Replies are listed 'Best First'.
Re: XML::Twig loves to eat my memory
by almut (Canon) on Jul 22, 2010 at 20:02 UTC

    To narrow down the issue, I would remove (comment out) everything not directly related to XML::Twig and implement a dummy handler for BIOG.

    If the problem disappears, add the original functionality back step by step until memory leaks again...
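    A minimal sketch of that isolation test might look like this. The element name BIOG is taken from the original script; the inline XML document is invented for illustration (against the real data you would call parsefile on one of the 25-30MB files and watch the process size):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Dummy handler: count records and purge, nothing else. If memory
# still grows with this, the leak is in XML::Twig/perl, not the DB code.
my $count = 0;
my $t = XML::Twig->new(
    twig_handlers => {
        BIOG => sub {
            my ($twig, $elt) = @_;
            $count++;
            $twig->purge;    # discard everything parsed so far
            1;
        },
    },
);

# Tiny inline document; in the real test, use $t->parsefile($inFile).
$t->parse('<DATA><BIOG><BIOG_NBR>1</BIOG_NBR></BIOG>'
        . '<BIOG><BIOG_NBR>2</BIOG_NBR></BIOG></DATA>');
$t->dispose;
print "Handled $count BIOG records\n";
```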

      I have tried this. It's not the database consuming the memory: in the debugger I can watch the process size grow as I step through the twig parsing. Even if I cut the script down to just the XML::Twig parsing, I can never get it to release memory, even after $t->dispose() is called, which according to the docs is supposed to work.

        In that case, create a minimal example that lets you replicate the problem (maybe with programmatically generated XML files) and submit a bug report, after checking that you're using an up-to-date version of the module.
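        A sketch of such a self-contained reproducer, assuming a generated throwaway file (the file name and record count are made up; the versions printed are what a bug report would need):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Generate a throwaway XML file shaped like the real data.
my $file = 'leak_test.xml';
open my $fh, '>', $file or die "cannot write $file: $!";
print {$fh} "<DATA>\n";
print {$fh} "<BIOG><BIOG_NBR>$_</BIOG_NBR></BIOG>\n" for 1 .. 10_000;
print {$fh} "</DATA>\n";
close $fh;

# Report the versions a bug report would need.
print "XML::Twig $XML::Twig::VERSION on perl $]\n";

# Parse it the same way the original script does, then clean up.
my $seen = 0;
my $t = XML::Twig->new(
    twig_handlers => { BIOG => sub { $seen++; $_[0]->purge; 1 } },
);
$t->parsefile($file);
$t->dispose;
unlink $file;
print "Parsed $seen records\n";
```

        Running this in a loop while watching RSS should show whether the growth is reproducible without the database or the real data.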

Re: XML::Twig loves to eat my memory
by intel (Beadle) on Jul 22, 2010 at 19:27 UTC
    I think you need an

        $sth->finish();

    and probably a

      $sth->finish() has no effect here, since $sth->fetchrow_hashref() is called in a while loop: when the statement handle reaches the end of the result set, it effectively calls $sth->finish() itself. The DBI docs back this up.
Re: XML::Twig loves to eat my memory
by AndyZaft (Hermit) on Jul 22, 2010 at 20:17 UTC
    I'm sure mirod will tell you sooner or later what you can do to make it better. Maybe it just needs another purge/flush before the dispose, to tell the twig that it is finished with the root element too.
      Something else that just occurred to me: in this case it might be beneficial to use twig_roots, since you are only interested in one field anyway. It probably won't help much with the memory issue in the end, but it might crash later :)
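      A hedged sketch of the twig_roots variant (the inline XML is invented for illustration): with twig_roots, XML::Twig only builds trees for the listed elements, so everything outside BIOG is never kept in memory at all.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my @nbrs;
my $t = XML::Twig->new(
    # Only BIOG elements are built as trees; all other content
    # is skipped at parse time instead of being kept in memory.
    twig_roots => {
        BIOG => sub {
            my ($twig, $elt) = @_;
            push @nbrs, $elt->field('BIOG_NBR');
            $twig->purge;    # drop the element once we have its ID
            1;
        },
    },
);
$t->parse('<DATA><NOISE>never built as a tree</NOISE>'
        . '<BIOG><BIOG_NBR>42</BIOG_NBR></BIOG></DATA>');
$t->dispose;
print "IDs: @nbrs\n";
```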
        twig_roots does prove most useful when you only want one little bit of data: a single file now consumes only 19MB as opposed to 166MB. However, it still does not release the memory when I call $t->dispose(), and it does not solve my problem for scripts that have to process all of the data.
Re: XML::Twig loves to eat my memory
by ahmad (Hermit) on Jul 22, 2010 at 22:09 UTC

    I'm not an expert with XML::Twig (I usually use XML::Simple), but reading the docs I think you might need to use the flush method too, not only dispose.

    You might need something like $BIOG->flush; inside your BIOG subroutine

Re: XML::Twig loves to eat my memory
by mirod (Canon) on Jul 26, 2010 at 09:50 UTC

    The problem is due to a bug in perl 5.10.0, maybe RT 56908.

    In any case it is fixed in 5.10.1 and above, and nothing can be done to fix it in 5.10.0. Can you update your version of Perl?
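    Since the fix is version-dependent, a small guard at the top of the script could refuse to run under the buggy release. This is just a sketch; the cutoff compares $], which is 5.010000 for perl 5.10.0 and 5.010001 for 5.10.1:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $^V holds the version as a v-string; $] holds it as a number.
printf "running perl %vd (\$] = %s)\n", $^V, $];

# Refuse to run under the 5.10.0 release known to leak here.
die "perl 5.10.0 leaks memory with XML::Twig; upgrade to 5.10.1+\n"
    if $] == 5.010000;
```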

      Upgraded to perl 5.10.1, and the problem is fixed. No more leaking memory. Thanks mirod :)

Node Type: perlquestion [id://850892]
Approved by toolic