LukeyBoy has asked for the wisdom of the Perl Monks concerning the following question:

So I've been spending time working on yet another online bookmark repository, and mine can take snapshots of pages so that you can view a cached copy later if the original page is taken down or altered.

The kicker comes when I'm exporting and importing the user's data - I use XML::Writer to output the data, and I export the bookmarks and cache objects into an XML file. The cached objects are encoded using MIME::Base64. Exporting works like a charm...

The problem is when importing data: Perl runs out of memory! I have an XML::Parser object created with handlers that branch depending on the element type. The export format contains three main elements: "post", "object" and "relationship" (a relationship is a link between two objects). On import, relationships and posts come through fine, but Perl hits the memory problem when reading and decoding the Base64-encoded objects.

The XML::Parser fires events into my "Char" handler, where I append the characters in the current element to a scalar...

Like so:
$parser->setHandlers(Char => sub { my $expat = shift; $ocontent = $ocontent . shift; }...

The $ocontent scalar is defined in the scope of the function that instantiates and starts the parse. So I build up $ocontent with the encoded data, then finally I call decode on it and import it into the user's database. That works well for somewhere around two hundred objects, at which point Perl freaks out trying to map and unmap memory (according to strace) and stalls at full CPU usage. When I comment out the "$ocontent = $ocontent . shift;" line, I don't run out of memory. I also tried setting $ocontent just once, thinking that maybe the Base64 decode method or the MySQL DBI methods were causing the error. In that test I used the same chunk of data for all objects, and I did not run out of memory.

So all signs point to my character buffer causing a memory leak. Does anyone know how to fix or work around this? (Also you can look right at the CVS tree of my project if it helps).

Update: I figured it out! After appending the characters from XML::Parser to my string I now undef the expat character variable. Suddenly the whole script moves way faster and uses less memory. This is the new character data handling routine:

Char => sub { my $expat = shift; my $chars = shift; $cbuffer = $cbuffer . $chars; undef $chars; }
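
For readers following along, here is a minimal, self-contained sketch of the accumulate-and-reset pattern described above: the Char handler appends to a buffer, and the End handler for "object" decodes it and clears it. The sub name decode_objects, the "export" root element and the sample payload are made up for illustration and are not taken from the actual project. It only shows the shape of the handlers; it does not claim to reproduce or cure the memory problem.

```perl
use strict;
use warnings;
use XML::Parser;
use MIME::Base64 qw(encode_base64 decode_base64);

# Hypothetical helper: parse an export document and return the decoded
# payload of every <object> element it contains.
sub decode_objects {
    my ($xml) = @_;
    my @decoded;
    my $buffer    = '';
    my $in_object = 0;

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $name) = @_;
                if ($name eq 'object') {
                    $in_object = 1;
                    $buffer    = '';
                }
            },
            # May fire many times for one element, so always append.
            Char => sub {
                my ($expat, $chars) = @_;
                $buffer .= $chars if $in_object;
            },
            # Decode once the element is complete, then reset the buffer.
            End => sub {
                my ($expat, $name) = @_;
                return unless $name eq 'object';
                push @decoded, decode_base64($buffer);
                $buffer    = '';
                $in_object = 0;
            },
        },
    );
    $parser->parse($xml);
    return @decoded;
}

# Made-up sample data standing in for a cached page snapshot.
my $payload = 'cached page ' x 50;
my $xml = '<export><post>a bookmark</post><object>'
        . encode_base64($payload)
        . '</object></export>';

my @objects = decode_objects($xml);
print scalar(@objects), " object(s), ", length($objects[0]), " bytes\n";
```

The in_object flag keeps text from "post" and "relationship" elements out of the buffer, so only the Base64 data is ever decoded.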

Re: Out of memory with XML::Parser
by runrig (Abbot) on Sep 14, 2005 at 17:39 UTC
    Is the $ocontent accumulating throughout the whole parse, or is it occasionally reset, and how big does it get? Maybe XML::Twig (xmltwig.com) would help reduce memory usage (if you can periodically purge what's been parsed so far)? I can't say for sure, since I don't know enough about your code or what you're trying to do.
      It's reset at the end of each "object" element in the XML. The objects are about 50-60 kilobytes of Base64-encoded data, and my test file has 400 of these elements. The $ocontent variable is reset at the EndTag event of each object element. I'll take a look at XML::Twig...
      Twig also explodes, as soon as it hits the first large PCDATA section (the 'object' element).
        When you say "the first large" section, is that on the first object? Does it parse any objects? Are you calling purge or flush after each "object" section?
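
As a rough illustration of the purge question above, this is roughly what a per-"object" handler that purges after each element looks like with XML::Twig. The "export" root element, the sample payloads and the @sizes array are assumptions for the sake of a runnable example, not the poster's real format or code.

```perl
use strict;
use warnings;
use XML::Twig;
use MIME::Base64 qw(encode_base64 decode_base64);

my @sizes;    # hypothetical: records the decoded size of each object

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <object> element has been parsed.
        object => sub {
            my ($t, $elt) = @_;
            my $data = decode_base64($elt->text);
            push @sizes, length $data;
            # Discard everything parsed so far so the tree does not grow.
            $t->purge;
        },
    },
);

# Made-up export document with three small Base64-encoded objects.
my $xml = '<export>'
        . join('', map {
              '<object>' . encode_base64("snapshot $_ " x 100) . '</object>'
          } 1 .. 3)
        . '</export>';

$twig->parse($xml);
print "decoded sizes: @sizes\n";
```

Without the purge call the twig keeps every parsed element in memory until the end of the document, which is the situation the question is trying to rule out.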
Re: Out of memory with XML::Parser
by mirod (Canon) on Sep 14, 2005 at 19:49 UTC

    I am not sure what the bug is, and if it is related to the bug runrig pointed at. One thing you could try: trace (or use the debugger if you can) what's in the second argument of the Char handler. Is it really the string you are looking for?

      Strange, the character data being returned is waaaay too large for the object in question. The object is a base64-encoded JPEG in this case and is about 50k of data, yet the Char subroutine is fired way more times than there are lines of data in the XML for that object. I'll keep poking around.

        Now we are getting somewhere.

        Try looking at $expat->recognized_string or at $expat->original_string, see if they have what you are looking for on the first call to the Char handler.

        Is the data in regular (PCDATA) text, or is it in a CDATA section?
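
One way to do the tracing suggested above is to log, for every Char call, how long the argument is next to what recognized_string and original_string report. The sketch below does that on a tiny made-up document; the @calls array and the sample XML are illustrative only. Note that expat is free to deliver one run of text in several Char calls, so a call count higher than the line count is not by itself proof of a bug.

```perl
use strict;
use warnings;
use XML::Parser;

my @calls;    # hypothetical trace log, one entry per Char call

my $parser = XML::Parser->new(
    Handlers => {
        Char => sub {
            my ($expat, $chars) = @_;
            push @calls, {
                element    => $expat->current_element,
                length     => length $chars,
                recognized => length $expat->recognized_string,
                original   => length $expat->original_string,
            };
        },
    },
);

# Made-up sample: two short lines of Base64 inside one <object>.
$parser->parse("<export><object>QUJD\nREVG\n</object></export>");

my $total = 0;
for my $c (@calls) {
    $total += $c->{length};
    printf "in <%s>: arg=%d recognized=%d original=%d\n",
        @{$c}{qw(element length recognized original)};
}
printf "%d Char call(s), %d characters in total\n", scalar @calls, $total;
```

If the total for a single object comes out far larger than the element's actual text, the extra data is arriving through the handler argument rather than through the buffer concatenation.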