Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

Update
The problem described in this post has been fixed in XML::Parser 2.32.

(Using Perl 5.8.0 and XML::Parser 2.31)

Have you ever noticed that many scripts that use XML::Parser leak memory? If you use those scripts in a long lived environment like mod_perl, you may have.

XML::Parser's main method is parse. The method takes as its first argument either a complete XML document or a filehandle. This flexibility is very expensive if not handled correctly.

After migrating a legacy web application to mod_perl (using Apache::PerlRun because the code is not at all strict-compliant), I noticed that Apache's processes used a lot of memory.

Hoping to find out where all that memory went, I used Devel::Symdump. It very quickly became one of my favourite modules. Devel::Symdump->rnew->as_string is teriffic. I was going to dump the contents of every scalar, every array and every hash, but I let it dump the entire symbol table first, to find out if Devel::Symdump actually worked properly.

I didn't even have to print the contents of scalars, arrays or hashes. The simple as_string dump told me that entire XML documents were used as symbols!

In case you don't know what a symbolic reference is, or how you can use an XML document as a symbol, have a look at this code:

no strict; $foo = "Hello, world!\n"; $bar = "foo"; print $$bar;
In $$bar, $bar is called a symbolic reference because it refers by the symbol ("foo") of another variable.

Because my old code uses symbolic references, I traced every single occurrence and added debugging print statements. To no avail. But Devel::Symdump's diff method told me that the symbols were created during the parsing.

Taking everything out, I ended up with this very small test script:

use strict; use Devel::Symdump; use XML::Parser; my $parser = XML::Parser->new Handlers => { Start => sub { }, End => sub { }, Char => sub { }, XMLDecl => sub { }, } ); my $wml = `cat test.wml`; my $dump = Devel::Symdump->rnew; $parser->parse($wml); print $dump->diff(Devel::Symdump->rnew);

Even in this small script, where no symbolic reference is used, the entire XML document was used as a symbol.

So the problem had to be somewhere in XML::Parser itself, which was both disappointing and frustrating. Debugger to the rescue! So I used perl -d and traced the code, dumping a lot of diffs.

Somewhere in XML::Parser::Expat lives a parse method, which takes an XML document or a filehandle as its first argument. The problem is with how it finds out if $arg is a file handle.

XML::Parser::Expat assumes that $arg is a filehandle if it is-a IO::Handle, which makes sense. If $arg is tied to a class that has a TIEHANDLE method, it is also assumed to be a file handle. By the way, I think that part of the code doesn't work, because it uses my $class = ref($arg);, while ref(tied($arg)) should have been used.

As a fallback method, it tries to see if *{$arg}{IO} exists.

Oops! My $arg is an XML document, not a filehandle! It is used as a symbolic reference anyway, and since a symbol table is a hash, the symbol is autovivificated (automatically created because it did not exist). Package global variables have no end of scope, so the XML document remains in memory for the rest of the interpreter's life. And with something that handles a lot of documents, that means a lot of memory is wasted.

The solution is quite easy: don't allow an arbitrary string to be used as a symbol. Ever. But that would probably break some existing applications. Even though being able to use a string as a filehandle is a possible security bug, I chose to not break backwards compatibility.

If you use *{$arg}, the symbol is autovivificated. If you only test if it is defined, it is not. This means that adding if defined *{$arg} would fix the memory leak. Note: Not if defined *{$arg}{IO}, because that first autovivifies *{$arg} to be able to look at its IO element!

Line 456 of my XML/Parser/Expat.pm now reads:

$ioref = *{$arg}{IO} if defined *{$arg};
Instead of:
$ioref = *{$arg}{IO};

I tried to report this bug, along with a patch, to the maintainer, but his e-mail address appears to be no longer in use. If anyone knows how to contact Clark Cooper, please send him this page so the problem can be fixed in the real distribution.

Avoiding problems like this is easy:

  • Don't allow two different things to be used in the same way: use separate methods (parse_string and parse_filehandle) or use named arguments (parse(string => ...) and parse(fh => ...)).
  • Don't allow a string to be used as a filehandle. (parse(\*FOO) instead of parse('FOO') or even parse(FOO))

My Apache processes are now approximately 15 megabytes in size of which 10 are shared. Much better :)

Juerd
- http://juerd.nl/
- spamcollector_perlmonks@juerd.nl (do not use).


In reply to (XML::Parser) Finding and fixing a bug by Juerd

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others drinking their drinks and smoking their pipes about the Monastery: (10)
    As of 2014-12-25 00:26 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      Is guessing a good strategy for surviving in the IT business?





      Results (159 votes), past polls