Come for the quick hacks, stay for the epiphanies. | |
PerlMonks |
(XML::Parser) Finding and fixing a bugby Juerd (Abbot) |
on Apr 19, 2003 at 11:03 UTC ( [id://251630]=perlmeditation: print w/replies, xml ) | Need Help?? |
Update (Using Perl 5.8.0 and XML::Parser 2.31) Have you ever noticed that many scripts that use XML::Parser leak memory? If you use those scripts in a long lived environment like mod_perl, you may have. XML::Parser's main method is parse. The method takes as its first argument either a complete XML document or a filehandle. This flexibility is very expensive if not handled correctly. After migrating a legacy web application to mod_perl (using Apache::PerlRun because the code is not at all strict-compliant), I noticed that Apache's processes used a lot of memory. Hoping to find out where all that memory went, I used Devel::Symdump. It very quickly became one of my favourite modules. Devel::Symdump->rnew->as_string is teriffic. I was going to dump the contents of every scalar, every array and every hash, but I let it dump the entire symbol table first, to find out if Devel::Symdump actually worked properly. I didn't even have to print the contents of scalars, arrays or hashes. The simple as_string dump told me that entire XML documents were used as symbols! In case you don't know what a symbolic reference is, or how you can use an XML document as a symbol, have a look at this code: In $$bar, $bar is called a symbolic reference because it refers by the symbol ("foo") of another variable. Because my old code uses symbolic references, I traced every single occurrence and added debugging print statements. To no avail. But Devel::Symdump's diff method told me that the symbols were created during the parsing. Taking everything out, I ended up with this very small test script:
Even in this small script, where no symbolic reference is used, the entire XML document was used as a symbol. So the problem had to be somewhere in XML::Parser itself, which was both disappointing and frustrating. Debugger to the rescue! So I used perl -d and traced the code, dumping a lot of diffs. Somewhere in XML::Parser::Expat lives a parse method, which takes an XML document or a filehandle as its first argument. The problem is with how it finds out if $arg is a file handle. XML::Parser::Expat assumes that $arg is a filehandle if it is-a IO::Handle, which makes sense. If $arg is tied to a class that has a TIEHANDLE method, it is also assumed to be a file handle. By the way, I think that part of the code doesn't work, because it uses my $class = ref($arg);, while ref(tied($arg)) should have been used. As a fallback method, it tries to see if *{$arg}{IO} exists. Oops! My $arg is an XML document, not a filehandle! It is used as a symbolic reference anyway, and since a symbol table is a hash, the symbol is autovivificated (automatically created because it did not exist). Package global variables have no end of scope, so the XML document remains in memory for the rest of the interpreter's life. And with something that handles a lot of documents, that means a lot of memory is wasted. The solution is quite easy: don't allow an arbitrary string to be used as a symbol. Ever. But that would probably break some existing applications. Even though being able to use a string as a filehandle is a possible security bug, I chose to not break backwards compatibility. If you use *{$arg}, the symbol is autovivificated. If you only test if it is defined, it is not. This means that adding if defined *{$arg} would fix the memory leak. Note: Not if defined *{$arg}{IO}, because that first autovivifies *{$arg} to be able to look at its IO element! Line 456 of my XML/Parser/Expat.pm now reads: Instead of:
I tried to report this bug, along with a patch, to the maintainer, but his e-mail address appears to be no longer in use. If anyone knows how to contact Clark Cooper, please send him this page so the problem can be fixed in the real distribution. Avoiding problems like this is easy:
My Apache processes are now approximately 15 megabytes in size of which 10 are shared. Much better :)
Back to
Meditations
|
|