Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling

(XML::Parser) Finding and fixing a bug

by Juerd (Abbot)
on Apr 19, 2003 at 11:03 UTC ( #251630=perlmeditation: print w/replies, xml ) Need Help??

The problem described in this post has been fixed in XML::Parser 2.32.

(Using Perl 5.8.0 and XML::Parser 2.31)

Have you ever noticed that many scripts that use XML::Parser leak memory? If you use those scripts in a long lived environment like mod_perl, you may have.

XML::Parser's main method is parse. The method takes as its first argument either a complete XML document or a filehandle. This flexibility is very expensive if not handled correctly.

After migrating a legacy web application to mod_perl (using Apache::PerlRun because the code is not at all strict-compliant), I noticed that Apache's processes used a lot of memory.

Hoping to find out where all that memory went, I used Devel::Symdump. It very quickly became one of my favourite modules. Devel::Symdump->rnew->as_string is teriffic. I was going to dump the contents of every scalar, every array and every hash, but I let it dump the entire symbol table first, to find out if Devel::Symdump actually worked properly.

I didn't even have to print the contents of scalars, arrays or hashes. The simple as_string dump told me that entire XML documents were used as symbols!

In case you don't know what a symbolic reference is, or how you can use an XML document as a symbol, have a look at this code:

no strict; $foo = "Hello, world!\n"; $bar = "foo"; print $$bar;
In $$bar, $bar is called a symbolic reference because it refers by the symbol ("foo") of another variable.

Because my old code uses symbolic references, I traced every single occurrence and added debugging print statements. To no avail. But Devel::Symdump's diff method told me that the symbols were created during the parsing.

Taking everything out, I ended up with this very small test script:

use strict; use Devel::Symdump; use XML::Parser; my $parser = XML::Parser->new Handlers => { Start => sub { }, End => sub { }, Char => sub { }, XMLDecl => sub { }, } ); my $wml = `cat test.wml`; my $dump = Devel::Symdump->rnew; $parser->parse($wml); print $dump->diff(Devel::Symdump->rnew);

Even in this small script, where no symbolic reference is used, the entire XML document was used as a symbol.

So the problem had to be somewhere in XML::Parser itself, which was both disappointing and frustrating. Debugger to the rescue! So I used perl -d and traced the code, dumping a lot of diffs.

Somewhere in XML::Parser::Expat lives a parse method, which takes an XML document or a filehandle as its first argument. The problem is with how it finds out if $arg is a file handle.

XML::Parser::Expat assumes that $arg is a filehandle if it is-a IO::Handle, which makes sense. If $arg is tied to a class that has a TIEHANDLE method, it is also assumed to be a file handle. By the way, I think that part of the code doesn't work, because it uses my $class = ref($arg);, while ref(tied($arg)) should have been used.

As a fallback method, it tries to see if *{$arg}{IO} exists.

Oops! My $arg is an XML document, not a filehandle! It is used as a symbolic reference anyway, and since a symbol table is a hash, the symbol is autovivificated (automatically created because it did not exist). Package global variables have no end of scope, so the XML document remains in memory for the rest of the interpreter's life. And with something that handles a lot of documents, that means a lot of memory is wasted.

The solution is quite easy: don't allow an arbitrary string to be used as a symbol. Ever. But that would probably break some existing applications. Even though being able to use a string as a filehandle is a possible security bug, I chose to not break backwards compatibility.

If you use *{$arg}, the symbol is autovivificated. If you only test if it is defined, it is not. This means that adding if defined *{$arg} would fix the memory leak. Note: Not if defined *{$arg}{IO}, because that first autovivifies *{$arg} to be able to look at its IO element!

Line 456 of my XML/Parser/ now reads:

$ioref = *{$arg}{IO} if defined *{$arg};
Instead of:
$ioref = *{$arg}{IO};

I tried to report this bug, along with a patch, to the maintainer, but his e-mail address appears to be no longer in use. If anyone knows how to contact Clark Cooper, please send him this page so the problem can be fixed in the real distribution.

Avoiding problems like this is easy:

  • Don't allow two different things to be used in the same way: use separate methods (parse_string and parse_filehandle) or use named arguments (parse(string => ...) and parse(fh => ...)).
  • Don't allow a string to be used as a filehandle. (parse(\*FOO) instead of parse('FOO') or even parse(FOO))

My Apache processes are now approximately 15 megabytes in size of which 10 are shared. Much better :)

- (do not use).

Replies are listed 'Best First'.
(jeffa) Re: (XML::Parser) Finding and fixing a bug
by jeffa (Bishop) on Apr 19, 2003 at 16:17 UTC
    Did you try RTCPAN? The page for reporting bugs for XML::Parser is here.

    Nice detective work, by the way! :)

    Good point - i had to ask anyway. Anyhoo, this bug report looks strangely similar ...


    (the triplet paradiddle with high-hat)

      Anyhoo, this bug report looks strangely similar ...

      Argh! "It doesn't work with strict, so let's disable strict."-people. Ack! Most of the time, strict complains because you're doing something wrong, not just because it likes to complain (it does like to complain, but that is not the point).

      There is one thing that I don't understand, though. The bug reporter uses XML::Parser 2.31, and so do I. But in his version, he had to add "no strict 'refs';" himself, while it was already there in mine.

      Let's see... Is it in the distribution? tells me it isn't. Did Debian fix it then? Yes, Debian did do this.

      /me wanders off to file a bug report to Debian...

      - (do not use).

Re: (XML::Parser) Finding and fixing a bug
by vek (Prior) on Apr 19, 2003 at 21:05 UTC
    First of all, Juerd++, great detective work.

    If anyone knows how to contact Clark Cooper, please send him this page so the problem can be fixed in the real distribution.

    I think that Clark Cooper has all but disappeared from the face of the earth. There was a thread on the perl-xml mailing list at the beginning of this year regarding that very problem. I know that matts offered to maintain the module but I've since dropped off the list and don't know how much further that went.

    Try posting your findings to the mailing list and see who bites...

    -- vek --
      I don't buy it. XML-Paser-3.21 was released April 02, 2002. Sure, today is April 19 (PST), and if Clark has been MIA until today, but being MIA for 17 days is not cause for alarm.

      update: D'oh, I see, 2002 (thanks grantm, what's wrong with me). Somebody ought to takeover maintenance then (it's been enough talk, it's time for action), or at least create an emergency release.

      MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
      I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
      ** The Third rule of perl club is a statement of fact: pod is sexy.

        XML-Parser-3.21? Surely you mean XML-Parser 2.31 :-)
Re: (XML::Parser) Finding and fixing a bug
by grantm (Parson) on Apr 19, 2003 at 21:38 UTC

    ++Juerd great detective work!

    I raised this exact issue on the perl-xml list earlier this week after I received email from an XML::Simple user who was having problems with a memory leak. I did manage to reduce it to a simple test case that involved only XML::Parser, but I hadn't worked out how to proceed with identifying the cause. (I foolishly assumed Devel::Symdump would only work with package globals and was unlikely therefore to help in my quest).

    vek is correct that matts offered to take XML::Parser under his wing earlier this year. However, he's been very busy recently so if someone else wanted to volunteer, then I'd suggest taking the discussion to the perl-xml mailing list.

Re: (XML::Parser) Finding and fixing a bug
by mattr (Curate) on Apr 20, 2003 at 13:09 UTC
    /me returns from web reconaissance mission.. and I think part of CPAN think coopercl |at| works.
    CPAN's author page and the PAUSE author index has clark |at| (changed May 2002 )
    The coopercc |at| account mentioned in the latest pod is closed.
    Cooper left the expat sourceforge project prior to June 21, 2002 according to a post on expat ML by Fred Drake who runs it now, but does python not perl (refers us to Activestate perl-xml ML)

    Matt Sergeant said in January 2003 on perl-xml that Cooper is at his (quiet) website but not doing much perl stuff these days. Cooper does not respond to most perl questions (though maybe he did to Matt S.?) Several people on this activestate thread in the end of January supported his trying to become co-owner of the project.

    Cooper's real address (probably equivalent to the address) based on whois is per6 |at|, updated less than two weeks ago: 08-Apr-2003. I would not recommend calling him, though maybe if it is too much work for one person and many people are frustrated a new collaborative sourceforge project might be in order? (not me, I don't use it enough)..
Re: (XML::Parser) Finding and fixing a bug
by Juerd (Abbot) on Apr 20, 2003 at 13:50 UTC
Re: (XML::Parser) Finding and fixing a bug
by Matts (Deacon) on Apr 20, 2003 at 22:03 UTC
    I have now requested dual ownership of XML::Parser. If/when this happens I will apply the bug fix and a few other changes I wish to make and release an update.
      Hi Matt,

      I'm getting a memory leak in some old code I have inherited - it does this:

      sub getXMLTree{ my $parser = new XML::DOM::Parser; $tree = ""; if(!open(MYFILE,"$fileName")) { $tree = $parser->parse("<root></root>"); $tree = createHeader($tree); $tree = createBody($tree); }else{ close(MYFILE); $tree = $parser->parsefile($fileName); } $tree = updateHeader($tree); return ($tree); }
      $filename being a string like "/tmp/myfile.xml"

      Can you (or anyone) clarify, should I change how this is done or get a new version of XML::DOM::Parser (I just installed it last week so I cant see how that would help tho)

      ___ /\__\ "What is the world coming to?" \/__/
Re: (XML::Parser) Finding and fixing a bug
by grantm (Parson) on Apr 21, 2003 at 21:50 UTC

    I'm still struggling to understand what the original (buggy) code was trying to achieve. Was it to enable something like this to actually work:

    use XML::Parser; open CONFIG, ('config.xml') || die $!; my $parser = XML::Parser->new(Style => 'Tree'); my $config = $parser->parse('CONFIG');

    If so, that doesn't seem to work even using the unpatched version of the module. Can anyone give a working example of providing a symbol name to the parse method? If the original code does not actually work as intended, then a patched version wouldn't need to worry about backwards compatibility.

      my $config = $parser->parse('CONFIG');

      my $config = $parser->parse(*CONFIG);
      But that would better be checked with something like ref(\$arg) eq 'GLOB'.

      But... XML::Parser::Expat::parse allows strings to be used. In your try, it'll try to use XML::Parser::Expat::CONFIG. Things change if your symbolic reference contains :::

      my $config = $parser->parse('main::CONFIG');
      It is unlikely that someone uses a string constant there, but it COULD be some $foo that is the result of whatever, which could in turn be some stringified version of *CONFIG. *CONFIG stringifies to *main::CONFIG, and with that * there it still works:
      my $config = $parser->parse('*main::CONFIG');

      I didn't want to break (stupid) code like this:

      sub something_that_stringifies { "$_[0]" } my $fh = something_that_stringifies *CONFIG; my $config = $parser->parse($fh);

      - (do not use).

Re: (XML::Parser) Finding and fixing a bug
by morbus (Sexton) on Apr 22, 2003 at 17:52 UTC
    I can confirm this works with XML::Parser 2.27 on Win32. I didn't have the problem myself, but one of my users implemented the line change and he dropped from 100+ megs of usage to 26meg. Thanks Juerd! This had been something plaguing me that I couldn't duplicate, nor determine how best to debug.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://251630]
Approved by claree0
Front-paged by demerphq
[ambrus]: erix: that one actually sucks. these days people should get rid of the old notion that TeX is the only thing you can use for decent mathematics writing, because MS Office and LibreOffice have reached the
[ambrus]: level where people can more easily write as good mathematical papers in them as the people who write bad LaTeX papers usually write.
[ambrus]: Yes, for like the first twenty of its years, TeX was basically the only system that allowed you to write decent maths papers, and C++ and PHP were programming languages that sucked, etc. But times change and people have to accept that.
Discipulus bad people + good tool < normal people + decent tool
[Discipulus]: php does not suck anymore?
[ambrus]: Discipulus: I'm not sure, but it certainly doesn't suck as much as it's used to. it's like C++, it sucks because people still recursively learn from twenty year old PHP examples,
[ambrus]: and they try to use the obsolete features that PHP has to support only for compatibility with old scripts. C++ and PHP both have the problem that people can't forget the past, because when they google "PHP" plus the problme they want to solve, they find b
[ambrus]: ad code examples.
[ambrus]: I'm not trying to recommend PHP, but I think it has way too bad a name because of its past.

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (10)
As of 2017-09-26 11:08 GMT
Find Nodes?
    Voting Booth?
    During the recent solar eclipse, I:

    Results (293 votes). Check out past polls.