
Re: xml parsers: do I need one?

by samtregar (Abbot)
on Aug 29, 2003 at 02:29 UTC


in reply to xml parsers: do I need one?

My first reaction was that you must be using XML::Parser incorrectly. To test my assumption I created a 3.3MB XML file with 9000 elements containing random <message> elements. Then I created a regex script (regex.pl) and an XML::Parser script (parser.pl) which both pull out all the <message> contents. I ran them head to head and made sure they both produced the same output:

[sam@localhost xml_test]$ time ./parser.pl > parser.out
real    0m1.827s
user    0m1.730s
sys     0m0.040s
[sam@localhost xml_test]$ time ./regex.pl > regex.out
real    0m0.200s
user    0m0.160s
sys     0m0.040s
[sam@localhost xml_test]$ diff parser.out regex.out

So I'm seeing a simple regex beat a simple XML::Parser implementation by around 9x. Given that your regex takes three minutes, an XML::Parser script taking 35 minutes (about 12x) is in the same ballpark. When you consider that XML::Parser is making multiple Perl sub calls on each element it encounters, I guess it makes sense. But still, ouch!
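For anyone who wants to check the arithmetic, here is a minimal sketch. The variable names are made up for illustration; the numbers are the wall-clock times from my runs above and the three and 35 minutes from the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Wall-clock ("real") times in seconds, copied from the two runs above.
my $parser_secs = 1.827;
my $regex_secs  = 0.200;

# Times from the original question, in minutes.
my $op_parser_min = 35;
my $op_regex_min  = 3;

# How many times slower the parser is in each case.
my $my_ratio = $parser_secs / $regex_secs;
my $op_ratio = $op_parser_min / $op_regex_min;

printf "my test:   %.1fx\n", $my_ratio;
printf "your case: %.1fx\n", $op_ratio;
```

Both ratios land within the same order of magnitude, which is all the comparison is meant to show.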

Of course, if it were my job on the line I'd still use an XML parser. I've been bitten by changing specifications and funky data too many times to take the easy way out in the parser. In fact, these days I parse my XML twice - first with Xerces/C++ for schema validation and second with XML::Simple for actual usage. Better safe than sorry!

-sam


For the record, here's my test setup. First, the data generator:

#!/usr/bin/perl -w

print '<?xml version="1.0" encoding="UTF-8" ?>', "\n";
print "<test>\n";
for (0 .. 9000) {
    my $word = get_word();
    print "<$word>\n";
    if (rand(10) > 3) {
        for (0 .. rand(5)) {
            my $msg = get_words(30);
            print "\t<message>$msg</message>\n";
        }
    }
    print "</$word>\n";
}
print "</test>";

BEGIN {
    my @words;
    open(WORDS, "/usr/dict/words")
        or open(WORDS, "/usr/share/dict/words")
        or die "Can't open /usr/dict/words or /usr/share/dict/words: $!";
    while (<WORDS>) {
        chomp;
        push @words, $_ if /^\w+$/;
    }
    srand(time ^ $$);

    # get a random word
    sub get_word {
        return lc $words[int(rand(scalar(@words)))];
    }

    # get $num random words, joined by $sep, defaulting to " "
    sub get_words {
        my ($num, $sep) = @_;
        $sep = " " unless defined $sep and length $sep;
        return join($sep, map { get_word() } (0 .. ((int(rand($num))) + 1)));
    }
}

The regex parser:

#!/usr/bin/perl -w

open(FILE, 'test.xml') or die $!;
my $xml = join('', <FILE>);
while ($xml =~ m!<message>([\w\s]+)</message>!g) {
    print $1, "\n";
}
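One caveat about this pattern: it only matches when the element is written exactly as <message> and its content is nothing but word characters and whitespace. The generator only emits words matching /^\w+$/, so the test data never strays outside that. The sketch below runs the same pattern over a few made-up inputs (all hypothetical, not taken from the test file) to show the kind of funky data it would silently skip.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as the regex script above.
my $pattern = qr!<message>([\w\s]+)</message>!;

# Hypothetical inputs; every one is a well-formed XML <message> element.
my @samples = (
    '<message>plain words only</message>',
    '<message>fish &amp; chips</message>',
    '<message id="7">has an attribute</message>',
    '<message><![CDATA[wrapped in cdata]]></message>',
    '<message>hyphen-ated, with punctuation.</message>',
);

# Collect the samples the pattern recognises.
my @matched = grep { $_ =~ $pattern } @samples;

for my $sample (@samples) {
    my $status = ($sample =~ $pattern) ? 'match' : 'MISS';
    printf "%-5s %s\n", $status, $sample;
}
```

Only the first sample matches. The others are skipped without any warning, which is the failure mode an XML parser protects against.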

And the XML::Parser script:

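#!/usr/bin/perl -w
use strict;
use XML::Parser;

# Zero the flag before parsing starts, so the handlers see a defined value.
my $in_msg = 0;

my $p = XML::Parser->new(Style => 'Stream', Pkg => 'main');
$p->parsefile('test.xml');

# In Stream style, $_ holds the original tag string (or the accumulated text).
sub StartTag { $in_msg++ if $_ eq '<message>'; }
sub Text     { print $_ if $in_msg; }
sub EndTag {
    if ($_ eq '</message>') {
        $in_msg--;
        print "\n";
    }
}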

Replies are listed 'Best First'.
Re: Re: xml parsers: do I need one?
by mirod (Canon) on Aug 29, 2003 at 19:36 UTC

    Nice benchmark. I wouldn't use XML::Parser's Stream style, but it's probably because I am not very familiar with it.
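    For comparison, here is a rough sketch of the same extraction using XML::Parser's plain Handlers interface instead of the Stream style. It is a guess at what a non-Stream version could look like, not the code from my benchmark, and the inline sample document and variable names are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample document, shaped like the generated test data.
my $xml = <<'XML';
<test>
  <apple>
    <message>first message</message>
    <message>second message</message>
  </apple>
  <pear></pear>
</test>
XML

my $in_msg = 0;     # true while inside a <message> element
my $buffer = '';    # text collected for the current <message>
my @messages;       # extracted message contents

my $p = XML::Parser->new(
    Handlers => {
        # Start and End handlers receive the expat object and the element name.
        Start => sub {
            my ($expat, $element) = @_;
            if ($element eq 'message') { $in_msg = 1; $buffer = ''; }
        },
        # Char may be called several times for one text node, so accumulate.
        Char => sub {
            my ($expat, $text) = @_;
            $buffer .= $text if $in_msg;
        },
        End => sub {
            my ($expat, $element) = @_;
            if ($element eq 'message') { push @messages, $buffer; $in_msg = 0; }
        },
    },
);
$p->parse($xml);

print "$_\n" for @messages;
```

    The handlers get the element name as an argument rather than comparing tag strings held in $_, which I find easier to follow.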

    I expanded this benchmark slightly, creating a somewhat more complicated document, still around 3M and 10K elements, and ran a bunch of modules on it. The results are actually quite surprising:

    10160 elements generated - (63 top level - 1721 to extract)
    bench_regexp             : 0:00.16 real 0.14  0.03 s
    bench_libxml             : 0:00.44 real 0.39  0.05 s
    bench_parser             : 0:00.88 real 0.83  0.01 s
    bench_parser_stream      : 0:01.15 real 1.10  0.06 s
    bench_twig               : 0:01.84 real 1.81  0.03 s
    bench_sax_base_libxml    : 0:03.29 real 3.25  0.05 s
    bench_sax_libxml         : 0:03.32 real 3.31  0.03 s
    bench_sax_expat          : 0:03.21 real 3.11  0.03 s
    bench_dom                : 0:04.51 real 4.41  0.03 s
    libxslt                  : 0:01.48 real 1.46  0.02 s
    xml_grep                 : 0:02.07 real 2.02  0.03 s
    

    I am very surprised by how slow the XML::SAX examples are (which is why I wrote one using XML::SAX::Base and one not using it). I did not expect this, and I will try to figure out what the problem is. If you look at the code, I really don't think I am using the PurePerl parser; I took great care to create the parser myself. That's odd.

    Code and everything to run it is at http://xmltwig.com/article/simple_benchmark/.

    Get simple_benchmark.tar.gz

    tar zxvf simple_benchmark.tar.gz
    cd simple_benchmark
    perl run_all

    Note that the xml_grep version only works with the latest, greatest release of the tool, available somewhere else on the same site (with the development version of XML::Twig).

      Wow, and I thought I was going overboard! Very interesting. I'd like to see how Xerces/C++ fares too, but I still can't build XML::Xerces. I've been using the DOMCount example program to do XML Schema validation...

      -sam

Re: Re: xml parsers: do I need one?
by regan (Sexton) on Aug 29, 2003 at 13:47 UTC
    I did some thinking, and decided:
    -speed was NOT essential, as this could really be done via a cron job that runs once in a while
    -The xml should not change (we've all heard that one!), but we both know that it will
    -xml parsing is far more flexible than regex
    -regex is far faster in this case
    -xml parsing is both cooler, and the Right Thing to do

    I will use a parser. I don't know which one, or how to pick it, but since I have experience using Java's DOM parser, I'll probably stick with XML::DOM.

      since I have experience using Java's DOM parser, I'll probably stick with XML::DOM

      Please DON'T!

      XML::DOM is an old and barely maintained module. XML::LibXML implements the DOM, plus XPath, which makes it a LOT easier (and safer) to use. It drives me crazy to see how many people still choose XML::DOM now, when better alternatives are clearly available.

      <rant mode="on">The fact that a module has the name of a standard does not mean a thing. More exactly, it just means that the author was the first one to write a module implementing some of the standard. It does not mean that it is a good module, that you should use it, or that there isn't a better module available. Use your brain! </rant>
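      To make the XPath point concrete, here is a small sketch of pulling out <message> contents with XML::LibXML. The sample document and variable names are invented for illustration, and it assumes XML::LibXML is installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document; note the attribute and the entity.
my $xml = <<'XML';
<test>
  <apple>
    <message>plain words only</message>
    <message id="7">fish &amp; chips</message>
  </apple>
</test>
XML

# Parse the string into a DOM document.
my $doc = XML::LibXML->load_xml(string => $xml);

# One XPath expression selects every <message> element, at any depth.
my @messages = map { $_->textContent } $doc->findnodes('//message');

print "$_\n" for @messages;
```

      The parser decodes the entity and ignores the attribute, so both messages come out as plain text without any special-casing.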
