http://www.perlmonks.org?node_id=409517

I would like to relate to the monastery some recent experiments on benchmarking the parsing speed of XML documents with XML::Simple, but with various parsers plugged-in at the back of it.

Background

There are many ways to parse XML, and Perl provides more ways than probably any other language. I prefer to use XML::Simple where I can, as it allows me to quickly start solving the problem at hand rather than be distracted by parser issues. That said, XML::Simple has some drawbacks too - it can be painfully slow, and has approximately a billion options. Once you are over the learning curve of which options are usually required, you're still stuck with the speed issue. Hopefully this meditation will help you make some informed choices in that regard.

Motivation

We have a client who provides all their data to us in XML files. We process this XML, and supply the client with a new data format, and other data, all burned onto a shiny CD, every day. The supplied XML is biggish - up to 2 MB per file, and we might get a hundred such files in a day.
My naive implementation using XML::Simple was too slow - sometimes up to 6 minutes per file. It wasn't really the server's fault - it is an Enterprise-class Sun box with 6 processors and gigs of RAM. Granted it is a busy machine, but performance was pathetic.
One thing holding me back from a wholesale rewrite closer to a low-level parser was that the implementation using XML::Simple was correct, and it had taken a huge effort to get there. A rewrite would require re-verification all over again - no thanks.
I decided to see if a way could be found to get better performance whilst keeping our XML::Simple-based implementation.

Preparation and Execution

I read the doco for XML::Simple again, paying special attention to the sections 'SAX Support' and 'Environment'. I then downloaded a number of libraries and modules. Some come with perl, some with particular OS vendors' distributions of perl. I recommend you scan your own system(s) to see what you may need.

  • expat - C library for parsing XML - 1.95.7
  • libxml2 - C library for parsing XML - 2.5.4
  • XML::Parser - perl wrapper around expat - 2.34
  • XML::LibXML - perl wrapper around libxml2 - 1.58
  • XML::SAX - perl module that supplies or consumes SAX events - 0.12
  • XML::SAX::Expat - backend for XML::SAX that uses the Expat library to supply SAX events - 0.37
  • XML::LibXML::SAX - backend for XML::SAX that uses the libxml2 library to supply SAX events - 1.00
  • XML::Simple - converts an XML document to a perl hash (roughly) - can have different backends drive this hash creation - 2.09
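
To scan a system for the modules listed above, something like the following may help. It is only a sketch using core Perl; the helper name module_version is made up for illustration, and the module list is simply copied from the list above (expat and libxml2 are C libraries, so they are not covered here).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the installed version of a module, or undef if it cannot be loaded.
# (module_version is a helper name invented for this sketch.)
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # XML::Simple -> XML/Simple.pm
    return unless eval { require $file; 1 };
    my $version = eval { $module->VERSION };
    return defined $version ? $version : 'unknown';
}

# Module names taken from the list in this post
my @modules = qw(
    XML::Parser XML::LibXML XML::SAX XML::SAX::Expat
    XML::LibXML::SAX XML::Simple
);

for my $module (@modules) {
    my $version = module_version($module);
    printf "%-20s %s\n", $module,
        defined $version ? $version : 'not installed';
}
```

Anything reported as 'not installed' would need to be fetched before it can be tried as a backend.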

<plug mode="on">I won't go into using perl to do XML parsing via the SAX and DOM paradigms - if you want to know more, come to my talk on this at the OSDC conference.</plug>

Also, I used the following code - note it isn't like most benchmark code you see in the monastery - for a start it doesn't 'use Benchmark;'. This wasn't necessary as we are measuring an operation that takes tens of seconds, not micro- or milliseconds. The differences in speed stand out without the help of the Benchmark module.
This code is pretty ugly too, but I believe it doesn't need a lot of trimming/reshaping. I was careful to make sure the measurements are tightly wrapped around the method/function calls, so as not to artificially inflate durations.

#!/usr/bin/perl -w
use strict;

use XML::Simple qw(:strict);
use Time::HiRes qw(time);
use File::stat;
use Test::More qw(no_plan);

my $xs;
my $XMLFile = $ARGV[0];
my $size = stat($XMLFile)->size();
print "File $XMLFile is " . $size . " bytes\n";

my ($start, $end);

$xs = XML::Simple->new(ForceArray => 0, KeyAttr => {});

# An empty string lets XML::Simple pick its default backend
my $backend = '';
my $xml_default;
{
    local $ENV{XML_SIMPLE_PREFERRED_PARSER} = $backend;
    $start = time();
    $xml_default = $xs->XMLin($XMLFile);
    $end = time();
}
print_result($backend, $end, $start, $size);

$backend = 'XML::Parser';
my $xml_x_p;
{
    local $ENV{XML_SIMPLE_PREFERRED_PARSER} = $backend;
    $start = time();
    $xml_x_p = $xs->XMLin($XMLFile);
    $end = time();
}
print_result($backend, $end, $start, $size);

$backend = 'XML::SAX::Expat';
my $xml_x_s_e;
{
    local $ENV{XML_SIMPLE_PREFERRED_PARSER} = $backend;
    $start = time();
    $xml_x_s_e = $xs->XMLin($XMLFile);
    $end = time();
}
print_result($backend, $end, $start, $size);

$backend = 'XML::LibXML::SAX';
my $xml_x_l_s;
{
    local $ENV{XML_SIMPLE_PREFERRED_PARSER} = $backend;
    $start = time();
    $xml_x_l_s = $xs->XMLin($XMLFile);
    $end = time();
}
print_result($backend, $end, $start, $size);

# Every backend must produce the same data structure as the default
is_deeply($xml_default, $xml_x_p);
is_deeply($xml_default, $xml_x_s_e);
is_deeply($xml_default, $xml_x_l_s);

sub print_result {
    my ($backend, $end, $start, $size) = @_;
    my $label    = $backend || 'default';    # '' means XML::Simple's own choice
    my $duration = $end - $start;

    print "XML::Simple with $label backend took ",
        sprintf("%02.4f", $duration), " seconds. ";
    # Note: $size / $duration is bytes per second; the label is kept as
    # posted so that it matches the output shown under Results.
    print "This equates to ", sprintf("%02.4f", $size / ($duration)),
        " kilobytes per second (1024 bytes per k)\n";
}

Notice the is_deeply() calls - these confirm that the different backends all cause XML::Simple to generate the same data structure.
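
The script selects a backend through the XML_SIMPLE_PREFERRED_PARSER environment variable; the XML::Simple documentation also describes a package variable, $XML::Simple::PREFERRED_PARSER, for the same purpose. Below is a minimal sketch of pinning a backend inside application code. It assumes XML::Simple is installed, and the helper name parse_with_preferred_backend is made up here; it is not part of XML::Simple.

```perl
use strict;
use warnings;

# parse_with_preferred_backend() is a hypothetical helper for this sketch.
# It returns the XML::Simple data structure, or undef if XML::Simple itself
# is not installed.
sub parse_with_preferred_backend {
    my ($source, $backend) = @_;
    return unless eval { require XML::Simple; 1 };

    # Only ask for the backend if it can actually be loaded; otherwise leave
    # the variable undefined so XML::Simple makes its usual choice.
    (my $file = "$backend.pm") =~ s{::}{/}g;
    my $available = eval { require $file; 1 };

    no warnings 'once';
    local $XML::Simple::PREFERRED_PARSER = $available ? $backend : undef;

    # Same options as the benchmark script
    my $xs = XML::Simple->new(ForceArray => 0, KeyAttr => {});
    return $xs->XMLin($source);    # accepts a filename or an XML string
}

my $data = parse_with_preferred_backend(
    '<doc><item>a</item><item>b</item></doc>', 'XML::Parser');
print defined $data ? "parsed ok\n" : "XML::Simple not installed\n";
```

Using local keeps the setting scoped to the one call, in the same way the benchmark script localises the environment variable inside each block.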

Results

[le6303@itdevtst perl]$ perl xml.pl bigxml.xml
File bigxml.xml is 1730463 bytes
XML::Simple with default backend took 12.9769 seconds. This equates to 133349.4084 kilobytes per second (1024 bytes per k)
XML::Simple with XML::Parser backend took 3.6010 seconds. This equates to 480549.2074 kilobytes per second (1024 bytes per k)
XML::Simple with XML::SAX::Expat backend took 13.6003 seconds. This equates to 127237.2038 kilobytes per second (1024 bytes per k)
XML::Simple with XML::LibXML::SAX backend took 6.3547 seconds. This equates to 272310.8906 kilobytes per second (1024 bytes per k)
ok 1
ok 2
ok 3
1..3

So XML::Parser 'wins', cutting the runtime to about 26% of its slowest cousin's (3.6 seconds against 13.6 for XML::SAX::Expat). On our platforms here we are actually seeing reductions to around 10% - mission accomplished.
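
The arithmetic behind these figures can be cross-checked from the file size and durations in the output. The sketch below copies those numbers by hand; the helper names are made up for illustration. It also shows that the throughput the script labels "kilobytes per second" is size divided by duration with no division by 1024, so it is really bytes per second.

```perl
use strict;
use warnings;

my $bytes   = 1_730_463;    # file size reported by the script
my %seconds = (             # durations copied from the results above
    'default'          => 12.9769,
    'XML::Parser'      =>  3.6010,
    'XML::SAX::Expat'  => 13.6003,
    'XML::LibXML::SAX' =>  6.3547,
);

# Runtime of one backend as a percentage of another's
sub relative_runtime {
    my ($fast, $slow) = @_;
    return 100 * $fast / $slow;
}

# Throughput in KiB (1024 bytes) per second
sub kib_per_second {
    my ($size, $secs) = @_;
    return $size / 1024 / $secs;
}

# Find the slowest backend, then compare every backend against it
my ($slowest) = sort { $seconds{$b} <=> $seconds{$a} } keys %seconds;

for my $backend (sort { $seconds{$a} <=> $seconds{$b} } keys %seconds) {
    printf "%-18s %8.4f s  %6.1f%% of %s  %8.2f KiB/s\n",
        $backend, $seconds{$backend},
        relative_runtime($seconds{$backend}, $seconds{$slowest}),
        $slowest,
        kib_per_second($bytes, $seconds{$backend});
}
```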

Analysis

Tracing the code and comparing to the doco reveal

Update 16:53 23 Nov 04 Just for comparison, running XML::Parser alone on the same file in 'Subs' style takes, on average, 2.1 seconds.
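
For readers who have not met the 'Subs' style: XML::Parser calls a sub named after each element when it starts, and a sub with the same name plus a trailing underscore when it ends, both looked up in the package given by the Pkg option. The sketch below is not the code used for the timing above; the package name, element names and counting logic are all invented for illustration, and it assumes XML::Parser is installed.

```perl
use strict;
use warnings;

package CountHandlers;    # hypothetical handler package for this sketch

our %seen;
sub item  { $seen{opened}++ }    # called at each <item> start tag
sub item_ { $seen{closed}++ }    # called at each </item> end tag

package main;

# Count <item> elements in an XML string using the 'Subs' style.
# Returns a hash ref of counts, or undef if XML::Parser is not installed.
sub count_items {
    my ($xml) = @_;
    return unless eval { require XML::Parser; 1 };

    %CountHandlers::seen = ();
    my $parser = XML::Parser->new(Style => 'Subs', Pkg => 'CountHandlers');
    $parser->parse($xml);    # elements without a matching sub are ignored
    return { %CountHandlers::seen };
}

my $counts = count_items('<doc><item>a</item><item>b</item></doc>');
if ($counts) {
    print "opened: $counts->{opened}, closed: $counts->{closed}\n";
}
else {
    print "XML::Parser not installed\n";
}
```

No intermediate data structure is built here, which is one plausible reason a bare XML::Parser run comes in faster than XML::Simple on top of it.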

use brain;

Replies are listed 'Best First'.
Re: XML::Simple Benchmarks with various backends
by eric256 (Parson) on Nov 22, 2004 at 15:16 UTC

    I was curious so I ran my own benchmark. It seems to validate yours. Looks like a consistent 200% increase in speed with XML::Parser.

    use strict;
    use warnings;
    use XML::Simple;
    use Benchmark qw/cmpthese/;

    cmpthese(-5, {
        "Default"         => sub { loadit(); },
        "XML::Parser"     => sub { loadit("XML::Parser"); },
        "XML::SAX::Expat" => sub { loadit("XML::SAX::Expat"); },
    });

    sub loadit {
        my $parser = shift || "";
        local $ENV{XML_SIMPLE_PREFERRED_PARSER} = $parser;
        my $xs  = XML::Simple->new(ForceArray => 0, KeyAttr => {});
        my $xml = $xs->XMLin("xmltest.xml");
    }

    __DATA__
    C:\test>perl xmlbench.pl
                      Rate XML::SAX::Expat         Default     XML::Parser
    XML::SAX::Expat 4.23/s              --             -3%            -67%
    Default         4.36/s              3%              --            -66%
    XML::Parser     12.8/s            203%            194%              --

    C:\test>perl xmlbench.pl
                      Rate XML::SAX::Expat         Default     XML::Parser
    XML::SAX::Expat 4.35/s              --             -0%            -66%
    Default         4.36/s              0%              --            -66%
    XML::Parser     12.9/s            197%            196%              --

    C:\test>perl xmlbench.pl
                      Rate XML::SAX::Expat         Default     XML::Parser
    XML::SAX::Expat 4.37/s              --             -1%            -66%
    Default         4.40/s              1%              --            -66%
    XML::Parser     13.0/s            197%            195%              --


    ___________
    Eric Hodges
Re: XML::Simple Benchmarks with various backends
by Zaxo (Archbishop) on Nov 22, 2004 at 21:03 UTC

    ++leriksen

    I applied this wisdom to my framechat2 installation, with gratifying results. Thanks for pointing out this feature of XML::Simple.

    After Compline,
    Zaxo

Re: XML::Simple Benchmarks with various backends
by samtregar (Abbot) on Nov 22, 2004 at 20:19 UTC
    You missed XML::SAX::ExpatXS. In the last tests I saw it was the fastest SAX parser.

    -sam

      OK I added this one

      • XML::SAX::ExpatXS - a SAX event generator wrapped around Expat, quite possibly a closer binding than XML::SAX::Expat - 1.02

      Extra code is

      $backend = 'XML::SAX::ExpatXS';
      my $xml_x_s_exs;
      {
          local $ENV{XML_SIMPLE_PREFERRED_PARSER} = $backend;
          $start = time();
          $xml_x_s_exs = $xs->XMLin($XMLFile);
          $end = time();
      }
      print_result($backend, $end, $start, $size);
      ...
      is_deeply($xml_default, $xml_x_s_exs);

      New results are

      [le6303@itdevtst perl]$ ./xml.pl bigxml.xml
      File bigxml.xml is 1730463 bytes
      XML::Simple with default backend took 12.7477 seconds. This equates to 135747.5992 kilobytes per second (1024 bytes per k)
      XML::Simple with XML::Parser backend took 3.5870 seconds. This equates to 482431.8874 kilobytes per second (1024 bytes per k)
      XML::Simple with XML::SAX::Expat backend took 13.7021 seconds. This equates to 126292.1408 kilobytes per second (1024 bytes per k)
      XML::Simple with XML::SAX::ExpatXS backend took 5.9447 seconds. This equates to 291093.0139 kilobytes per second (1024 bytes per k)
      XML::Simple with XML::LibXML::SAX backend took 6.5013 seconds. This equates to 266172.9123 kilobytes per second (1024 bytes per k)
      ok 1
      ok 2
      ok 3
      ok 4
      1..4

      So a good result, but the ol' original XML::Parser is still faster. ExpatXS may be a faster parser, but it may not be as effective as a SAX generator. More research is required - anyone, anyone...

      I may get time on the weekend to try these with some more types of XML document, e.g. one that references external entities (standalone="no"), one with large uuencoded binary blobs in CDATA sections, very small files, etc.

      use brain;