Background
There are many ways to parse XML, and Perl provides more ways than probably any other language. I prefer to use XML::Simple where I can, as it lets me start solving the problem at hand quickly rather than being distracted by parser issues. That said, XML::Simple has some drawbacks too: it can be painfully slow, and it has approximately a billion options. Once you are over the learning curve of which options are usually required, you're still stuck with the speed issue. Hopefully this meditation will help you make some informed choices in that regard.
Motivation
We have a client who provides all their data to us in XML files. We process this XML and supply the client with a new data format, plus other data, all burned onto a shiny CD, every day. The supplied XML is biggish, up to 2MB per file, and we might get a hundred such files in a day.
My naive implementation using XML::Simple was too slow, sometimes taking up to 6 minutes per file. It wasn't really the server's fault: it is an Enterprise-class Sun box with 6 processors and gigs of RAM. Granted, it is a busy machine, but performance was pathetic.
One thing holding me back from a wholesale rewrite closer to a low-level parser was that the implementation using XML::Simple was correct, and had taken a huge effort to get there. A rewrite would require re-verification all over again - no thanks.
I decided to see if a way could be found to get better performance whilst keeping our XML::Simple-based implementation.
Preparation and Execution
I read the doco for XML::Simple again, paying special attention to the sections 'SAX Support' and 'Environment'. I then downloaded a number of libraries and modules, listed below with the versions I used. Some come with Perl, some with particular OS vendors' distributions of Perl. I recommend you scan your own system(s) to see what you may need.
- expat - C library for parsing XML - 1.95.7
- libxml2 - C library for parsing XML - 2.5.4
- XML::Parser - perl wrapper around expat - 2.34
- XML::LibXML - perl wrapper around libxml2 - 1.58
- XML::SAX - perl module that supplies or consumes SAX events - 0.12
- XML::SAX::Expat - backend for XML::SAX that uses the Expat library to supply SAX events - 0.37
- XML::LibXML::SAX - backend for XML::SAX that uses the libxml2 library to supply SAX events - 1.00
- XML::Simple - converts an XML document to a perl hash (roughly) - can have different backends drive this hash creation - 2.09
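One way to do that scan is sketched below. It only covers the Perl modules in the list, not the expat and libxml2 C libraries, and module_version() is a helper name made up for this example rather than anything the modules provide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the installed version of a module,
# or undef if the module cannot be loaded at all.
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # XML::SAX -> XML/SAX.pm
    return undef unless eval { require $file; 1 };
    my $version = $module->VERSION;            # UNIVERSAL::VERSION
    return defined $version ? $version : 'unknown';
}

# The Perl modules from the list above.
my @modules = qw(
    XML::Parser XML::LibXML XML::SAX
    XML::SAX::Expat XML::LibXML::SAX XML::Simple
);

for my $module (@modules) {
    my $version = module_version($module);
    printf "%-18s %s\n", $module,
        defined $version ? $version : 'not installed';
}
```

Anything reported as not installed is a backend XML::Simple cannot use on that machine, so it is worth running this on every box the code will be deployed to.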
<plug mode="on">I won't go into using Perl to do XML parsing via the SAX and DOM paradigms - if you want to know more, come to my talk on this at the OSDC conference.</plug>
Also, I used the following code. Note that it isn't like most benchmark code you see in the Monastery; for a start, it doesn't 'use Benchmark;'. That wasn't necessary, as we are measuring an operation that takes tens of seconds, not microseconds or milliseconds. The differences in speed stand out without the help of the Benchmark module.
This code is pretty ugly too, but I believe it doesn't need a lot of trimming/reshaping. I was careful to make sure the measurements are tightly wrapped around the method/function calls, so as not to artificially inflate durations.
Notice the is_deeply() calls - these confirm that the different backends all cause XML::Simple to generate the same data structure.

    #!/usr/bin/perl -w
    use strict;
    use XML::Simple qw(:strict);
    use Time::HiRes qw(time);
    use File::stat;
    use Test::More qw(no_plan);

    my $xs;
    my $XMLFile = $ARGV[0];
    my $size    = stat($XMLFile)->size();
    print "File $XMLFile is " . $size . " bytes\n";

    my ($start, $end);
    $xs = XML::Simple->new(ForceArray => 0, KeyAttr => {});

    # Default backend: no preferred parser set
    my $backend = '';
    my $xml_default;
    {
        local $ENV{XML_SIMPLE_PREFERRED_PARSER} = $backend;
        $start       = time();
        $xml_default = $xs->XMLin($XMLFile);
        $end         = time();
    }
    print_result($backend, $end, $start, $size);

    $backend = 'XML::Parser';
    my $xml_x_p;
    {
        local $ENV{XML_SIMPLE_PREFERRED_PARSER} = $backend;
        $start   = time();
        $xml_x_p = $xs->XMLin($XMLFile);
        $end     = time();
    }
    print_result($backend, $end, $start, $size);

    $backend = 'XML::SAX::Expat';
    my $xml_x_s_e;
    {
        local $ENV{XML_SIMPLE_PREFERRED_PARSER} = $backend;
        $start     = time();
        $xml_x_s_e = $xs->XMLin($XMLFile);
        $end       = time();
    }
    print_result($backend, $end, $start, $size);

    $backend = 'XML::LibXML::SAX';
    my $xml_x_l_s;
    {
        local $ENV{XML_SIMPLE_PREFERRED_PARSER} = $backend;
        $start     = time();
        $xml_x_l_s = $xs->XMLin($XMLFile);
        $end       = time();
    }
    print_result($backend, $end, $start, $size);

    # Every backend must produce the same data structure as the default
    is_deeply($xml_default, $xml_x_p);
    is_deeply($xml_default, $xml_x_s_e);
    is_deeply($xml_default, $xml_x_l_s);

    sub print_result {
        my ($backend, $end, $start, $size) = @_;
        my $duration = $end - $start;
        print "XML::Simple with $backend backend took ",
            sprintf("%02.4f", $duration), " seconds. ";
        # $size is in bytes and is not divided by 1024 here, so the
        # figure printed is really bytes per second despite the label.
        print "This equates to ", sprintf("%02.4f", $size / ($duration)),
            " kilobytes per second (1024 bytes per k)\n";
    }
Results
    [le6303@itdevtst perl]$ perl xml.pl bigxml.xml
    File bigxml.xml is 1730463 bytes
    XML::Simple with default backend took 12.9769 seconds. This equates to 133349.4084 kilobytes per second (1024 bytes per k)
    XML::Simple with XML::Parser backend took 3.6010 seconds. This equates to 480549.2074 kilobytes per second (1024 bytes per k)
    XML::Simple with XML::SAX::Expat backend took 13.6003 seconds. This equates to 127237.2038 kilobytes per second (1024 bytes per k)
    XML::Simple with XML::LibXML::SAX backend took 6.3547 seconds. This equates to 272310.8906 kilobytes per second (1024 bytes per k)
    ok 1
    ok 2
    ok 3
    1..3
So XML::Parser 'wins', cutting the runtime to about 26% of that of its slowest cousin (3.6 seconds against 13.6 for XML::SAX::Expat). On our platforms here we are actually seeing reductions to around 10% - mission accomplished.
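For anyone who wants to check the arithmetic, here is a small sketch that recomputes throughput and relative runtime from the file size and timings in the run above. The helper names kb_per_sec() and percent_of() are invented for this example; unlike the benchmark script, this version divides by 1024, so its throughput column is in true kilobytes per second.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Figures copied from the benchmark run above.
my $size_bytes = 1_730_463;
my %seconds = (
    'default'          => 12.9769,
    'XML::Parser'      => 3.6010,
    'XML::SAX::Expat'  => 13.6003,
    'XML::LibXML::SAX' => 6.3547,
);

# Hypothetical helpers for this sketch.
sub kb_per_sec {
    my ($bytes, $secs) = @_;
    return $bytes / 1024 / $secs;    # 1024 bytes per kilobyte
}

sub percent_of {
    my ($time, $baseline) = @_;
    return 100 * $time / $baseline;
}

# The slowest backend is the baseline for the percentage column.
my ($slowest) = sort { $b <=> $a } values %seconds;

for my $backend (sort { $seconds{$a} <=> $seconds{$b} } keys %seconds) {
    printf "%-17s %8.1f KB/s  %5.1f%% of slowest runtime\n",
        $backend,
        kb_per_sec($size_bytes, $seconds{$backend}),
        percent_of($seconds{$backend}, $slowest);
}
```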
Analysis
Tracing the code and comparing it to the doco reveals:
- XML::Simple with the default backend and XML::Simple with the XML::SAX::Expat backend are actually the same operation.
- If you do not have XML::SAX installed, XML::Simple with the default backend and XML::Simple with the XML::Parser backend are actually the same operation.
- If you have both XML::SAX and XML::Parser installed, it would seem best to enable XML::Parser as the preferred parser, probably via the envvar. If a developer has a particular need of a SAX parser, he can override it via the package variable $XML::Simple::PREFERRED_PARSER - it has a higher priority than the envvar.
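The two ways of choosing a backend mentioned in the last point can be sketched as follows. This is a minimal example, assuming XML::Simple and XML::Parser are both installed; parse_with() is a hypothetical helper written for illustration, and the inline XML string stands in for a real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple qw(:strict);

# Hypothetical helper: parse $xml with a specific backend.
sub parse_with {
    my ($backend, $xml) = @_;
    # The package variable outranks the environment variable.
    # 'local' restores the previous value when the sub returns.
    local $XML::Simple::PREFERRED_PARSER = $backend;
    my $xs = XML::Simple->new(ForceArray => 0, KeyAttr => {});
    return $xs->XMLin($xml);
}

# Site-wide default: normally set in the shell or the process
# environment rather than in the script itself.
$ENV{XML_SIMPLE_PREFERRED_PARSER} = 'XML::Parser';

# Per-call override, e.g. for code that needs a particular parser.
my $xml  = '<doc><item>1</item></doc>';
my $data = parse_with('XML::Parser', $xml);
print "item = $data->{item}\n";
```

Passing 'XML::LibXML::SAX' or 'XML::SAX::Expat' to parse_with() instead would route the same call through XML::SAX, provided those modules are installed.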
Update 16:53 23 Nov 04 Just for comparison, running XML::Parser alone on the same file in 'Subs' style takes, on average, 2.1 seconds.
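For readers who haven't met the 'Subs' style: XML::Parser calls a sub named after each element when it starts, and the same name with a trailing underscore when it ends, looking both up in the package given by the Pkg option. The sketch below is not the code I timed; the CountHandlers package and the tiny inline document are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical handler package: one sub per element of interest.
package CountHandlers;
our %count;

# Called when an <item> element starts; receives the same arguments
# as a Start handler: the expat object, the tag name, the attributes.
sub item {
    my ($expat, $tag, %attrs) = @_;
    $count{$tag}++;
}

# Called when an <item> element ends.
sub item_ {
    my ($expat, $tag) = @_;
}

package main;

my $parser = XML::Parser->new(Style => 'Subs', Pkg => 'CountHandlers');

# Elements with no matching sub (here <doc>) are simply skipped.
$parser->parse('<doc><item id="a"/><item id="b"/></doc>');

print "items seen: $CountHandlers::count{item}\n";
```

For a real file, parsefile($filename) takes the place of parse(). Note that this style hands you raw events rather than XML::Simple's ready-made hash, which is where the extra speed comes from and also where the extra work goes.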
use brain;