
Fastest way of XML -> perl structure

by sectokia (Acolyte)
on Mar 19, 2017 at 22:16 UTC ( #1185210=perlquestion )
sectokia has asked for the wisdom of the Perl Monks concerning the following question:

Hi wise monks,

What is the fastest way to go from XML to Perl? I have 16MB+ XML files that take many seconds to parse.

I have tried XML::Simple, XML::Fast and XML::Bare, but all are surprisingly slow. Normalised run times are: XML::Simple 1.0, XML::Fast 0.55, XML::Bare 0.4. Even the best of these seems ridiculously slow, with 3GHz machines still taking 10+ seconds.

In comparison, I wrote a dodgy C program that takes the XML and outputs an eval'able Perl literal structure of nested arrays/hashes. Running the program and eval'ing the output is nearly 3x faster than XML::Bare!

However, I feel like I am re-inventing the wheel here (my dodgy program doesn't support attributes), and people must already know a fast way to go from XML to Perl?

My other question is: since eval is where most of the processing time goes, is there some sort of 'direct' memory format for Perl? For example, I would like my C program to output a 'memory blob' of nested arrays/hashes/scalars that would go straight into Perl, without having to 'parse'/'eval' anything.

The structures I want to put in are mostly like this:

{
  'elements' => [
    'element'  => { 'item' => 'value', 'item2' => 'value' },
    'element2' => { 'item' => 'value', 'item2' => 'value' },
  ],
  'elements2' => [
    'element'  => { 'item' => 'value', 'item2' => 'value' },
    'element2' => { 'item' => 'value', 'item2' => 'value' },
  ],
}
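For readers following along, the eval step described above can be sketched roughly as follows. This is only an illustration: the literal text stands in for whatever the C converter prints (that program is not shown in the post), and the variable names are made up.

```perl
use strict;
use warnings;

# Stand-in for the converter's output: a Perl literal as plain text.
my $literal = <<'END';
{
  'elements' => [
    'element'  => { 'item' => 'value', 'item2' => 'value' },
    'element2' => { 'item' => 'value', 'item2' => 'value' },
  ],
}
END

# String eval compiles the literal and returns the resulting reference.
my $data = eval $literal;
die "could not eval converter output: $@" if $@ or !ref $data;

# The arrays hold alternating name/value pairs, so consume them two at a time.
my @pairs = @{ $data->{elements} };
my @names;
while (my ($name, $fields) = splice(@pairs, 0, 2)) {
    push @names, $name;
    print "$name: item=$fields->{item}\n";
}
```

Note that the arrays in this layout are flat name/value lists rather than hashes, which keeps duplicate element names and document order intact.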

Replies are listed 'Best First'.
Re: Fastest way of XML -> perl structure
by Marshall (Abbot) on Mar 19, 2017 at 23:14 UTC
    I don't know much about XML document performance, but maybe you are not using the best module? I found XML::LibXML, whose documentation has an example like this:

    use XML::LibXML;

    # load
    open my $fh, '<', 'file.xml';
    binmode $fh; # drop all PerlIO layers possibly created by a use open pragma
    $doc = XML::LibXML->load_xml(IO => $fh);

    # save
    open my $out, '>', 'out.xml';
    binmode $out; # as above
    $doc->toFH($out);
    # or
    print {$out} $doc->toString();

    If this thing works in binary mode on the file handle and has a C XS implementation in Perl, it will be pretty fast.

    XML::Simple says:

    The use of this module in new code is discouraged. Other modules are available which provide more straightforward and consistent interfaces. In particular, XML::LibXML is highly recommended and XML::Twig is an excellent alternative.
    Update: see following post from Corion++
Re: Fastest way of XML -> perl structure
by Corion (Pope) on Mar 20, 2017 at 07:47 UTC

    If you have an XSD (or whatever structural description of the XML), you should be able to create C code from that XSD which will then in turn create the appropriate Perl structures.

    There is XML::Compile, which creates Perl code from an XSD by creating more or less a top-down parser for it.

    There also is XML::LibXML, but that doesn't give you a ready-to-use Perl structure.
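To illustrate the gap mentioned here, a minimal sketch of walking an XML::LibXML DOM into plain Perl data might look like this. The helper name node_to_perl and the sample XML are made up for illustration, attributes are ignored, and the output uses arrays of name/value pairs throughout rather than the exact hash/array mix from the question.

```perl
use strict;
use warnings;
use XML::LibXML;   # also exports the XML_ELEMENT_NODE constant

# Hypothetical helper: turn one element into nested Perl data.
# Leaf elements become strings; everything else becomes an array
# of name => value pairs, preserving document order.
sub node_to_perl {
    my ($node) = @_;
    my @kids = grep { $_->nodeType == XML_ELEMENT_NODE } $node->childNodes;
    return $node->textContent unless @kids;
    return [ map { ( $_->nodeName => node_to_perl($_) ) } @kids ];
}

my $xml = '<root><elements><element>'
        . '<item>value</item><item2>value</item2>'
        . '</element></elements></root>';

my $doc  = XML::LibXML->load_xml(string => $xml);
my $data = node_to_perl($doc->documentElement);
```

Whether this beats XML::Bare on a 16MB file would need measuring; the walk itself is pure Perl, so only the parsing step benefits from libxml2.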

    A specialized generated C program should be quite fast and certainly faster than XML::LibXML, at least if you approach it with the idea that the incoming XML will strictly conform to your XSD and that any nonconformance will need no memory cleanup as the program is supposed to exit anyway.

Re: Fastest way of XML -> perl structure
by vrk (Chaplain) on Mar 20, 2017 at 15:28 UTC

    Many years ago at university, I wrote a GUI program that had to read, seek and render mzXML files on the fly. These files contained 1-600 MB of data, although some of it was in flattened substructures contained by the main XML structure. XML::Twig saved my bacon. It allowed very fast seeking and navigating the file without having to keep most of it in memory. When you do need to read a subtree into memory, it's trivial.

    However, it's hard to say whether XML::Twig (or XML::LibXML as also suggested) will be of any help without knowing a bit more about the input data. Is there something peculiar about the structure of the XML files, like very deep hierarchies, or hundreds of attributes per element, or something else that would explain the slowness in parsing?

Re: Fastest way of XML -> perl structure
by Discipulus (Monsignor) on Mar 20, 2017 at 17:01 UTC
    Hello sectokia,

    Like other monks, I prefer to use XML::Twig to parse and write back XML.

    Anyway, be advised NOT to use XML::Simple. It is a self-deprecated module: see "XML::Simple needs to go!" for many details and links to alternatives.

    If I can cite brian_d_foy let me add something: "If you need to deal with XML, first we are sorry.."


    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Fastest way of XML -> perl structure
by Preceptor (Deacon) on Mar 20, 2017 at 15:28 UTC

    What are you actually trying to accomplish? As best I can tell, one of the biggest problems with XML parsing is that - because it's tag matched - you end up having to read the whole doc into memory before you can even start.

    If I'm after speed, I often find an incremental parsing approach using XML::Twig works quite well, because you can fire twig handlers and purge the data structure as you go.
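A minimal sketch of that handler-and-purge pattern, assuming records live in elements named element (taken from the structure in the question; the sample XML and the @records array are made up for illustration):

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = <<'END';
<elements>
  <element><item>a</item><item2>b</item2></element>
  <element><item>c</item><item2>d</item2></element>
</elements>
END

my @records;
my $twig = XML::Twig->new(
    twig_handlers => {
        # Fires each time an <element> has been completely parsed.
        element => sub {
            my ($t, $elt) = @_;
            # Copy the child tags and their text into a plain hash.
            push @records, { map { ( $_->tag => $_->text ) } $elt->children };
            $t->purge;   # release everything parsed so far
        },
    },
);

$twig->parse($xml);   # for a file on disk, parsefile() would be used instead
```

Because each subtree is purged once its handler returns, memory use stays roughly proportional to one record rather than to the whole document.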

      > you end up having to read the whole doc into memory before you can even start

      Interesting. How did they contrive XML::SAX or XML::LibXML::Reader ?
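For comparison, a rough sketch of the pull-parsing style that XML::LibXML::Reader offers, which does not need the whole document in memory. The element names and sample XML are assumptions made for illustration:

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

my $xml = '<elements>'
        . '<element><item>a</item></element>'
        . '<element><item>b</item></element>'
        . '</elements>';

my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot create reader";

my @items;
# Advance to each <element> in document order.
while ($reader->nextElement('element')) {
    # Deep-copy just this subtree into a small standalone DOM node.
    my $node = $reader->copyCurrentNode(1);
    push @items, $node->findvalue('item');
}
```

Only the current subtree is materialised as a DOM node, so this works on inputs much larger than available memory.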

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

Node Type: perlquestion [id://1185210]
Approved by beech
Front-paged by stevieb