http://www.perlmonks.org?node_id=947987


in reply to Re^3: Is there any XML reader like this?
in thread Is there any XML reader like this?

And that's not even mentioning the fact that XML::LibXML is 20x faster

BTW. Even that factually correct claim only tells half the story. Generate a simple and fairly modest XML file using this:

#! perl -slw
use strict;
$|++;

our $S //= '999';
our $I //= 10;

open O, '>', 'junk.xml';
print O '<servers>';
for my $s ( '0001' .. $S ) {
    printf "\r%s", $s;
    print O "<station$s>";
    print O '<ip>', join('.', unpack 'C4', pack 'N', int( rand 2**32 ) ), '</ip>' for 1 .. $I;
    print O "</station$s>";
};
print O '</servers>';
close O;

Like this:

C:\test>xmlgen -S=9999
9999
C:\test>dir junk.xml
15/01/2012 12:40 2,424,205 junk.xml

Now run XML::Simple & XML::LibXML scripts that parse that file and iterate the contents and time them:

C:\test>xmllib junk.xml
Parsing took 0.290895 seconds
Iteration took 171.657306 seconds
Total took 171.959000 seconds
Check mem:63.6MB

C:\test>xmlsimple junk.xml
Parsing took 38.202000 seconds
Iteration took 0.059186 seconds
Total took 38.262577 seconds
Check mem:142MB

All the time you gained during parsing, you throw away four-fold when accessing the data through the nightmare interface of OO baloney.

And if you double the file size:

C:\test>xmlgen -S=19999
19999
C:\test>dir junk.xml
15/01/2012 12:58 4,868,440 junk.xml

And now LibXML takes about 9 times as long as XML::Simple (and about 4 times as long as it did on the smaller file):

C:\test>xmllib junk.xml
Parsing took 0.560000 seconds
Iteration took 676.238758 seconds
Total took 676.802000 seconds
Check mem:107MB

C:\test>xmlsimple junk.xml
Parsing took 75.078000 seconds
Iteration took 0.124583 seconds
Total took 75.209615 seconds
Check mem:254MB

Increase the file size 10-fold and LibXML will take 100 times longer.

Now look carefully at the split times. XML::Simple's parsing time is slow, but linear with the file size. Its traversal time is extremely fast and also linear.

Conversely, LibXML's parsing time is very fast and linear; but its traversal time is horribly slow and quadratic with the file size.

It is easy to see which one wins in the speed stakes.
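
The linear-versus-quadratic reading of the split times can be checked with a little arithmetic. This is only a sketch using core Perl: the helper name scaling_exponent is made up for illustration, the sizes and timings are copied from the two runs above, and two data points give a rough estimate rather than a proof.

```perl
#! perl
use strict;
use warnings;

# Estimate the exponent k in t ~ n**k from two (size, time) measurements:
#   k = log(t2/t1) / log(n2/n1)
# k near 1 suggests linear growth, k near 2 suggests quadratic growth.
sub scaling_exponent {
    my( $n1, $t1, $n2, $t2 ) = @_;
    return log( $t2 / $t1 ) / log( $n2 / $n1 );
}

# File sizes in bytes, taken from the two dir listings above.
my @sizes = ( 2_424_205, 4_868_440 );

# Timings in seconds for the smaller and the larger file, from the runs above.
my %runs = (
    'LibXML parse'        => [ 0.290895,   0.560000   ],
    'LibXML iterate'      => [ 171.657306, 676.238758 ],
    'XML::Simple parse'   => [ 38.202000,  75.078000  ],
    'XML::Simple iterate' => [ 0.059186,   0.124583   ],
);

for my $name ( sort keys %runs ) {
    my( $t1, $t2 ) = @{ $runs{ $name } };
    printf "%-20s exponent %.2f\n", $name,
        scaling_exponent( $sizes[0], $t1, $sizes[1], $t2 );
}
```

With these figures the LibXML iteration exponent comes out close to 2, and the other three close to 1.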


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Re^5: Is there any XML reader like this? (XML::Simple beats LibXML hands down in the speed stakes!)
by tobyink (Canon) on Jan 15, 2012 at 13:42 UTC

    Not an especially compelling case without posting the source code for the "XML::Simple & XML::LibXML scripts that parse that file and iterate the contents".

      Sorry, they are the same scripts as published earlier in the thread with the addition of a couple of timing points.

      But here ya go. Using LibXML:

      #! perl -slw
      use strict;
      use Data::Dump qw[ pp ];
      use Time::HiRes qw[ time ];
      use XML::LibXML;

      open XML, '<', $ARGV[0] or die $!;

      my $start = time;
      my $root = XML::LibXML->load_xml( IO => \*XML )->documentElement;
      printf "Parsing took %.6f seconds\n", time - $start;

      my $start2 = time;
      for my $station ($root->findnodes('*')) {
          my $x = $station->nodeName;
          for my $ip ( $station->findnodes('ip') ) {
              $x = $ip->textContent;
          }
      }
      printf "Iteration took %.6f seconds\n", time - $start2;
      printf "Total took %.6f seconds\n", time - $start;
      printf 'Check mem:'; <STDIN>;

      And XML::Simple:

      #! perl -slw
      use strict;
      use Data::Dump qw[ pp ];
      use Time::HiRes qw[ time ];
      use XML::Simple;

      open XML, '<', $ARGV[0] or die $!;

      my $start = time;
      my $stations = XMLin( \*XML, ForceArray => [ 'ip'], NoAttr => 1 );
      printf "Parsing took %.6f seconds\n", time - $start;

      my $start2 = time;
      for my $station ( keys %$stations ) {
          my $x = $station;
          for my $ip ( @{ $stations->{ $station }{ip} } ) {
              $x = $ip;
          }
      }
      printf "Iteration took %.6f seconds\n", time - $start2;
      printf "Total took %.6f seconds\n", time - $start;
      printf 'Check mem:'; <STDIN>;

        The LibXML example uses findnodes which is an XPath query. XPath, while extremely powerful, is not necessarily the most speedy solution, and it's not an especially fair comparison to the XML::Simple example. Replacing findnodes calls with getChildrenByTagName (the rest of the code can remain unchanged) speeds up the iteration tenfold. I get:

        [tai@miranda (pts/0) libxml]$ perl orig.pl junk.xml
        Parsing took 0.077047 seconds
        Iteration took 6.021286 seconds
        Total took 6.098525 seconds
        [tai@miranda (pts/0) libxml]$ perl new.pl junk.xml
        Parsing took 0.105245 seconds
        Iteration took 0.631286 seconds
        Total took 0.736719 seconds
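
The findnodes-to-getChildrenByTagName swap described above might look roughly like this. It is a sketch, assuming XML::LibXML is installed: the helper name collect_ips and the inline sample document are made up for illustration, and it is not the new.pl that produced the timings above.

```perl
#! perl
use strict;
use warnings;
use XML::LibXML;

# Walk every child element of the root and collect the text of its <ip>
# children, using direct child access instead of XPath queries.
sub collect_ips {
    my( $root ) = @_;
    my %ips;
    # The special tag name '*' matches child elements of any name.
    for my $station ( $root->getChildrenByTagName( '*' ) ) {
        $ips{ $station->nodeName } =
            [ map { $_->textContent } $station->getChildrenByTagName( 'ip' ) ];
    }
    return \%ips;
}

# Tiny inline sample in the same shape as junk.xml (addresses are made up).
my $xml = '<servers>'
        . '<station0001><ip>10.0.0.1</ip><ip>10.0.0.2</ip></station0001>'
        . '<station0002><ip>10.0.0.3</ip></station0002>'
        . '</servers>';

my $root = XML::LibXML->load_xml( string => $xml )->documentElement;
my $ips  = collect_ips( $root );
print "$_: @{ $ips->{ $_ } }\n" for sort keys %$ips;
```

Unlike findnodes, getChildrenByTagName only looks at direct children, which is all this file layout needs.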
        
Re^5: Is there any XML reader like this? (XML::Simple beats LibXML hands down in the speed stakes!)
by ikegami (Patriarch) on Jan 16, 2012 at 07:58 UTC

    It is easy to see which one wins in the speed stakes.

    Yeah, LibXML. My tests *included* the time it took to extract the data from the tree. The test was done with real world data of various size from three different providers.

    We use XML::Bare with a thin layer to compensate for its awful interface (XML::Simple without ForceArray or any other option), its expectation of getting decoded text, and its lack of namespace support. It's slightly faster when you factor in the time it takes to extract data. It's not nearly as capable as libxml, and we had to create an interface just to be able to use it.

      Yeah, LibXML. My tests *included* the time it took to extract the data from the tree.

      Hm. So did mine. But I believe mine.

      We use XML::Bare with a thin layer to compensate for its awful interface (XML::Simple without ForceArray or any other option)

      Hm. XML::Bare::forcearray( [noderef] )

      S'funny init. It took less than a minute to disprove that. And after 5 minutes, I'm pretty sure I could use XML::Bare to read a file and get access to its content.

      Conversely, when I tried to look up getDocumentElement, I completely crapped out after about an hour. You applied it to the return from load_xml() which is labelled $dom. So look in DOM. Nada. Maybe a Node. Nada. How about a parser, or a nodelist or a namespace? Nada, nada, nada!

      Your idea of an "awful interface" is weird.

      For me:

      • The best interface is the one I don't have to look up more than once.

        That means small.

      • The second best interface is one that makes it easy to look up what I need to know.

        That means the first page shows me enough to get something working.

        Details, refinements and esoterica can be deferred to secondary pages if that cannot be avoided.

      • The third best interface is one that if it has to be large, is logically grouped.

        That means it starts by splitting the documentation along vertical lines. I.e. the way people need to use the interface. E.g. read an XML; or write an XML; or edit an XML, etc. Not horizontally according to some arbitrary way the author decided to structure his code.

        And it means starting with the basics in the root document, in the form of simple -- but complete -- worked examples of the main modes of use. And leaving the esoteric details for (preferably linked (and links that actually work)) secondary pages.

        Not hitting the user in the face with a top level synopsis that contains every possible variation of the constructor and no indication of where to go from there.

      XML::LibXML fails on every count.

      Can we stop now, because we are once again doing nothing to help the OP; nor each other.


        I really don't see how the documentation for XML::LibXML could be especially difficult to follow for anyone who is vaguely familiar with Perl OO programming. Just follow the usual rule:

        If $obj->isa($class) then consult perldoc $class.

        The documentation for the parser says of load_xml that "the return value [...] is a XML::LibXML::Document object". So you turn to the XML::LibXML::Document documentation.

        The method is called documentElement. It's shown in the SYNOPSIS for XML::LibXML::Document, and documented further down in the METHODS section.

        getDocumentElement is just an alias for documentElement, so it is documented much less prominently, and I can understand how that could have been harder to find. But most clients that you'd view documentation in (e.g. browser, "man", "perldoc") let you search for strings quite easily.
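
The class-lookup rule above can be tried directly on the object that load_xml returns. This is a minimal sketch, assuming XML::LibXML is installed; the sample document string is made up for illustration.

```perl
#! perl
use strict;
use warnings;
use XML::LibXML;

# Parse a tiny made-up document and ask the returned object what it is.
my $dom   = XML::LibXML->load_xml( string => '<servers><station0001/></servers>' );
my $class = ref $dom;

# The class name is the page to read: perldoc <class>.
print "consult: perldoc $class\n";

# Both the documented method and its alias can be probed with can().
for my $method ( qw( documentElement getDocumentElement ) ) {
    printf "%-20s %s\n", $method, $dom->can( $method ) ? 'available' : 'not found';
}

print 'root element: ', $dom->documentElement->nodeName, "\n";
```

Printing ref($obj) before reaching for the documentation removes the guesswork about which page to open.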

        But anyway, some of your statements on XML::LibXML reveal what I think is a fundamental difference between what you want to do with XML, and what XML::LibXML is designed for.

        You just want to get data out of XML and handle it as some sort of native data structure. XML::LibXML is for people who want to keep their data in as XMLish a form as possible (short of loading it into memory as a single XML formatted string and manipulating it with regexps!) - for people who care (not just at loading and saving time) about the difference between:

        <html>
          <head><title>Foo</title></head>
        </html>

        and:

        <html>
          <head title="Foo" />
        </html>

        For people who, given a node $x in some deeply nested data structure, sometimes need to do $x->parentNode.

        If you don't need to do that sort of stuff, then perhaps there's a mismatch between your needs and XML::LibXML's aims.
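
As a rough illustration of the kind of distinction meant here, the two documents above can be told apart, and a node's parent reached, with a few XML::LibXML calls. This is a sketch that assumes XML::LibXML is installed; the helper name title_location is made up for illustration.

```perl
#! perl
use strict;
use warnings;
use XML::LibXML;

# Report where the title lives: as an attribute on <head>, or as a
# <title> child element whose parent is <head>.
sub title_location {
    my( $xml ) = @_;
    my $root    = XML::LibXML->load_xml( string => $xml )->documentElement;
    my( $head ) = $root->getChildrenByTagName( 'head' );
    return 'none' unless $head;
    return 'attribute' if $head->hasAttribute( 'title' );

    my( $title ) = $head->getChildrenByTagName( 'title' );
    # parentNode walks back up the tree from the node we found.
    return 'element' if $title && $title->parentNode->nodeName eq 'head';
    return 'none';
}

print title_location( '<html> <head><title>Foo</title></head> </html>' ), "\n";
print title_location( '<html> <head title="Foo" /> </html>' ), "\n";
```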

        I can tell you that XML::Simple would have been quite useless for something like XML::Atom::OWL.

        Your idea of an "awful interface" is weird.

        Usable, readable. If I have to use a gazillion defined and ref checks to extract one value, it's not usable, given that one function call is all that's needed.

        I've used XML::Simple and XML::LibXML, yet only the former gives me trouble.

        Can we stop now, because we are once again doing nothing to help the OP; nor each other.

        Then stop using the word "you" and stick to the subject.