Stepping up from XML::Simple to XML::LibXML
by grantm (Parson) on Sep 10, 2005 at 07:20 UTC
If your XML parsing requirements can be boiled down to "slurp an XML file into a hash", then XML::Simple is very likely all you need. However, many people who start using XML::Simple continue to cling to the module even when their requirements have outgrown it. Most often, it's fear of the unknown that keeps them from 'stepping up' to a more capable module. In this article, I'm going to attempt to dispel some of that fear by comparing using XML::LibXML to using XML::Simple.
EDIT: Things have moved on since I wrote this, and now XML::LibXML is included with most (all?) popular builds of Perl for Windows (ActiveState and Strawberry Perl) and is pre-built and packaged for all major Linux distros. Also, XML::XPath is buggy and no longer maintained, so I don't recommend it.
If you're running Windows, you can get a binary build of XML::LibXML from Randy Kobes' PPM repositories. If you're running Linux then things will be even simpler - just use the package from your distribution (eg: on Debian: apt-get install libxml-libxml-perl).
Some Sample Data
Let's start with a file that lists the details of books in a (very small) library:
A Simple Problem
As a warm-up exercise, let's list the titles of all the books from the XML file. Please assume all the code samples begin as follows:
Here's one solution, using XML::Simple:
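A minimal sketch of the XML::Simple approach. The XML is embedded so the script runs on its own, and it is only a stand-in for library.xml: the element names follow the text (with <title> assumed), while the titles and author names are made-up placeholders.

```perl
use strict;
use warnings;
use XML::Simple qw(:strict);

# Stand-in for library.xml; titles and authors are placeholders.
my $xml = <<'XML';
<library>
  <book>
    <title>Example Book A</title>
    <author>Author One</author>
    <author>Author Two</author>
    <isbn>0596003137</isbn>
    <pages>964</pages>
  </book>
  <book>
    <title>Example Book B</title>
    <author>Author Three</author>
    <isbn>076455106X</isbn>
    <pages>392</pages>
  </book>
</library>
XML

# Strict mode requires ForceArray and KeyAttr to be given explicitly.
# ForceArray => ['book'] guarantees an array even if there is only one book.
my $library = XMLin($xml, ForceArray => ['book'], KeyAttr => {});

# Each <book> is a hash; a single <title> collapses to a plain string.
my @titles = map { $_->{title} } @{ $library->{book} };
print "$_\n" for @titles;
```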
And here's a LibXML solution that works the same way:
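A comparable sketch with XML::LibXML, again using embedded stand-in data with placeholder titles and authors. It uses load_xml with a string for convenience; reading from a file instead would be an equally valid choice.

```perl
use strict;
use warnings;
use XML::LibXML;

# Stand-in for library.xml; titles and authors are placeholders.
my $xml = <<'XML';
<library>
  <book>
    <title>Example Book A</title>
    <author>Author One</author>
    <isbn>0596003137</isbn>
    <pages>964</pages>
  </book>
  <book>
    <title>Example Book B</title>
    <author>Author Three</author>
    <isbn>076455106X</isbn>
    <pages>392</pages>
  </book>
</library>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

my @titles;
foreach my $book ($doc->findnodes('/library/book')) {
    # Calling findnodes on a node evaluates the path relative to that node.
    my ($title) = $book->findnodes('./title');
    push @titles, $title->textContent;
}
print "$_\n" for @titles;
```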
The '/library/book' argument to findnodes is called an XPath expression. If we substitute a slightly more complex XPath expression, we can factor out one line of code from inside the loop:
And if it's code brevity we're looking for, we can take things even further (this is Perl after all):
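Both refinements can be sketched as follows, assuming the same kind of stand-in data (placeholder titles). The first loop lets the XPath expression reach down to <title>; the second selects the text nodes directly and maps over them.

```perl
use strict;
use warnings;
use XML::LibXML;

# Stand-in for library.xml; titles are placeholders.
my $xml = <<'XML';
<library>
  <book><title>Example Book A</title><isbn>0596003137</isbn></book>
  <book><title>Example Book B</title><isbn>076455106X</isbn></book>
</library>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# The path now ends at <title>, so the loop body is a single statement.
my @titles;
foreach my $title ($doc->findnodes('/library/book/title')) {
    push @titles, $title->textContent;
}

# Terser still: select the text nodes themselves and read their data.
my @terse = map { $_->data } $doc->findnodes('/library/book/title/text()');
print "$_\n" for @terse;
```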
A More Complex Query
Now, let's select a specific book using its ISBN number and list the authors. Using XML::Simple:
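One way this might look with XML::Simple, assuming stand-in data with placeholder author names. The books are scanned by hand until the wanted ISBN turns up; folding the books into a hash with KeyAttr would be an alternative.

```perl
use strict;
use warnings;
use XML::Simple qw(:strict);

# Stand-in for library.xml; titles and authors are placeholders.
my $xml = <<'XML';
<library>
  <book>
    <title>Example Book A</title>
    <author>Author One</author>
    <author>Author Two</author>
    <isbn>0596003137</isbn>
  </book>
  <book>
    <title>Example Book B</title>
    <author>Author Three</author>
    <isbn>076455106X</isbn>
  </book>
</library>
XML

# Force <author> into an array too, since a book may have one or many.
my $library = XMLin($xml, ForceArray => [ 'book', 'author' ], KeyAttr => {});

my @authors;
foreach my $book (@{ $library->{book} }) {
    next unless $book->{isbn} eq '0596003137';
    push @authors, @{ $book->{author} };
}
print "$_\n" for @authors;
```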
And with LibXML:
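A sketch of the LibXML version, using the XPath expression discussed in the text against stand-in data (author names are placeholders). The selection of the book and of its authors happens in the one expression.

```perl
use strict;
use warnings;
use XML::LibXML;

# Stand-in for library.xml; titles and authors are placeholders.
my $xml = <<'XML';
<library>
  <book>
    <title>Example Book A</title>
    <author>Author One</author>
    <author>Author Two</author>
    <isbn>0596003137</isbn>
  </book>
  <book>
    <title>Example Book B</title>
    <author>Author Three</author>
    <isbn>076455106X</isbn>
  </book>
</library>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Text nodes of every <author> inside the <book> with the matching <isbn>.
my @authors = map { $_->data }
    $doc->findnodes(q{//book[isbn/text() = '0596003137']/author/text()});
print "$_\n" for @authors;
```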
This time, we've used a more complex XPath expression to identify both the <book> element and the <author> elements within it, in a single step. To understand that XPath expression, let's first consider a simpler one:
//book[1]
This expression selects the first in a sequence of consecutive <book> elements. The [1] is actually a shorthand version of the more general form:
//book[position() = 1]
Note that XPath positions are numbered from 1 - weird, huh?
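To check that the short form //book[1] and the long form //book[position() = 1] behave the same, a small sketch like this can help. The data is a stand-in with placeholder titles, and findvalue is used simply to get the string value of the match.

```perl
use strict;
use warnings;
use XML::LibXML;

# Stand-in for library.xml; titles are placeholders.
my $xml = <<'XML';
<library>
  <book><title>Example Book A</title></book>
  <book><title>Example Book B</title></book>
</library>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Positions count from 1, so both forms pick the first <book>.
my $short = $doc->findvalue('//book[1]/title');
my $long  = $doc->findvalue('//book[position() = 1]/title');

print "$short\n" if $short eq $long;
```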
As you can see, the square brackets enclose an expression and the XPath query will match all nodes for which the expression evaluates to true. So to return to the XPath query from our last code sample:
//book[isbn/text() = '0596003137']/author/text()
This will match the text content of any <author> elements within a <book> element which also contains an <isbn> element with the text content '0596003137'. The leading // is kind of a wildcard and will match any number of levels of element nesting. After you've re-read that a few times, it might even start to make sense.
The XML::XPath distribution includes a command-line tool 'xpath' which you can use to test your XPath skills interactively. Here's an example of querying our file to extract the ISBN of any book over 900 pages long:
xpath -q -e '//book[pages > 900]/isbn/text()' library.xml
To achieve the same thing with XML::Simple, you'd need to iterate over the elements yourself:
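A sketch of that manual iteration, assuming stand-in data with placeholder titles. The page-count test that XPath expressed as a predicate becomes an ordinary Perl condition inside the loop.

```perl
use strict;
use warnings;
use XML::Simple qw(:strict);

# Stand-in for library.xml; titles are placeholders.
my $xml = <<'XML';
<library>
  <book>
    <title>Example Book A</title>
    <isbn>0596003137</isbn>
    <pages>964</pages>
  </book>
  <book>
    <title>Example Book B</title>
    <isbn>076455106X</isbn>
    <pages>392</pages>
  </book>
</library>
XML

my $library = XMLin($xml, ForceArray => ['book'], KeyAttr => {});

# Walk every book and keep the ISBN of those over 900 pages.
my @isbns;
foreach my $book (@{ $library->{book} }) {
    push @isbns, $book->{isbn} if $book->{pages} > 900;
}
print "$_\n" for @isbns;
```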
Modifying the XML
One area in which XML::Simple is particularly weak is round-tripping an XML file - reading it, modifying the data and writing it back out as XML.
For this example, we're going to locate the data for the book with ISBN 076455106X and correct its page count from 392 to 394:
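A rough sketch of such a round trip with XML::Simple, under assumptions of my own: stand-in data with placeholder titles and authors, ForceArray => 1 so that every value is written back as a child element rather than an attribute, and RootName passed to XMLout to restore the root element. The option set in the original listing may well have differed.

```perl
use strict;
use warnings;
use XML::Simple;

# Stand-in for library.xml; titles and authors are placeholders.
my $xml = <<'XML';
<library>
  <book>
    <title>Example Book A</title>
    <author>Author One</author>
    <isbn>0596003137</isbn>
    <pages>964</pages>
  </book>
  <book>
    <title>Example Book B</title>
    <author>Author Three</author>
    <isbn>076455106X</isbn>
    <pages>392</pages>
  </book>
</library>
XML

# KeyAttr => {} switches off array folding in both directions.
my $xs = XML::Simple->new(ForceArray => 1, KeyAttr => {});

my $library = $xs->XMLin($xml);
foreach my $book (@{ $library->{book} }) {
    # With ForceArray => 1 every element, even <isbn>, is a one-item array.
    next unless $book->{isbn}[0] eq '076455106X';
    $book->{pages}[0] = 394;
}

# The root element is discarded on input, so name it again on output.
my $out = $xs->XMLout($library, RootName => 'library');
print $out;
```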
In this example I've used a number of tricks to attempt to make the output format resemble the input format as closely as possible:
Even after disabling all the features that make XML::Simple both simple and convenient, the results are not ideal. Although the order of the books was preserved, the order of the child elements within each book was lost.
By contrast, the LibXML code to perform the same update is both simpler and more accurate:
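A sketch of the LibXML update, assuming stand-in data with placeholder titles and authors. Only the one text node is changed, so everything else in the document, including the order of child elements, is written back as it was read.

```perl
use strict;
use warnings;
use XML::LibXML;

# Stand-in for library.xml; titles and authors are placeholders.
my $xml = <<'XML';
<library>
  <book>
    <title>Example Book B</title>
    <author>Author Three</author>
    <isbn>076455106X</isbn>
    <pages>392</pages>
  </book>
</library>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Locate the text node inside <pages> for the wanted book and replace it.
my ($pages) = $doc->findnodes(q{//book[isbn = '076455106X']/pages/text()});
$pages->setData('394');

my $out = $doc->toString;
print $out;
```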
If you need to remove an element from an XML document using XML::Simple, you'd simply delete the appropriate hash key. With LibXML, you would call the removeChild method on the element's parent. For example:
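A minimal sketch of removing an element with LibXML, using a cut-down stand-in document. The node is found with XPath and then handed to its parent's removeChild method.

```perl
use strict;
use warnings;
use XML::LibXML;

# Cut-down stand-in for library.xml.
my $xml = <<'XML';
<library>
  <book>
    <isbn>076455106X</isbn>
    <pages>392</pages>
  </book>
</library>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Find the <pages> element, then ask its parent to drop it.
my ($pages) = $doc->findnodes(q{//book[isbn = '076455106X']/pages});
$pages->parentNode->removeChild($pages);

my $out = $doc->toString;
print $out;
```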
To add an element with XML::Simple you'd add a new key to the hash. With LibXML, you must first create the new element, add any child elements (such as text content) and add it at the right point in the tree. For example:
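The three steps might be sketched like this. The <publisher> element and its value are hypothetical, chosen only for illustration, and insertBefore is just one way of choosing the position (appendChild would put the element last).

```perl
use strict;
use warnings;
use XML::LibXML;

# Cut-down stand-in for library.xml.
my $xml = <<'XML';
<library>
  <book>
    <isbn>076455106X</isbn>
    <pages>392</pages>
  </book>
</library>
XML

my $doc = XML::LibXML->load_xml(string => $xml);
my ($book)  = $doc->findnodes(q{//book[isbn = '076455106X']});
my ($pages) = $book->findnodes('./pages');

# 1. Create the element (hypothetical name), 2. give it text content,
# 3. place it in the tree, here immediately before <pages>.
my $publisher = $doc->createElement('publisher');
$publisher->appendChild($doc->createTextNode('Example Press'));
$book->insertBefore($publisher, $pages);

my $out = $doc->toString;
print $out;
```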
If that looks a bit too complex, there's also a convenience method you can use to add one element with text content in a single step:
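A sketch using appendTextChild, which creates the element, sets its text and appends it as the last child in one call. The <publisher> element and its value are hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;

# Cut-down stand-in for library.xml.
my $xml = <<'XML';
<library>
  <book>
    <isbn>076455106X</isbn>
  </book>
</library>
XML

my $doc = XML::LibXML->load_xml(string => $xml);
my ($book) = $doc->findnodes(q{//book[isbn = '076455106X']});

# Element name and text content in a single step (hypothetical values).
$book->appendTextChild('publisher', 'Example Press');

my $publisher = $doc->findvalue(q{//book[isbn = '076455106X']/publisher});
print "$publisher\n";
```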
XML::LibXML also provides a very handy method called parse_balanced_chunk that allows you to create a collection of related DOM nodes from a string containing an XML fragment. You can then add those nodes to your document:
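A sketch of parse_balanced_chunk in use, with a cut-down stand-in document and placeholder author names. The method returns a document fragment, and appending the fragment adds each of its child nodes in turn.

```perl
use strict;
use warnings;
use XML::LibXML;

# Cut-down stand-in for library.xml; author names are placeholders.
my $xml = <<'XML';
<library>
  <book>
    <author>Author Three</author>
    <isbn>076455106X</isbn>
  </book>
</library>
XML

my $parser = XML::LibXML->new;
my $doc    = $parser->parse_string($xml);

# Parse an XML fragment (no single root needed) into a set of DOM nodes.
my $fragment = $parser->parse_balanced_chunk(
    '<author>Author Four</author><author>Author Five</author>'
);

my ($book) = $doc->findnodes(q{//book[isbn = '076455106X']});
$book->appendChild($fragment);

my @authors = map { $_->textContent } $book->findnodes('./author');
print "$_\n" for @authors;
```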
When you call toString to output the XML, you'll find the nodes you've added are not nicely indented as they would be with XML::Simple. This is hardly surprising since such indenting would require extra text nodes and if you don't add them they won't magically appear. In theory, you can call toString(1) to specify you want indents added, but I haven't had any success with that. You can however pipe the output through:
The xmllint utility is part of the libxml2 distribution.
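For what it's worth, toString(1) tends to indent only when the tree holds no whitespace-only text nodes of its own. The sketch below assumes that dropping those nodes at parse time with the no_blanks parser option is acceptable for your data; the <publisher> element is hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;

# Cut-down stand-in for library.xml; the title is a placeholder.
my $xml = <<'XML';
<library>
  <book>
    <title>Example Book B</title>
    <isbn>076455106X</isbn>
  </book>
</library>
XML

# no_blanks discards the whitespace-only text nodes between elements,
# which leaves the serializer free to add its own indentation.
my $doc = XML::LibXML->load_xml(string => $xml, no_blanks => 1);

my ($book) = $doc->findnodes('/library/book');
$book->appendTextChild('publisher', 'Example Press');   # hypothetical element

# A true argument asks libxml2 to format the output.
my $out = $doc->toString(1);
print $out;
```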
The documentation for XML::LibXML is spread across a number of classes, including: