PerlMonks  

If your XML parsing requirements can be boiled down to "slurp an XML file into a hash", then XML::Simple is very likely all you need. However, many people who start using XML::Simple continue to cling to the module even when their requirements have outgrown it. Most often, it's fear of the unknown that keeps them from 'stepping up' to a more capable module. In this article, I'm going to attempt to dispel some of that fear by comparing XML::LibXML with XML::Simple.

EDIT: Since writing this article, I have subsequently created the much more complete tutorial: Perl XML::LibXML by Example

Installation

EDIT: Things have moved on since I wrote this, and now XML::LibXML is included with most (all?) popular builds of Perl for Windows (ActiveState and Strawberry Perl) and is pre-built and packaged for all major Linux distros. Also, XML::XPath is buggy and no longer maintained, so I don't recommend it.

If you're running Windows, you can get a binary build of XML::LibXML from Randy Kobes' PPM repositories. If you're running Linux then things will be even simpler - just use the package from your distribution (eg: on Debian: apt-get install libxml-libxml-perl).

If for some reason you're unable to install XML::LibXML, but you have XML::Parser, then you might like to install XML::XPath which is a Pure Perl module that implements a very similar API to LibXML but uses XML::Parser for the parsing bit.

Some Sample Data

Let's start with a file that lists the details of books in a (very small) library:

<library>
  <book>
    <title>Perl Best Practices</title>
    <author>Damian Conway</author>
    <isbn>0596001738</isbn>
    <pages>542</pages>
    <image src="http://www.oreilly.com/catalog/covers/perlbp.s.gif"
           width="145" height="190" />
  </book>
  <book>
    <title>Perl Cookbook, Second Edition</title>
    <author>Tom Christiansen</author>
    <author>Nathan Torkington</author>
    <isbn>0596003137</isbn>
    <pages>964</pages>
    <image src="http://www.oreilly.com/catalog/covers/perlckbk2.s.gif"
           width="145" height="190" />
  </book>
  <book>
    <title>Guitar for Dummies</title>
    <author>Mark Phillips</author>
    <author>John Chappell</author>
    <isbn>076455106X</isbn>
    <pages>392</pages>
    <image src="http://media.wiley.com/product_data/coverImage/6X/07645510/076455106X.jpg"
           width="100" height="125" />
  </book>
</library>

A Simple Problem

As a warm-up exercise, let's list the titles of all the books from the XML file. Please assume all the code samples begin as follows:

#!/usr/bin/perl

use strict;
use warnings;

my $filename = 'library.xml';

Here's one solution, using XML::Simple:

use XML::Simple qw(:strict);

my $library = XMLin($filename,
    ForceArray => 1,
    KeyAttr    => {},
);

foreach my $book (@{$library->{book}}) {
    print $book->{title}->[0], "\n"
}

And here's a LibXML solution that works the same way:

use XML::LibXML;

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_file($filename);

foreach my $book ($doc->findnodes('/library/book')) {
    my($title) = $book->findnodes('./title');
    print $title->to_literal, "\n"
}

The '/library/book' argument to findnodes is called an XPath expression. If we substitute a slightly more complex XPath expression, we can factor out one line of code from inside the loop:

foreach my $title ($doc->findnodes('/library/book/title')) {
    print $title->to_literal, "\n"
}

And if it's code brevity we're looking for, we can take things even further (this is Perl after all):

print $_->data . "\n" foreach ($doc->findnodes('//book/title/text()'));

A More Complex Query

Now, let's select a specific book using its ISBN number and list the authors. Using XML::Simple:

use XML::Simple qw(:strict);

my $isbn = '0596003137';

my $library = XMLin($filename,
    ForceArray => [ 'book', 'author' ],
    KeyAttr    => { book => 'isbn' }
);

my $book = $library->{book}->{$isbn};
print "$_\n" foreach (@{$book->{author}});

And with LibXML:

use XML::LibXML;

my $isbn = '0596003137';

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_file($filename);

my $query = "//book[isbn/text() = '$isbn']/author/text()";

print $_->data . "\n" foreach ($doc->findnodes($query));

This time, we've used a more complex XPath expression to identify both the <book> element and the <author> elements within it, in a single step. To understand that XPath expression, let's first consider a simpler one:

  //book[1]

This expression selects the first <book> element among its siblings (in our file, the first book in the library). The [1] is actually a shorthand version of the more general form:

  //book[position() = 1]

Note that XPath positions are numbered from 1 - weird, huh?

As you can see, the square brackets enclose an expression and the XPath query will match all nodes for which the expression evaluates to true. So to return to the XPath query from our last code sample:

  //book[isbn/text() = '0596003137']/author/text()

This will match the text content of any <author> elements within a <book> element which also contains an <isbn> element with the text content '0596003137'. The leading // is kind of a wildcard and will match any number of levels of element nesting. After you've re-read that a few times, it might even start to make sense.
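
To make the predicate forms above concrete, here is a minimal sketch that runs them against a cut-down copy of the sample data. The XML is inlined (and loaded with load_xml rather than parse_file) only so the script is self-contained; the variable names are mine and purely illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# A cut-down copy of the sample library, inlined for illustration only.
my $xml = <<'XML';
<library>
  <book>
    <title>Perl Best Practices</title>
    <author>Damian Conway</author>
    <isbn>0596001738</isbn>
  </book>
  <book>
    <title>Perl Cookbook, Second Edition</title>
    <author>Tom Christiansen</author>
    <author>Nathan Torkington</author>
    <isbn>0596003137</isbn>
  </book>
</library>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# The shorthand [1] and the long form [position() = 1] pick the same book.
my $first      = $doc->findvalue('//book[1]/title');
my $first_long = $doc->findvalue('//book[position() = 1]/title');

# A predicate can also test the content of a child element.
my @authors = map { $_->data }
    $doc->findnodes("//book[isbn/text() = '0596003137']/author/text()");

print "First book: $first\n";
print "Author: $_\n" foreach @authors;
```

Both findvalue calls should return the same title, and the last query should return the two authors of the Perl Cookbook in document order.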

The XML::XPath distribution includes a command-line tool 'xpath' which you can use to test your XPath skills interactively. Here's an example of querying our file to extract the ISBN of any book over 900 pages long:

  xpath -q -e '//book[pages > 900]/isbn/text()' library.xml

To achieve the same thing with XML::Simple, you'd need to iterate over the elements yourself:

use XML::Simple qw(:strict);

my $library = XMLin($filename, ForceArray => [ 'book' ], KeyAttr => {});

foreach my $book (@{$library->{book}}) {
    print $book->{isbn}, "\n" if $book->{pages} > 900;
}
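
For comparison, the same "over 900 pages" query can be run from Perl with XML::LibXML, without the xpath command-line tool. This is only a sketch: the XML is a trimmed, inlined copy of the sample data so the script stands alone, and textContent is used here simply to get a plain string back from each matching <isbn> element.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed copy of the sample data, inlined for illustration only.
my $xml = <<'XML';
<library>
  <book>
    <title>Perl Best Practices</title>
    <isbn>0596001738</isbn>
    <pages>542</pages>
  </book>
  <book>
    <title>Perl Cookbook, Second Edition</title>
    <isbn>0596003137</isbn>
    <pages>964</pages>
  </book>
  <book>
    <title>Guitar for Dummies</title>
    <isbn>076455106X</isbn>
    <pages>392</pages>
  </book>
</library>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# XPath converts the <pages> content to a number for the comparison,
# so the filtering happens inside the query rather than in a Perl loop.
my @isbns = map { $_->textContent }
    $doc->findnodes('//book[pages > 900]/isbn');

print "$_\n" foreach @isbns;
```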

Modifying the XML

One area in which XML::Simple is particularly weak is round-tripping an XML file - reading it, modifying the data and writing it back out as XML.

For this example, we're going to locate the data for the book with ISBN 076455106X and correct its page count from 392 to 394:

use XML::Simple qw(:strict);

my $isbn = '076455106X';

my $xs = XML::Simple->new(
    ForceArray => 1,
    KeyAttr    => { },
    KeepRoot   => 1,
);

my $ref = $xs->XMLin($filename);

my $books = $ref->{library}->[0]->{book};

my($book) = grep($_->{isbn}->[0] eq $isbn, @$books);

$book->{pages}->[0] = '394';

print $xs->XMLout($ref);

In this example I've used a number of tricks to attempt to make the output format resemble the input format as closely as possible:

  • an XML::Simple object was used to ensure the exact same options were used both for input and output
  • the ForceArray option was turned on to ensure that elements didn't get turned into attributes - unfortunately this necessitates the use of the extra ->[0] indexing
  • the KeyAttr option was used to stop arrays being folded into hashes and thus losing the order of the <book> elements - unfortunately this necessitates iterating through the elements rather than indexing directly by ISBN
  • the KeepRoot option was used to ensure the root element name was preserved - unfortunately this introduced an extra level of hash nesting

Even after disabling all the features that make XML::Simple both simple and convenient, the results are not ideal. Although the order of the books was preserved, the order of the child elements within each book was lost.

By contrast, the LibXML code to perform the same update is both simpler and more accurate:

use XML::LibXML;

my $isbn = '076455106X';

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_file($filename);

my $query = "//book[isbn = '$isbn']/pages/text()";

my($node) = $doc->findnodes($query);
$node->setData('394');

print $doc->toString;

Other Operations

If you need to remove an element from an XML document using XML::Simple, you'd simply delete the appropriate hash key. With LibXML, you would call the removeChild method on the element's parent. For example:

my($book) = $doc->findnodes("//book[isbn = '$isbn']");
my $library = $book->parentNode;
$library->removeChild($book);

To add an element with XML::Simple you'd add a new key to the hash. With LibXML, you must first create the new element, add any child elements (such as text content) and add it at the right point in the tree. For example:

my $rating = $doc->createElement('rating');
$rating->appendTextNode('5');
$book->appendChild($rating);

If that looks a bit too complex, there's also a convenience method you can use to add one element with text content in a single step:

$book->appendTextChild('rating', '5');

XML::LibXML also provides a very handy method called parse_balanced_chunk that allows you to create a collection of related DOM nodes from a string containing an XML fragment. You can then add those nodes to your document:

my $fragment = $parser->parse_balanced_chunk(
    '<rating>5</rating><price>32.00</price>'
);
$book->appendChild($fragment);
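
The removal and insertion snippets in this section assume $doc, $parser and $book already exist. As a rough end-to-end sketch, the script below strings them together against a trimmed, inlined copy of the sample data (parsed with parse_string so no file is needed); which book gets removed and which gets the extra elements is an arbitrary choice for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed copy of the sample data, inlined for illustration only.
my $xml = <<'XML';
<library>
  <book>
    <title>Perl Best Practices</title>
    <isbn>0596001738</isbn>
  </book>
  <book>
    <title>Guitar for Dummies</title>
    <isbn>076455106X</isbn>
  </book>
</library>
XML

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_string($xml);

# Remove one book by asking its parent to drop it.
my($unwanted) = $doc->findnodes("//book[isbn = '0596001738']");
$unwanted->parentNode->removeChild($unwanted);

# Add a <rating> element with text content in a single step.
my($book) = $doc->findnodes("//book[isbn = '076455106X']");
$book->appendTextChild('rating', '5');

# Add another element by parsing an XML fragment.
my $fragment = $parser->parse_balanced_chunk('<price>32.00</price>');
$book->appendChild($fragment);

# Query the modified tree to confirm the changes took effect.
my $book_count = $doc->findvalue('count(//book)');
my $rating     = $doc->findvalue('//book/rating');
my $price      = $doc->findvalue('//book/price');

print $doc->toString;
```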

When you call toString to output the XML, you'll find the nodes you've added are not nicely indented as they would be with XML::Simple. This is hardly surprising since such indenting would require extra text nodes and if you don't add them they won't magically appear. In theory, you can call toString(1) to specify you want indents added, but I haven't had any success with that. You can however pipe the output through:

xmllint --format -

The xmllint utility is part of the libxml2 distribution.
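
One possible explanation for toString(1) appearing to do nothing is that libxml2 will not add indentation around elements that already have whitespace text nodes as siblings. If that is the cause, asking the parser to discard blank nodes (the no_blanks parser option) should let toString(1) re-indent the whole document. The following is a sketch under that assumption, using a trimmed, inlined copy of the sample data; I have not checked it against every libxml2 version, and no_blanks is only safe when the whitespace between elements is not significant in your data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed copy of the sample data, inlined for illustration only.
my $xml = <<'XML';
<library>
  <book>
    <title>Guitar for Dummies</title>
    <isbn>076455106X</isbn>
  </book>
</library>
XML

# no_blanks drops whitespace-only text nodes while parsing, so the
# serializer is free to insert its own indentation later.
my $doc = XML::LibXML->load_xml(string => $xml, no_blanks => 1);

my($book) = $doc->findnodes('//book');
$book->appendTextChild('rating', '5');

# A format argument of 1 requests indented output.
my $out = $doc->toString(1);
print $out;
```

If the assumption holds, the new <rating> element should appear on its own line, indented to match its sibling elements.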

Resources

The documentation for XML::LibXML is spread across a number of classes, including:

Zvon.org hosts an XPath Tutorial and an interactive XPath lab.


In reply to Stepping up from XML::Simple to XML::LibXML by grantm
