Stepping up from XML::Simple to XML::LibXML

by grantm (Parson)
on Sep 10, 2005 at 07:20 UTC

If your XML parsing requirements can be boiled down to "slurp an XML file into a hash", then XML::Simple is very likely all you need. However, many people who start using XML::Simple continue to cling to the module even when their requirements have outgrown it. Most often, it's fear of the unknown that keeps them from 'stepping up' to a more capable module. In this article, I'm going to attempt to dispel some of that fear by comparing using XML::LibXML to using XML::Simple.

Installation

EDIT: Things have moved on since I wrote this, and now XML::LibXML is included with most (all?) popular builds of Perl for Windows (ActiveState and Strawberry Perl) and is pre-built and packaged for all major Linux distros. Also, XML::XPath is buggy and no longer maintained so I don't recommend that.

If you're running Windows, you can get a binary build of XML::LibXML from Randy Kobes' PPM repositories. If you're running Linux then things will be even simpler - just use the package from your distribution (eg: on Debian: apt-get install libxml-libxml-perl).

If for some reason you're unable to install XML::LibXML, but you have XML::Parser, then you might like to install XML::XPath which is a Pure Perl module that implements a very similar API to LibXML but uses XML::Parser for the parsing bit.

Some Sample Data

Let's start with a file that lists the details of books in a (very small) library:

  <library>
    <book>
      <title>Perl Best Practices</title>
      <author>Damian Conway</author>
      <isbn>0596001738</isbn>
      <pages>542</pages>
      <image src="http://www.oreilly.com/catalog/covers/perlbp.s.gif"
             width="145" height="190" />
    </book>
    <book>
      <title>Perl Cookbook, Second Edition</title>
      <author>Tom Christiansen</author>
      <author>Nathan Torkington</author>
      <isbn>0596003137</isbn>
      <pages>964</pages>
      <image src="http://www.oreilly.com/catalog/covers/perlckbk2.s.gif"
             width="145" height="190" />
    </book>
    <book>
      <title>Guitar for Dummies</title>
      <author>Mark Phillips</author>
      <author>John Chappell</author>
      <isbn>076455106X</isbn>
      <pages>392</pages>
      <image src="http://media.wiley.com/product_data/coverImage/6X/07645510/076455106X.jpg"
             width="100" height="125" />
    </book>
  </library>

A Simple Problem

As a warm-up exercise, let's list the titles of all the books from the XML file. Please assume all the code samples begin as follows:

  #!/usr/bin/perl

  use strict;
  use warnings;

  my $filename = 'library.xml';

Here's one solution, using XML::Simple:

  use XML::Simple qw(:strict);

  my $library = XMLin($filename,
      ForceArray => 1,
      KeyAttr    => {},
  );

  foreach my $book (@{$library->{book}}) {
      print $book->{title}->[0], "\n"
  }

And here's a LibXML solution that works the same way:

  use XML::LibXML;

  my $parser = XML::LibXML->new();
  my $doc    = $parser->parse_file($filename);

  foreach my $book ($doc->findnodes('/library/book')) {
      my($title) = $book->findnodes('./title');
      print $title->to_literal, "\n"
  }

The '/library/book' argument to findnodes is called an XPath expression. If we substitute a slightly more complex XPath expression, we can factor out one line of code from inside the loop:

  foreach my $title ($doc->findnodes('/library/book/title')) {
      print $title->to_literal, "\n"
  }

And if it's code brevity we're looking for, we can take things even further (this is Perl after all):

  print $_->data . "\n" foreach ($doc->findnodes('//book/title/text()'));

A More Complex Query

Now, let's select a specific book using its ISBN number and list the authors. Using XML::Simple:

  use XML::Simple qw(:strict);

  my $isbn = '0596003137';

  my $library = XMLin($filename,
      ForceArray => [ 'book', 'author' ],
      KeyAttr    => { book => 'isbn' }
  );

  my $book = $library->{book}->{$isbn};
  print "$_\n" foreach(@{$book->{author}});

And with LibXML:

  use XML::LibXML;

  my $isbn = '0596003137';

  my $parser = XML::LibXML->new();
  my $doc    = $parser->parse_file($filename);

  my $query = "//book[isbn/text() = '$isbn']/author/text()";

  print $_->data . "\n" foreach ($doc->findnodes($query));

This time, we've used a more complex XPath expression to identify both the <book> element and the <author> elements within it, in a single step. To understand that XPath expression, let's first consider a simpler one:

  //book[1]

This expression selects every <book> element that is the first <book> child of its parent; in our sample file that is just the first book. The [1] is actually a shorthand version of the more general form:

  //book[position() = 1]

Note that XPath positions are numbered from 1 - weird, huh?

As you can see, the square brackets enclose an expression and the XPath query will match all nodes for which the expression evaluates to true. So to return to the XPath query from our last code sample:

  //book[isbn/text() = '0596003137']/author/text()

This will match the text content of any <author> elements within a <book> element which also contains an <isbn> element with the text content '0596003137'. The leading // is kind of a wildcard and will match any number of levels of element nesting. After you've re-read that a few times, it might even start to make sense.
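To make the predicate forms above concrete, here is a minimal sketch. It assumes a cut-down copy of the sample library inlined as a string (so it runs without library.xml) and uses load_xml, which needs a reasonably recent XML::LibXML; the variable names are only for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Cut-down copy of the sample library, inlined so the sketch is self-contained
my $xml = <<'END_XML';
<library>
  <book>
    <title>Perl Best Practices</title>
    <author>Damian Conway</author>
    <isbn>0596001738</isbn>
  </book>
  <book>
    <title>Perl Cookbook, Second Edition</title>
    <author>Tom Christiansen</author>
    <author>Nathan Torkington</author>
    <isbn>0596003137</isbn>
  </book>
</library>
END_XML

my $doc = XML::LibXML->load_xml(string => $xml);

# [1] and [position() = 1] are two spellings of the same predicate
my ($first)      = $doc->findnodes('//book[1]/title');
my ($first_long) = $doc->findnodes('//book[position() = 1]/title');

# A predicate can also test child content rather than position
my @authors = map { $_->to_literal }
    $doc->findnodes("//book[isbn = '0596003137']/author");

print $first->to_literal, "\n";
print "$_\n" for @authors;
```

Both positional queries should hand back the same <title> node, and the content predicate should pick out only the authors of the matching book.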

The XML::XPath distribution includes a command-line tool 'xpath' which you can use to test your XPath skills interactively. Here's an example of querying our file to extract the ISBN of any book over 900 pages long:

  xpath -q -e '//book[pages > 900]/isbn/text()' library.xml

To achieve the same thing with XML::Simple, you'd need to iterate over the elements yourself:

  my $library = XMLin($filename, ForceArray => [ 'book' ], KeyAttr => {});

  foreach my $book (@{$library->{book}}) {
      print $book->{isbn}, "\n" if $book->{pages} > 900;
  }
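For comparison, the same numeric predicate can be handed straight to findnodes from Perl. This is only a sketch: the book data is a trimmed copy of the sample file inlined as a string, and @isbns is a name chosen for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed copy of the sample data (only the elements the query needs)
my $xml = <<'END_XML';
<library>
  <book><isbn>0596001738</isbn><pages>542</pages></book>
  <book><isbn>0596003137</isbn><pages>964</pages></book>
  <book><isbn>076455106X</isbn><pages>392</pages></book>
</library>
END_XML

my $doc = XML::LibXML->load_xml(string => $xml);

# XPath converts the text of <pages> to a number for the comparison
my @isbns = map { $_->to_literal }
    $doc->findnodes('//book[pages > 900]/isbn');

print "$_\n" for @isbns;
```

The filtering happens inside the XPath engine, so no explicit loop over the books is needed.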

Modifying the XML

One area in which XML::Simple is particularly weak is round-tripping an XML file - reading it, modifying the data and writing it back out as XML.

For this example, we're going to locate the data for the book with ISBN 076455106X and correct its page count from 392 to 394:

  use XML::Simple qw(:strict);

  my $isbn = '076455106X';

  my $xs = XML::Simple->new(
      ForceArray => 1,
      KeyAttr    => { },
      KeepRoot   => 1,
  );

  my $ref = $xs->XMLin($filename);

  my $books = $ref->{library}->[0]->{book};

  my($book) = grep($_->{isbn}->[0] eq $isbn, @$books);

  $book->{pages}->[0] = '394';

  print $xs->XMLout($ref);

In this example I've used a number of tricks to attempt to make the output format resemble the input format as closely as possible:

  • an XML::Simple object was used to ensure the exact same options were used both for input and output
  • the ForceArray option was turned on to ensure that elements didn't get turned into attributes - unfortunately this necessitates the use of the extra ->[0] indexing
  • the KeyAttr option was used to stop arrays being folded into hashes and thus losing the order of the <book> elements - unfortunately this necessitates iterating through the elements rather than indexing directly by ISBN
  • the KeepRoot option was used to ensure the root element name was preserved - unfortunately this introduced an extra level of hash nesting

Even after disabling all the features that make XML::Simple both simple and convenient, the results are not ideal. Although the order of the books was preserved, the order of the child elements within each book was lost.

By contrast, the LibXML code to perform the same update is both simpler and more accurate:

  use XML::LibXML;

  my $isbn = '076455106X';

  my $parser = XML::LibXML->new();
  my $doc    = $parser->parse_file($filename);

  my $query = "//book[isbn = '$isbn']/pages/text()";

  my($node) = $doc->findnodes($query);

  $node->setData('394');

  print $doc->toString;

Other Operations

If you need to remove an element from an XML document using XML::Simple, you'd simply delete the appropriate hash key. With LibXML, you would call the removeChild method on the element's parent. For example:

  my($book) = $doc->findnodes("//book[isbn = '$isbn']");

  my $library = $book->parentNode;

  $library->removeChild($book);

To add an element with XML::Simple you'd add a new key to the hash. With LibXML, you must first create the new element, add any child elements (such as text content) and add it at the right point in the tree. For example:

  my $rating = $doc->createElement('rating');

  $rating->appendTextNode('5');

  $book->appendChild($rating);

If that looks a bit too complex, there's also a convenience method you can use to add one element with text content in a single step:

$book->appendTextChild('rating', '5');

XML::LibXML also provides a very handy method called parse_balanced_chunk that allows you to create a collection of related DOM nodes from a string containing an XML fragment. You can then add those nodes to your document:

  my $fragment = $parser->parse_balanced_chunk(
      '<rating>5</rating><price>32.00</price>'
  );

  $book->appendChild($fragment);

When you call toString to output the XML, you'll find the nodes you've added are not nicely indented as they would be with XML::Simple. This is hardly surprising since such indenting would require extra text nodes and if you don't add them they won't magically appear. In theory, you can call toString(1) to specify you want indents added, but I haven't had any success with that. You can however pipe the output through:

xmllint --format -

The xmllint utility is part of the libxml distribution.
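One possible reason toString(1) seems to do nothing (an assumption on my part, not something verified against every version) is that libxml2 will not re-indent around whitespace text nodes that are already in the tree. The sketch below parses with the no_blanks option so those nodes are dropped, then asks for formatted output; the inlined data and variable names are only for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# A single book from the sample data, inlined for the sketch
my $xml = <<'END_XML';
<library>
  <book>
    <title>Guitar for Dummies</title>
    <isbn>076455106X</isbn>
  </book>
</library>
END_XML

# no_blanks discards whitespace-only text nodes at parse time
my $doc = XML::LibXML->load_xml(string => $xml, no_blanks => 1);

my ($book) = $doc->findnodes('//book');
$book->appendTextChild('rating', '5');

# With no stray whitespace nodes left, format level 1 can add indentation
my $pretty = $doc->toString(1);
print $pretty;
```

If this does not produce indented output with your version of the library, piping through xmllint remains the dependable fallback.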

Resources

The documentation for XML::LibXML is spread across a number of classes, including:

Zvon.org hosts an XPath Tutorial and an interactive XPath lab.

Re: Stepping up from XML::Simple to XML::LibXML
by chovy (Initiate) on Sep 17, 2008 at 22:05 UTC
    The example is an xml fragment, can you show one using a real xml file? One with a default namespace declared... It appears if you have: <library xmlns="http://www.perlmonks.org/xml/example/library"> </library> Then what would the xpath queries look like?
      This is the exact problem I am trying to solve now (namespace defined at library).
      I added:
      my $xc = XML::LibXML::XPathContext->new($doc);
      $xc->registerNs('ns', 'xmlapi_1.0');
      and changed the foreach to:
      foreach my $book ($xc->findnodes('//ns:book')) {

      But no matter what I try, I can't get book's attributes. I know I'm getting the info (print $book->to_literal, "\n";) but how do I access the info individually?

        That problem was addressed in a separate thread. I meant to provide a link from here but then I forgot.

        Aristotle provided a comprehensive answer in this node.
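For readers who land here with the same question, a minimal sketch of the usual approach follows. It assumes the namespace URI from the question above and a hypothetical prefix 'lib' (the prefix is arbitrary; only the URI has to match the document). Attributes are read from the element node with getAttribute.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Sample document with a default namespace, as described in the question
my $xml = <<'END_XML';
<library xmlns="http://www.perlmonks.org/xml/example/library">
  <book>
    <title>Perl Best Practices</title>
    <image src="perlbp.s.gif" width="145" height="190" />
  </book>
</library>
END_XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Bind a prefix of our own choosing to the document's namespace URI
my $xc = XML::LibXML::XPathContext->new($doc);
$xc->registerNs(lib => 'http://www.perlmonks.org/xml/example/library');

my @lines;
foreach my $book ($xc->findnodes('//lib:book')) {
    # Relative queries take the context node as a second argument
    my $title   = $xc->findvalue('lib:title', $book);
    my ($image) = $xc->findnodes('lib:image', $book);
    my $width   = $image->getAttribute('width');
    push @lines, "$title (cover width $width)";
}

print "$_\n" for @lines;
```

Every element step in the XPath expression needs the prefix, because an unprefixed name in XPath 1.0 only matches elements in no namespace.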

Re: Stepping up from XML::Simple to XML::LibXML
by weismat (Friar) on Oct 22, 2009 at 08:49 UTC
    Re^2: XML::Lib XML question contains a discussion in case you need to include namespaces into the XPath query.
    http://pgfearo.googlepages.com/ Sketchpath provides a great editor for XPath - you create a query by clicking on the desired node in the XML tree.
Re: Stepping up from XML::Simple to XML::LibXML
by runrig (Abbot) on Dec 05, 2011 at 23:54 UTC
    If you're running Windows, you can get a binary build of XML::LibXML from Randy Kobes' PPM repositories. If you're running Linux then things will be even simpler - just use the package from your distribution (eg: on Debian: apt-get install libxml-libxml-perl).

    Sadly, with Randy Kobes' passing, the Kobes repository has been rather unmaintained (is there no one at U Winnipeg that can step up?? Please??). Fortunately, Strawberry Perl comes with XML::LibXML, although I'm not sure how easy it is to make updates to the library.

    Update: It looks like there is a version in ActiveState's repo, though kind of old.

      Re: Installing iodbc 0.1 via CPAN (or other means) on Mac OS 10.6

      http://win32.perl.org/wiki/index.php?title=Vanilla_Perl_and_GnuWin32#LibXSLT_install

      http://win32.perl.org/wiki/index.php?title=Vanilla_Perl_and_GnuWin32#LibXML_install

      http://strawberryperl.com/release-notes/5.12.3.0.html says it comes with libxml/libxslt already

      Probably http://www.citrusperl.com/

Re: Stepping up from XML::Simple to XML::LibXML
by Anonymous Monk on Jun 11, 2012 at 19:56 UTC
      $channel_template =<<END;
      <object>
        <object_name>object_1</object_name>
        <object_id>059c6c8f-52b1-4d3d-8023-f5d333334456</object_id>
        <object_admin_state>Enabled</object_admin_state>
        <sub_object>
          <sub_object_id>059c6c8f-52b1-4d3d-8023-f5d318f30a63</sub_object_id>
        </sub_object>
      </object>
      END

      $parser = XML::LibXML->new();
      $parser->keep_blanks(0);
      $doc = $parser->load_xml(string => $channel_template);
      $doc->setEncoding('UTF-8');
      $root = $doc->documentElement();
    This gives me access to add/remove/update elements within and below 'object' I was able to add objects easily with code like:
    $output_element = $doc->createElement('out');
    and then subsequently modify the output element and still maintain the entire XML. Recently, the schema for this object changed and added a wrapper 'object_wrapper' to the XML. Like so:
      <object_wrapper>
        <object>
          <object_name>object_1</object_name>
          <object_id>059c6c8f-52b1-4d3d-8023-f5d333334456</object_id>
          <object_admin_state>Enabled</object_admin_state>
          <sub_object>
            <sub_object_id>059c6c8f-52b1-4d3d-8023-f5d318f30a63</sub_object_id>
          </sub_object>
        </object>
      </object_wrapper>
    Now my documentElement is object_wrapper, and I need it to traverse to object so my logic will remain the same. I've tried setDocumentElement with the 'object' node, like the following, but this eliminates the wrapper element:
      $parser = XML::LibXML->new();
      $parser->keep_blanks(0);
      $doc = $parser->load_xml(string => $channel_template);
      $doc->setEncoding('UTF-8');
      $root = $doc->documentElement();
      my ($object_element) = $root->findnodes('//object');
      $doc->setDocumentElement($object_element);
      $root = $doc->documentElement();
    Any help would be appreciated. Thanks!
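One way this might be approached (a sketch only, not an answer given in the thread) is to leave the document element alone and simply point $root at the inner <object> node. Nodes returned by findnodes are live references into the tree, so changes made through $root still show up when the whole document is serialised. The XML below is a shortened copy of the poster's data.

```perl
use strict;
use warnings;
use XML::LibXML;

# Shortened copy of the wrapped template from the post above
my $channel_template = <<'END_XML';
<object_wrapper>
  <object>
    <object_name>object_1</object_name>
    <object_admin_state>Enabled</object_admin_state>
  </object>
</object_wrapper>
END_XML

my $doc = XML::LibXML->load_xml(string => $channel_template, no_blanks => 1);
$doc->setEncoding('UTF-8');

# Keep object_wrapper as the document element; just work from <object>
my ($root) = $doc->findnodes('/object_wrapper/object');

# Existing logic that adds children to $root can stay as it was
my $output_element = $doc->createElement('out');
$root->appendChild($output_element);

my $result = $doc->toString(1);
print $result;
```

Because setDocumentElement is never called, the wrapper stays in the serialised output.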
Re: Stepping up from XML::Simple to XML::LibXML
by sjf4 (Initiate) on Apr 18, 2013 at 23:04 UTC
    The version of XML::LibXML that's in RHEL5 (1.58) performs badly when performing findnodes on the results of findnodes for large XML files. The latest version from CPAN performs fine.
      use XML::LibXML;

      my $parser = XML::LibXML->new();
      my $doc    = $parser->parse_file($filename);

      foreach my $book ($doc->findnodes('/library/book')) {
          my($title) = $book->findnodes('./title');
          print $title->to_literal, "\n"
      }
    Here is my workaround for RHEL5:
      use XML::LibXML;

      my $parser = XML::LibXML->new();
      my $doc    = $parser->parse_file($filename);

      foreach my $book ($doc->findnodes('/library/book')) {
          my($title) = $book->getChildrenByTagName('title');
          print $title->to_literal, "\n"
      }
    Despite $book->childNodes returning only $book's child nodes, $book->findnodes appears to perform findnodes on the entire contents of $doc.
