
LibXML, Namespaces, xpath expression - This should be simple.

by alittlebitdifferent (Initiate)
on Sep 03, 2011 at 14:54 UTC ( #923996=perlquestion )
alittlebitdifferent has asked for the wisdom of the Perl Monks concerning the following question:

I have been trying unsuccessfully to extract text from inside a fairly simple XML document that includes namespaces, using LibXML.

Previously I had been using XML::XPath and it worked a treat, but the parsing was just too slow. After I wrote to the author (Matt Sergeant), he suggested that I use LibXML as, in his words, it was "much much faster".

However, my XPath expressions no longer seemed to work. I solved part of this problem by declaring a namespace but have been unable to craft the XPath expression needed to get the text I am after.

I have simplified the problem below maintaining all the important structural elements and am hoping someone might be able to throw me a bone as to what I am doing wrong.

The example XML and the source code I am using are below.

I would appreciate any thoughts at this stage, as Mr Google and I have come up empty-handed.

Regards

Steddy.


===========================
<?xml version="1.0" encoding="UTF-8"?>
<RootTag SchemaVersion="1.1"
         xmlns="http://www.wow.com/BlahML"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.wow.com/BlahML BlahML1_1.xsd">
  <MyTagA Tag="XXX">Some Text A</MyTagA>
  <MyTagB Tag="XXX">Some Text A</MyTagB>
  <MiddleTag xsi:type="foo" magicvalue="wow">
    <MyImportantNode> xsi:type="foo" GeneralID="Random1" >
      <CannotGetTagA> YYYYYYYYYYY </CannotGetTagA>
      <CannotGetTagB> ZZZZZZZZZZZ </CannotGetTagB>
    </MyImportantNode>
    <MyImportantNode> xsi:type="foo" GeneralID="Random2" >
      <CannotGetTagA> YYYYYYYYYY22 </CannotGetTagA>
      <CannotGetTagB> ZZZZZZZZZZ22 </CannotGetTagB>
    </MyImportantNode>
    <MyImportantNode> xsi:type="foo" GeneralID="Random3" >
      <CannotGetTagA> YYYYYYYYY333 </CannotGetTagA>
      <CannotGetTagB> ZZZZZZZZZ333 </CannotGetTagB>
    </MyImportantNode>
  </MiddleTag>
</RootTag>
===========================
use strict;
use XML::LibXML;

my $version = "v0.1 - Dances with XML";

my $parser = XML::LibXML->new();
$parser->recover_silently(1);

my $doc = $parser->parse_file('test.xml'); #<-- The above XML
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(theNS => 'http://www.wow.com/BlahML');

my @MiddleNodeBits = $xpc->findnodes('//theNS:MiddleTag');

foreach my $nodeybits (@MiddleNodeBits) {

    # [1] Works but only gives me "Random1". Expected Random1,2 and 3
    print $nodeybits->findnodes('//@GeneralID')->string_value . "\n";

    # [2] Works...but gives me the entire XML once. Not very useful.
    print $nodeybits->findnodes('//*')->string_value . "\n";

    ##############################################
    # No output from lines below-Not sure why?
    ##############################################

    # [3]
    print $nodeybits->findnodes('//CannotGetTagA')->string_value . "\n";

    # [4]
    print $nodeybits->findnodes('/RootTag/MiddleTag[1]/MyImportantNode[1]/CannotGetTagA[1]/text()')->string_value . "\n";

    # [5]
    print $nodeybits->findnodes('theNS:RootTag/MiddleTag[1]/MyImportantNode[1]/CannotGetTagA[1]/text()')->string_value . "\n";

    # [6]
    print $nodeybits->findnodes('//RootTag/MiddleTag[1]/MyImportantNode/CannotGetTagA/text()')->string_value . "\n";
}

exit(0);

Re: LibXML, Namespaces, xpath expression - This should be simple.
by ikegami (Pope) on Sep 03, 2011 at 15:39 UTC

    First, there's an error in the XML.

    <MyImportantNode> xsi:type="foo" GeneralID="Random1" >
    should be
    <MyImportantNode xsi:type="foo" GeneralID="Random1" >

    (Same goes for the other two.)

    Five of the six paths you tried in the loop start with "/". That means they start looking at the root of the tree. That instantly makes them wrong, since you obviously want something that's in or below $nodeybits.
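
The difference between a path that starts with "/" and one that is relative to the context node can be seen in a small sketch. The document here is a hypothetical miniature without namespaces (the element names `root`, `group` and `item` are made up for illustration and are not from the original post):

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical miniature document, not the poster's file.
my $doc = XML::LibXML->load_xml(string => <<'XML');
<root>
  <group id="g1"><item>a</item><item>b</item></group>
  <group id="g2"><item>c</item></group>
</root>
XML

# Pick the second group as the context node.
my ($second_group) = $doc->findnodes('/root/group[@id="g2"]');

# '//item' starts at the document root, so the context node is ignored.
my @from_root = $second_group->findnodes('//item');

# './/item' only searches below the context node.
my @from_node = $second_group->findnodes('.//item');

printf "absolute: %d, relative: %d\n", scalar @from_root, scalar @from_node;
# prints "absolute: 3, relative: 1"
```

Dropping the leading "/" (or using ".//" when descendants at any depth are wanted) keeps the search inside the node being looped over.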

    Another major problem is that you look for the nodes in the wrong namespace throughout the body of the loop. You don't even use the xpc!


    <?xml version="1.0" encoding="UTF-8"?>
    <RootTag SchemaVersion="1.1"
             xmlns="http://www.wow.com/BlahML"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.wow.com/BlahML BlahML1_1.xsd">
      <MyTagA Tag="XXX">Some Text A</MyTagA>
      <MyTagB Tag="XXX">Some Text A</MyTagB>
      <MiddleTag xsi:type="foo" magicvalue="wow">
        <MyImportantNode xsi:type="foo" GeneralID="Random1" >
          <CannotGetTagA>YYYYYYYYYYY</CannotGetTagA>
          <CannotGetTagB>ZZZZZZZZZZZ</CannotGetTagB>
        </MyImportantNode>
        <MyImportantNode xsi:type="foo" GeneralID="Random2" >
          <CannotGetTagA>YYYYYYYYYY22</CannotGetTagA>
          <CannotGetTagB>ZZZZZZZZZZ22</CannotGetTagB>
        </MyImportantNode>
        <MyImportantNode xsi:type="foo" GeneralID="Random3" >
          <CannotGetTagA>YYYYYYYYY333</CannotGetTagA>
          <CannotGetTagB>ZZZZZZZZZ333</CannotGetTagB>
        </MyImportantNode>
      </MiddleTag>
    </RootTag>

    use strict;
    use warnings;
    use feature qw( say );

    use XML::LibXML;

    my $parser = XML::LibXML->new();
    my $doc = $parser->parse_file('test.xml');
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(theNS => 'http://www.wow.com/BlahML');

    for my $important_node ( $xpc->findnodes('//theNS:MiddleTag/theNS:MyImportantNode') ) {
        say $important_node->getAttribute('GeneralID');

        for my $cannot_get_a_node ( $xpc->findnodes('theNS:CannotGetTagA', $important_node) ) {
            say $cannot_get_a_node->textContent();
        }
    }
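
The namespace point can also be isolated in a smaller sketch. In XPath 1.0 an unprefixed element name only matches elements in no namespace, so a default `xmlns="..."` declaration in the document is not picked up by the expression. The namespace URI `http://example.com/ns` and the prefix `n` below are placeholders chosen for this illustration:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical document whose elements all live in a default namespace.
my $doc = XML::LibXML->load_xml(string => <<'XML');
<root xmlns="http://example.com/ns"><child>text</child></root>
XML

my $xpc = XML::LibXML::XPathContext->new($doc);

# The prefix is arbitrary; only the URI has to match the document.
$xpc->registerNs(n => 'http://example.com/ns');

# Unprefixed name: looks for <child> in no namespace, finds nothing.
my @unprefixed = $xpc->findnodes('//child');

# Prefixed name: resolved through the registered namespace, finds the element.
my @prefixed = $xpc->findnodes('//n:child');

printf "unprefixed: %d, prefixed: %d\n", scalar @unprefixed, scalar @prefixed;
# prints "unprefixed: 0, prefixed: 1"
```

Every step of a path needs the prefix, which is why `//theNS:MiddleTag/theNS:MyImportantNode` repeats it.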
Re: LibXML, Namespaces, xpath expression - This should be simple.
by FalseVinylShrub (Chaplain) on Sep 03, 2011 at 15:59 UTC

    Hi

    There was something wrong with your example XML: the MyImportantNode elements were closed before defining the attributes.

    Assuming that was an error in posting, I think that your confusion has a number of causes:

    In 1 it comes from the way you are chaining findnodes and string_value: called on the node list that findnodes returns, string_value only gives you the value of the first node. If you replace it with:

    say "[1a]" . $nodeybits->findvalue('//@GeneralID');

    you'll get Random1Random2Random3. If you want to deal with each value, you'll need a loop:

    say "[1b]" . $_->findvalue('.') foreach $nodeybits->findnodes('*/@GeneralID');
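
The three ways of reading the values can be compared side by side in a rough sketch. The document is a made-up miniature (element `e` with an `id` attribute), not the poster's XML, and the attribute has no namespace so no XPathContext is needed here:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical document with three attributes to collect.
my $doc = XML::LibXML->load_xml(
    string => '<list><e id="A"/><e id="B"/><e id="C"/></list>'
);

# Chained call: findnodes returns a node list here, and string_value
# on a node list reports only its first node.
my $first = $doc->findnodes('//@id')->string_value;

# findvalue concatenates the string values of all matched nodes.
my $all = $doc->findvalue('//@id');

# In list context findnodes returns the nodes, so each can be handled separately.
my @each = map { $_->value } $doc->findnodes('//@id');

print "first: $first\n";    # first: A
print "all:   $all\n";      # all:   ABC
print "each:  @each\n";     # each:  A B C
```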

    Having said that, it's a bit confusing that you're using nodeybits (the MiddleTag node) but then running XPath expressions beginning with "//", which will start at the root.

    2 does what I'd expect, given the above: the first tag that matches that will be the root element, and calling string_value on that will return the entire text of the file.

    The remaining ones all fail because you're calling findnodes on $nodeybits. This is not an XPathContext object, so it knows nothing about the theNS prefix you registered. For the examples you've given, you could use $xpc. But presumably this is cut down from a bigger program where you're doing XPath relative to the node you're looking at in the loop.

    EDIT see ikegami's reply

    my $inner_xpc = XML::LibXML::XPathContext->new($nodeybits);
    $inner_xpc->registerNs( theNS => 'http://www.wow.com/BlahML');

    # [3]
    print $inner_xpc->findnodes('//theNS:CannotGetTagA')->string_value . "\n";

    Worked for me and you could use $inner_xpc for your remaining queries, being careful to be consistent about using the namespace prefix (in 5, you have theNS:RootTag but miss it off for all the other element names).

    HTH

    FalseVinylShrub

    Disclaimer: Please review and test code, and use at your own risk... If I answer a question, I would like to hear if and how you solved your problem.

      Creating a new xpc for every findnodes is not the way to go.

      my $inner_xpc = XML::LibXML::XPathContext->new($nodeybits);
      $inner_xpc->registerNs( theNS => 'http://www.wow.com/BlahML');
      $inner_xpc->findnodes(...)

      can be written as

      $xpc->findnodes(..., $nodeybits)
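
A minimal sketch of that pattern, with one XPathContext created up front and the loop node passed as the second argument to findnodes. The document, the namespace URI `http://example.com/ns` and the prefix `n` are placeholders for illustration only:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical namespaced document, not the poster's file.
my $doc = XML::LibXML->load_xml(string => <<'XML');
<root xmlns="http://example.com/ns">
  <group id="g1"><item>a</item><item>b</item></group>
  <group id="g2"><item>c</item></group>
</root>
XML

# One context, one namespace registration, reused for every query.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(n => 'http://example.com/ns');

my @lines;
for my $group ( $xpc->findnodes('//n:group') ) {
    # The second argument makes $group the context node for this query.
    my @items = $xpc->findnodes('n:item', $group);
    push @lines, $group->getAttribute('id') . ': '
               . join(',', map { $_->textContent } @items);
}

print "$_\n" for @lines;
# prints "g1: a,b" then "g2: c"
```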
        Hi All, thank you so much for your help with this.

        I apologise for the time taken to reply, but I had assumed the site would email me by default if I got a reply and, as it hadn't, assumed no one had responded. (Possible server issue at my end.)

        Setting up the namespace in the way shown worked perfectly. This site should show PayPal options for each submit. I'd happily send each Monk a dollar or two for the mental sanity you afforded me.

        Thank you so much.

        PS: Matt Sergeant was right. LibXML is super quick when used properly.

        • My Data Pool using XPath - Parsing took 3 hours
        • My Data Pool using LibXML - Parsing takes 20 minutes. - Fantastic!
