Mixed XML content with XML::LibXML

by lobeydosser (Novice)
on Sep 29, 2008 at 13:43 UTC (#714351, perlquestion)
lobeydosser has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I'm trying to work with mixed-content XML files using XML::LibXML, but I have a problem. Take an example data set such as:

    <root>
      <node>text1
        <child1>data</child1>
        <child2>data2</child2>
      </node>
      <node>text2
        <child1>blah</child1>
        ..etc..
      </node>
    </root>
How can I get the text value of each node without also getting the content of child1 and child2? Using the following code:

    use strict;
    use warnings;
    use XML::LibXML;

    my $file = $ARGV[0];
    print $file."\n";
    parse($file);

    sub parse {
        my $filename = shift;
        my $parser = XML::LibXML->new();
        my $doc = $parser->parse_file($filename);
        # Both string_value and to_literal include the text of all descendants.
        foreach my $node ($doc->findnodes('/root/node')) {
            print "string value :".$node->string_value."\n";
            print "to literal :".$node->to_literal."\n";
            print "node name :".$node->nodeName()."\n";
        }
    }
I see the following output:

    string value :text1 data data2
    to literal :text1 data data2
    node name :node
    string value :text2 blah blah2
    to literal :text2 blah blah2
    node name :node
Ideally I would like to access just the values 'text1' and 'text2'. I hope this all makes sense. Thanks in advance for any answers.

Re: Mixed XML content with XML::LibXML
by ikegami (Pope) on Sep 29, 2008 at 13:53 UTC
      Then again, I think that would return four nodes for
      <root>
        <node> text1a <child1>data</child1> <child2>data2</child2> text1b </node>
        <node> text2a <child1>blah</child1> text2b </node>
      </root>

      If that's a problem, find /root/node elements, then for each of those elements, find child::text().

      Sorry, I don't have XML::LibXML here to test.
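
      A minimal sketch of the two-step approach described above (find the /root/node elements, then ask each one for its child::text() nodes). The helper name direct_text_per_node, the whitespace trimming, and the skipping of blank text nodes are assumptions added for illustration; they are not part of the original suggestion.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: for each /root/node element, collect its direct
# text children (not the text of child elements), trimmed, skipping
# whitespace-only nodes. Returns one array ref per element.
sub direct_text_per_node {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my @result;
    for my $node ($doc->findnodes('/root/node')) {
        my @pieces;
        for my $text ($node->findnodes('child::text()')) {
            my $value = $text->nodeValue;
            $value =~ s/^\s+|\s+$//g;    # trim leading/trailing whitespace
            push @pieces, $value if length $value;
        }
        push @result, \@pieces;
    }
    return @result;
}

my $xml = <<'EOT';
<root>
  <node> text1a <child1>data</child1> <child2>data2</child2> text1b </node>
  <node> text2a <child1>blah</child1> text2b </node>
</root>
EOT

for my $pieces (direct_text_per_node($xml)) {
    print join(' | ', @$pieces), "\n";
}
```

      Grouping the text per element keeps text1a and text1b together, so the caller can decide whether to join them or use only the first piece.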

        Thanks guys, I managed to get it working now.

        use strict;
        use warnings;
        use XML::LibXML;

        my $file = $ARGV[0];
        print $file."\n";
        parse($file);

        sub parse {
            my $filename = shift;
            my $parser = XML::LibXML->new();
            my $doc = $parser->parse_file($filename);
            foreach my $node ($doc->findnodes('/root/node')) {
                # ./text() selects only the direct text children of this node
                print "textval :".$node->findnodes('./text()')->to_literal."\n";
            }
        }

        This does the trick. The particular data set I'm using should never have more than one text value mixed in with the child nodes.
Re: Mixed XML content with XML::LibXML
by ForgotPasswordAgain (Deacon) on Sep 29, 2008 at 13:58 UTC
    I'd first of all call $node->normalize, to make sure the text nodes aren't weirdly split apart. Then call $node->firstChild (both documented in perldoc XML::LibXML::Node). You might also check that the child it returns has nodeType == XML_TEXT_NODE (see the "EXPORT TAGS" section of perldoc XML::LibXML for all the node types) to make sure it's a text node. I believe the "node name" of text nodes is '#text', so you could even test for that.
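
    A rough sketch of that sequence, assuming the sample document from the question. The helper name leading_text and the trimming step are made up for illustration; normalize, firstChild, nodeType and nodeValue are the XML::LibXML::Node methods, and the :libxml export tag supplies XML_TEXT_NODE.

```perl
use strict;
use warnings;
use XML::LibXML qw(:libxml);    # :libxml exports the node-type constants

# Hypothetical helper: return the trimmed text that opens an element,
# or nothing if the element's first child is not a text node.
sub leading_text {
    my ($node) = @_;
    $node->normalize;                        # merge adjacent text nodes
    my $first = $node->firstChild;
    return unless defined($first) && $first->nodeType == XML_TEXT_NODE;
    my $value = $first->nodeValue;
    $value =~ s/^\s+|\s+$//g;                # trim surrounding whitespace
    return $value;
}

my $doc = XML::LibXML->load_xml(string => <<'EOT');
<root>
<node>text1 <child1>data</child1> <child2>data2</child2> </node>
<node>text2 <child1>blah</child1> </node>
</root>
EOT

for my $node ($doc->findnodes('/root/node')) {
    my $text = leading_text($node);
    print defined $text ? "$text\n" : "(no leading text)\n";
}
```

    Checking the node type guards against elements that start with a child element rather than text, where firstChild would otherwise hand back an element node.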
Re: Mixed XML content with XML::LibXML
by CountZero (Bishop) on Sep 29, 2008 at 18:33 UTC
    What about this?
    use strict;
    use warnings;
    use XML::LibXML;

    parse();

    sub parse {
        my $parser = XML::LibXML->new();
        my $doc = $parser->parse_string(<<'EOT');
    <root>
    <node>text1 <child1>data</child1> <child2>data2</child2> </node>
    <node>text2 <child1>blah</child1> <child2>data3</child2> </node>
    </root>
    EOT
        # text()[1] selects only the first text child of each /root/node
        foreach my $node ($doc->findnodes('/root/node/text()[1]')) {
            print "string value :".$node->string_value."\n";
            print "to literal :".$node->to_literal."\n";
            print "node name :".$node->nodeName()."\n";
            print "____________________\n";
        }
    }

    Output:

    string value :text1
    to literal :text1
    node name :#text
    ____________________
    string value :text2
    to literal :text2
    node name :#text
    ____________________

    ikegami was close, but he missed adding the proximity position of the first member of the node-set (the [1]).

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      "ikegami was close, but he missed adding the proximity position of the first member of the node-set (the [1])."

      I most definitely did not want to drop some of the text.

        For lack of a DTD, you could not know whether there was to be additional text-data, but from the example given it seemed that there was only one such item and the OP confirmed later that this was indeed the case.

        CountZero
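
        To make the difference between the two expressions concrete, here is a small sketch that runs both against a document with text on either side of the child element. The sample XML and the helper name trimmed_values are assumptions for illustration only.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(string => <<'EOT');
<root>
<node>text1a <child1>data</child1> text1b</node>
<node>text2a <child1>blah</child1> text2b</node>
</root>
EOT

# Hypothetical helper: evaluate an XPath expression that selects text
# nodes and return their trimmed, non-blank values.
sub trimmed_values {
    my ($doc, $xpath) = @_;
    my @values;
    for my $text ($doc->findnodes($xpath)) {
        my $v = $text->nodeValue;
        $v =~ s/^\s+|\s+$//g;
        push @values, $v if length $v;
    }
    return @values;
}

my @all   = trimmed_values($doc, '/root/node/text()');
my @first = trimmed_values($doc, '/root/node/text()[1]');

print "all:   @all\n";
print "first: @first\n";
```

        On this input, text() yields text1a, text1b, text2a and text2b, while text()[1] yields only text1a and text2a, because the [1] predicate applies to the text children of each node separately. Which one is right depends on whether trailing text can occur in the data.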
