XML::LibXML - How to extract an element including the elements within?

by TravelAddict (Acolyte)
on May 09, 2019 at 19:10 UTC ( #1233525=perlquestion )

TravelAddict has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm trying to extract the content of a node with XML::LibXML, and I need the result to include any elements embedded within that node.

For example, I have this XML string: <doc><text>From mobile, <ph1 i="1" type="33" x="1"/>dial<ph2 i="1"/> this number:</text></doc>

What I need as a result is this: From mobile, <ph1 i="1" type="33" x="1"/>dial<ph2 i="1"/> this number: However, I cannot extract the embedded elements (<ph1 i="1" type="33" x="1"/> and <ph2 i="1"/>); they disappear no matter what I try.

This is the sample code to reproduce the issue:

    use strict;
    use warnings;
    use XML::LibXML;

    my $xmlString = "<doc><text>From mobile, <ph1 i=\"1\" type=\"33\" x=\"1\"/>dial<ph2 i=\"1\"/> this number:</text></doc>";
    my $dom = XML::LibXML->load_xml(string => $xmlString);

    my $to_literal = $dom->to_literal('/doc/text');
    print "to_literal:\t$to_literal\n";

    my $findvalue = $dom->findvalue('/doc/text');
    print "findvalue:\t$findvalue\n";

    my $textContent = $dom->textContent('/doc/text');
    print "textContent:\t$textContent\n";

This is what I get when I run this code:

    to_literal:     From mobile, dial this number:
    findvalue:      From mobile, dial this number:
    textContent:    From mobile, dial this number:

It looks like "to_literal", "findvalue" and "textContent" take out the embedded placeholders, but I want to keep them in.

Is there any way to get what I need with a simple method?

Thanks very much in advance!

TA

Replies are listed 'Best First'.
Re: XML::LibXML - How to extract an element including the elements within?
by poj (Abbot) on May 09, 2019 at 20:29 UTC

    Try joining the child nodes

        my ($node) = $dom->findnodes('/doc/text');
        my $text = join '', $node->childNodes();
        print "$text\n";
    poj
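The join above works because XML::LibXML node objects stringify to their serialized XML. A minimal sketch that makes that step explicit by calling toString on each child is shown here; the helper name inner_xml is hypothetical and not part of XML::LibXML.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper (not part of XML::LibXML): serialize all children
# of a node, i.e. its "inner XML", without the enclosing tags.
sub inner_xml {
    my ($node) = @_;
    # toString is what the overloaded stringification calls implicitly
    return join '', map { $_->toString } $node->childNodes;
}

my $dom = XML::LibXML->load_xml(
    string => '<doc><text>From mobile, <ph1 i="1" type="33" x="1"/>dial<ph2 i="1"/> this number:</text></doc>'
);

# findnodes returns a list in list context; take the first match
my ($node) = $dom->findnodes('/doc/text');
print inner_xml($node), "\n";
```

Wrapping the join in a small sub keeps the call site readable if the same extraction is needed for several nodes.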

      Thanks poj,

      This works great for me, and I also realized that I can differentiate the type of node that I get (plain text vs. element):

          my ($node) = $dom->findnodes('/doc/text');
          my @children = $node->childNodes;
          foreach my $child (@children) {
              my $type = ref($child);
              # $type contains the type of the node, ex: "XML::LibXML::Element"
          }

      From there, I can have a plain text string (no tags) and a separate list of all the elements embedded.

      Have a great day!

      TA

        my $type = ref($child);

        This is a very brittle way of doing it - it's possible that XML::LibXML::Element could be subclassed, and this check would fail, which is why I would recommend against it. It's possible to do $child->isa('XML::LibXML::Element'), but XML::LibXML provides an API to check the node type: $child->nodeType == XML_ELEMENT_NODE (admittedly not a very Perlish way of doing it, but it's based on the libxml2 API).
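As a rough sketch of the nodeType check described above, the loop below separates the plain text from the embedded elements. The variable names $plain and @elements are illustrative only, and the sample input is the string from the original question.

```perl
use strict;
use warnings;
use XML::LibXML qw(:libxml);   # imports XML_ELEMENT_NODE, XML_TEXT_NODE, ...

my $dom = XML::LibXML->load_xml(
    string => '<doc><text>From mobile, <ph1 i="1" type="33" x="1"/>dial<ph2 i="1"/> this number:</text></doc>'
);
my ($node) = $dom->findnodes('/doc/text');

my $plain = '';     # concatenated text nodes, no tags
my @elements;       # serialized embedded elements, in document order

for my $child ($node->childNodes) {
    my $type = $child->nodeType;
    if ($type == XML_ELEMENT_NODE) {
        push @elements, $child->toString;
    }
    elsif ($type == XML_TEXT_NODE) {
        $plain .= $child->nodeValue;
    }
    # other node types (comments, CDATA sections, ...) are ignored here
}

print "plain:   $plain\n";
print "element: $_\n" for @elements;
```

Comparing nodeType against the exported constants keeps working if the element class is subclassed, which is the case the ref() check would miss.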

Re: XML::LibXML - How to extract an element including the elements within?
by tangent (Vicar) on May 09, 2019 at 20:15 UTC
    You need to find the node itself and then print out the literal value of that node using toString(). That will also print out the enclosing tags but simple regular expressions can strip them out:
        my ($node) = $dom->findnodes('/doc/text');
        my $string = $node->toString;
        print "toString:\n$string\n";
        # remove enclosing tags
        $string =~ s/^<[^>]+>//;
        $string =~ s/<[^>]+>$//;
        print "toString:\n$string\n";
    Output:
        toString:
        <text>From mobile, <ph1 i="1" type="33" x="1"/>dial<ph2 i="1"/> this number:</text>
        toString:
        From mobile, <ph1 i="1" type="33" x="1"/>dial<ph2 i="1"/> this number:

      $string =~ s/^<[^>]+>//; $string =~ s/<[^>]+>$//;

      No, please don't use regular expressions to parse XML...

      An alternative to what poj showed with XML::LibXML::DocumentFragment:

          use warnings;
          use strict;
          use XML::LibXML;

          my $doc = XML::LibXML->load_xml(string =>
              q{<doc><text>From mobile, <ph1 i="1" type="33" x="1"/>dial<ph2 i="1"/> this number:</text></doc>});
          for my $node ($doc->findnodes('/doc/text')) {
              my $frag = $doc->createDocumentFragment;
              $frag->appendChild($_->cloneNode(1)) for $node->childNodes;
              print $frag, "\n";
          }
          __END__
          From mobile, <ph1 i="1" type="33" x="1"/>dial<ph2 i="1"/> this number:

        Hello haukex

        I agree with you about using regular expressions in this case. I was trying to avoid them, and I got two good working solutions. I preferred poj's because it's simpler, but I'm sure that once I explore and understand your solution better, it will help me solve other problems that I haven't run into yet.

        Have a great day too!

        TA

      Thanks tangent

      I was actually thinking of doing what you suggest, except with this: $string =~ s/^(.*?)<text>(.*?)<\/text>(.*?)$/$2/sm;

      In the end I decided to go with the solution that poj suggested; it looks quite clean.

      TA
