PerlMonks  

XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there?

by bobn (Chaplain)
on Aug 08, 2020 at 02:52 UTC (#11120492, perlquestion)

bobn has asked for the wisdom of the Perl Monks concerning the following question:

So I started playing around with XML parsing (well, HTML, but it's well enough formed that I can use XML parsers on it). I ran into something on the Perl side of things I don't understand.

I get a nodeset, start walking through it and getting text out, but for each node I get the text contained in the node's element AND the text of all of its descendants (contained elements).

I'm getting this with XML::LibXML::XPathContext, but it happens with XML::XPath as well.

The event-driven parsers I've tried don't seem to have this issue - they think that text belongs to the innermost containing element, just like I do. lxml.etree in Python, their binding for libxml2, does not do this (though it definitely has oddities of its own - check out "tail text" sometime, it's a doozy!).

I'm going to stop now, 'coz I'm becoming increasingly sure I'm just missing something stupid.

Is it supposed to do this, and if so, how do I get at just the text for the outermost element of my node?

#!/usr/bin/perl
use XML::LibXML::XPathContext;

our $contents = <<EOT;
<html>
  <head>
    <title>Title_Text</title>
  </head>
  <body>
    <p>paragraph_text</p>
    <div>
      <div>
        innnermost_text
      </div>
    </div>
  </body>
</html>
EOT

open my $fh, '>', './x.html';
print $fh $contents;
close $fh;

my $init_node = XML::LibXML->new->parse_file('./x.html');
my $xp = XML::LibXML::XPathContext->new($init_node);

my $i = 0;
my $nodeset = $xp->findnodes('//*');
for my $node ($nodeset->get_nodelist) {
    my $elname = $node->getName();
    print qq[<$elname> node - $i\n];
    my $text = '';
    $text = $node->string_value();   # this brings in text of
                                     # *all* descendant nodes
    $text =~ s/(\s)+/$1/msg;
    print 'Text = ', $text, "\n";
    $i++;
}

Produces:

<html> node - 0
Text = Title_Text paragraph_text innnermost_text
<head> node - 1
Text = Title_Text
<title> node - 2
Text = Title_Text
<body> node - 3
Text = paragraph_text innnermost_text
<p> node - 4
Text = paragraph_text
<div> node - 5
Text = innnermost_text
<div> node - 6
Text = innnermost_text

--Bob Niederman,

All code given here is UNTESTED unless otherwise stated.

Replies are listed 'Best First'.
Re: XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there?
by haukex (Bishop) on Aug 08, 2020 at 07:25 UTC
    I get a nodeset, start walking through it and getting text out, but when it comes out, for each node I get the text contained in node element AND the text of all of its descendants (contained elements).

    This makes sense to me. Consider this piece of HTML: <p>Hello, <b>World</b>!</p> - the <p> element has three child nodes: a text node ("Hello, "), the <b> element, and another text node ("!"). But if you ask the question "what text is contained in that paragraph?" then wouldn't the natural answer be "Hello, World!" instead of "Hello, !"?

    Otherwise, what question are you asking? If it's "what are the children of this <p> element that are text nodes", you'll have to code that explicitly, and you may get any number of text nodes (in the aforementioned example, it's two, but consider that any whitespace like newlines and indentation are text nodes too, e.g. the <body> in your example has three text children, all whitespace).

    There are two ways to do that. One is to iterate over the childNodes of a node, checking their nodeType for XML_TEXT_NODE and XML_CDATA_SECTION_NODE. The other is to use an XPath expression like '//p/child::text()'. OTOH, event-based parsers will return nodes as they encounter them.

    Perhaps you could explain what you're trying to do and what your expected output is?
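A minimal sketch of the first approach described above (walking childNodes and checking nodeType), set next to textContent for comparison. It assumes XML::LibXML is installed and uses the one-paragraph example from the reply; the variable names are illustrative only.

```perl
use strict;
use warnings;
use XML::LibXML qw(:libxml);    # imports XML_TEXT_NODE and friends

my $doc = XML::LibXML->load_xml( string => '<p>Hello, <b>World</b>!</p>' );
my $p   = $doc->documentElement;

# textContent concatenates the text of the element and all its descendants.
my $all = $p->textContent;

# Keep only the direct children that are text or CDATA nodes.
my @own = map  { $_->data }
          grep { $_->nodeType == XML_TEXT_NODE
              || $_->nodeType == XML_CDATA_SECTION_NODE }
          $p->childNodes;

print "textContent: $all\n";                  # Hello, World!
print "own text:    ", join( '', @own ), "\n";    # Hello, !
```

On a real document the list of direct text children will usually also contain whitespace-only nodes from indentation, so some filtering or trimming is likely needed.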

    $node->string_value();

    Note that this method is undocumented (there's a method with that name in XML::LibXML::NodeList, but your $nodes are XML::LibXML::Elements); you should use textContent instead.

    Note that you don't need XML::LibXML::XPathContext unless the document you're parsing contains namespaces; the regular XML::LibXML::Node has a findnodes too.
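To illustrate the point about namespaces, here is a small sketch under the assumption of two tiny made-up documents, one plain and one using the XHTML default namespace. The prefix "x" is an arbitrary choice for this example, not something the original post uses.

```perl
use strict;
use warnings;
use XML::LibXML;

# No namespaces: findnodes works directly on the document node.
my $plain = XML::LibXML->load_xml(
    string => '<html><body><p>plain</p></body></html>' );
my @plain_p = $plain->findnodes('//p');

# Default namespace: an unprefixed name in XPath matches nothing here.
my $xhtml = XML::LibXML->load_xml( string =>
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>namespaced</p></body></html>' );
my @missed = $xhtml->findnodes('//p');

# Binding a prefix through XPathContext makes the element reachable.
my $xpc = XML::LibXML::XPathContext->new($xhtml);
$xpc->registerNs( x => 'http://www.w3.org/1999/xhtml' );
my @found = $xpc->findnodes('//x:p');

printf "plain: %d, unprefixed on XHTML: %d, prefixed: %d\n",
    scalar @plain_p, scalar @missed, scalar @found;
```

The sample in the question has no xmlns declaration, so it falls into the first case.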

    Minor edits.

      Consider this piece of HTML: <p>Hello, <b>World</b>!</p> - the <p> element has three child nodes: a text node ("Hello, "), the <b> element, and another text node ("!"). But if you ask the question "what text is contained in that paragraph?" then wouldn't the natural answer be "Hello, World!" instead of "Hello, !"?
      Actually, I *do* believe that the text "World" belongs to the b element and, therefore, not to the p element. Printing out specifically the p element's text should not include it. Sure as hell, printing the text for <html> should not print out all the text of all the descendant nodes.

      I've tried 3 other pieces of code: HTML::Parser and, in Python, lxml.etree (bindings to libxml2, as is XML::LibXML) and xml.parsers.expat, comparable to HTML::Parser. They all agree that text belongs to the innermost containing element, and no other. (Well, except lxml.etree, which thinks that elements mixed in with the text of a parent element somehow suck up the text after them in something known as "tail text" - I never heard of it before and it's really hard to find anything about it on the internet that *isn't* associated with lxml and Python. I think they just made that crap up.) So that's where I am: 3 other pieces of software disagree with this one, and I can't see that I've done anything incorrectly.

      --Bob Niederman, http://bob-n.com

        Actually, I *do* believe that the text "World" belongs to the b element and, therefore, not to the p element.

        Sure, that's up to you. I can't speak to how other modules implemented it, but I'd refer you to the libxml2 documentation, and the Document Object Model Specification for all the "official" details.

        Anyway, I described two ways you can get the text nodes of the current node. Using the XPath expression I showed is probably easiest. I can't really say more since you haven't described what it is you're trying to do with the document.

        use warnings;
        use strict;
        use XML::LibXML;

        my $doc = XML::LibXML->load_xml( string => <<'EOT' );
        <html>
          <head>
            <title>Title_Text</title>
          </head>
          <body>
            <p>paragraph_text</p>
            <div>
              <div>
                innnermost_text
              </div>
            </div>
          </body>
        </html>
        EOT

        for my $node ($doc->findnodes('//*')) {
            print "<<<", $node->nodeName, ">>>\n";
            my @texts = map { $_->data } $node->findnodes('./text()');
            use Data::Dump; dd @texts;  # Debug
        }

        __END__

        <<<html>>>
        (" \n ", " \n ", " ")
        <<<head>>>
        (" ", " ")
        <<<title>>>
        "Title_Text"
        <<<body>>>
        (" \n ", "\n ", " \n ")
        <<<p>>>
        "paragraph_text"
        <<<div>>>
        (" \n ", "\n ")
        <<<div>>>
        " \n innnermost_text\n "

        You could also use XML::LibXML::SAX to get an event-based parser.
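As a rough sketch of the event-based route, the handler below attributes each run of character data to the innermost open element, which is the model described earlier in the thread. It assumes the XML::SAX distribution is installed alongside XML::LibXML; the package name OwnTextHandler and its hash keys are hypothetical and chosen only for this example.

```perl
use strict;
use warnings;

package OwnTextHandler;    # hypothetical handler, not part of any module
use parent 'XML::SAX::Base';

# Track the stack of currently open elements.
sub start_element {
    my ( $self, $el ) = @_;
    push @{ $self->{stack} }, $el->{Name};
}

sub end_element {
    my ($self) = @_;
    pop @{ $self->{stack} };
}

# Credit character data to the innermost open element only.
sub characters {
    my ( $self, $chars ) = @_;
    my $current = $self->{stack}[-1];
    return unless defined $current;
    $self->{text}{$current} .= $chars->{Data};
}

package main;
use XML::LibXML::SAX;

my $handler = OwnTextHandler->new;
my $parser  = XML::LibXML::SAX->new( Handler => $handler );
$parser->parse_string('<p>Hello, <b>World</b>!</p>');

for my $name ( sort keys %{ $handler->{text} } ) {
    print "<$name>: $handler->{text}{$name}\n";
}
```

Text is keyed by element name here, so repeated elements such as the two nested <div>s in the question would be merged; keying by a path or counter would be needed to keep them apart.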
