XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there?

bobn has asked for the wisdom of the Perl Monks concerning the following question:

So I started playing around with XML parsing (well, HTML, but it's well enough formed that I can use XML parsers on it). I ran into something on the Perl side of things I don't understand.

I get a nodeset, start walking through it and getting text out, but when it comes out, for each node I get the text contained in that node's element AND the text of all of its descendants (contained elements).

I'm getting this with XML::LibXML::XPathContext, but it happens with XML::XPath as well.

The event-driven parsers I've tried don't seem to have this issue - they think that text belongs to the innermost containing element, just like I do. lxml.etree in Python, their binding for libxml2, does not do this (though it definitely has oddities of its own - check out "tail text" sometime, it's a doozy!).

I'm going to stop now, 'coz I'm becoming increasingly sure I'm just missing something stupid.

Is it supposed to do this, and if so, how do I get at just the text for the outermost element of my node?

#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

our $contents = <<EOT;
<html> 
    <head> <title>Title_Text</title> </head> 
    <body> 
        <p>paragraph_text</p>
        <div> 
            <div> 
                innnermost_text
            </div>
        </div> 
    </body> </html>
EOT

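# write the sample document to disk so parse_file has something to read
open my $fh, '>', './x.html' or die "Cannot write ./x.html: $!";
print $fh $contents;
close $fh;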
my $init_node = XML::LibXML->new->parse_file('./x.html');
my $xp = XML::LibXML::XPathContext->new($init_node);

my $i= 0;
my $nodeset = $xp->findnodes('//*');
for my $node ($nodeset->get_nodelist) 
{
    my $elname = $node->getName();
    print qq[<$elname> node - $i\n];
    my $text = '';
    $text = $node->string_value();  
       # this brings in text of 
       # *all* descendant nodes
    $text =~ s/(\s)+/$1/msg;
    print 'Text = ', $text, "\n";
    $i++;
}

Produces:

<html> node - 0
Text =  Title_Text paragraph_text innnermost_text 
<head> node - 1
Text =  Title_Text 
<title> node - 2
Text = Title_Text
<body> node - 3
Text =  paragraph_text innnermost_text 
<p> node - 4
Text = paragraph_text
<div> node - 5
Text =  innnermost_text 
<div> node - 6
Text =  innnermost_text

--Bob Niederman,

All code given here is UNTESTED unless otherwise stated.

Re: XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there? by haukex (Archbishop) on Aug 08, 2020 at 07:25 UTC
> I get a nodeset, start walking through it and getting text out, but when it comes out, for each node I get the text contained in node element AND the text of all of its descendants (contained elements).

This makes sense to me. Consider this piece of HTML: `<p>Hello, <b>World</b>!</p>` - the `<p>` element has three child nodes: a text node (`"Hello, "`), the `<b>` element, and another text node (`"!"`). But if you ask the question "what text is contained in that paragraph?" then wouldn't the natural answer be `"Hello, World!"` instead of `"Hello, !"`?

Otherwise, what question are you asking? If it's "what are the children of this `<p>` element that are text nodes", you'll have to code that explicitly, and you may get any number of text nodes (in the aforementioned example, it's two, but consider that any whitespace like newlines and indentation are text nodes too, e.g. the `<body>` in your example has three text children, all whitespace). There are two ways to do that: iterate over the `childNodes` of a node, checking their `nodeType` for `XML_TEXT_NODE` and `XML_CDATA_SECTION_NODE`, or use an XPath expression like `'//p/child::text()'`.

OTOH, event-based parsers will return nodes as they encounter them. Perhaps you could explain what you're trying to do and what your expected output is?

> `$node->string_value();`

Note this method is undocumented (there's a method with that name in XML::LibXML::NodeList, but your `$node`s are XML::LibXML::Elements); you should use `textContent` instead.

Note that you don't need XML::LibXML::XPathContext unless the document you're parsing contains namespaces; the regular XML::LibXML::Node has a `findnodes` too.

Minor edits.
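A minimal sketch of the first approach (walking `childNodes` and filtering on `nodeType`), using the `<p>Hello, <b>World</b>!</p>` example from above. The helper name `own_text` is made up for illustration and is not part of XML::LibXML; joining the pieces into one string is also just one possible choice, since the text children could equally be kept as a list.

```perl
use strict;
use warnings;
use XML::LibXML qw(:libxml);   # imports XML_TEXT_NODE, XML_CDATA_SECTION_NODE, ...

# Hypothetical helper: concatenate only the text and CDATA nodes that are
# direct children of $node, ignoring anything inside child elements.
sub own_text {
    my ($node) = @_;
    return join '',
        map  { $_->data }
        grep {     $_->nodeType == XML_TEXT_NODE()
                || $_->nodeType == XML_CDATA_SECTION_NODE() }
        $node->childNodes;
}

my $doc = XML::LibXML->load_xml( string => '<p>Hello, <b>World</b>!</p>' );
my $p   = $doc->documentElement;

print "own text:    '", own_text($p), "'\n";      # 'Hello, !'
print "textContent: '", $p->textContent, "'\n";   # 'Hello, World!'
```

On a real document the result of `own_text` will usually include the whitespace used for indentation, so it may need trimming before use.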
Re^2: XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there? by bobn (Chaplain) on Aug 09, 2020 at 00:11 UTC
> Consider this piece of HTML: `<p>Hello, <b>World</b>!</p>` - the `<p>` element has three child nodes: a text node ("Hello, "), the `<b>` element, and another text node ("!"). But if you ask the question "what text is contained in that paragraph?" then wouldn't the natural answer be "Hello, World!" instead of "Hello, !"?

Actually, I do believe that the text "World" belongs to the b element and, therefore, not to the p element. Printing out specifically the p element's text should not include it. Sure as hell, printing the text for `<html>` should not print out all the text of all the descendant nodes.

I've tried 3 other pieces of code: HTML::Parser and, in Python, lxml.etree (bindings to libxml2, as is XML::LibXML) and xml.parsers.expat, which is comparable to HTML::Parser. They all agree that text belongs to the innermost containing element, and no other. (Well, except lxml.etree, which thinks that elements mixed in with the text of a parent element somehow suck up the text after them in something known as "tail text" - I never heard of it before and it's really hard to find anything about it on the internet that isn't associated with lxml and Python. I think they just made that crap up.)

So that's where I am: 3 other pieces of software disagree with this one, and I can't see that I've done anything incorrectly.

--Bob Niederman, http://bob-n.com

All code given here is UNTESTED unless otherwise stated.
Re^3: XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there? by haukex (Archbishop) on Aug 09, 2020 at 08:27 UTC
> Actually, I do believe that the text "World" belongs to the b element and, therefore, not to the p element.

Sure, that's up to you. I can't speak to how other modules implemented it, but I'd refer you to the libxml2 documentation, and the Document Object Model Specification for all the "official" details.

Anyway, I described two ways you can get the text nodes of the current node. Using the XPath expression I showed is probably easiest. I can't really say more since you haven't described what it is you're trying to do with the document.

use warnings;
use strict;
use XML::LibXML;

my $doc = XML::LibXML->load_xml( string => <<'EOT' );
<html> 
    <head> <title>Title_Text</title> </head> 
    <body> 
        <p>paragraph_text</p>
        <div> 
            <div> 
                innnermost_text
            </div>
        </div> 
    </body> </html>
EOT

for my $node ($doc->findnodes('//*')) {
    print "<<<", $node->nodeName, ">>>\n";
    my @texts = map { $_->data } $node->findnodes('./text()');
    use Data::Dump; dd @texts;  # Debug
}
__END__
<<<html>>>
(" \n    ", " \n    ", " ")
<<<head>>>
(" ", " ")
<<<title>>>
"Title_Text"
<<<body>>>
(" \n        ", "\n        ", " \n    ")
<<<p>>>
"paragraph_text"
<<<div>>>
(" \n            ", "\n        ")
<<<div>>>
" \n                innnermost_text\n            "

You could also use XML::LibXML::SAX to get an event-based parser.
Re^4: XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there? by bobn (Chaplain) on Aug 10, 2020 at 06:01 UTC