PerlMonks  

HTML::TreeBuilder::LibXML creates multiple copies of the same result

by password (Beadle)
on Mar 10, 2018 at 20:36 UTC
password has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

It seems HTML::TreeBuilder::LibXML finds the same node 6 times (3 times if the newlines are removed from the HTML).

Shouldn't it be just 1 result?

Thank you!!!

use strict;
use warnings;
use diagnostics;
use HTML::TreeBuilder::LibXML;

my $body;
while (<DATA>) {
    $body .= $_;
}

my $tree  = HTML::TreeBuilder::LibXML->new_from_content($body);
my $xpath = '//div[@class="ccc"]/node()';
my @superCats = $tree->findnodes($xpath);
$tree->delete;

for my $superCat (@superCats) {
    my $sc = ${$superCat->parent}{'node'};
    print "$sc\n\n";
}
exit;

__DATA__
<div class="ccc">
<img src=""><h2>Hello</h2>
<div class='s'>
<a href="/">All</a>
</div>
</div>

Replies are listed 'Best First'.
Re: HTML::TreeBuilder::LibXML creates multiple copies of the same result
by haukex (Abbot) on Mar 10, 2018 at 20:56 UTC
    my $xpath = '//div[@class="ccc"]/node()';

    The way I understand that expression is "match any node of any kind that is a direct child of any <div> element with a class attribute equal to ccc." If I remove all the whitespace from the XML, the matching <div> has three children: <img src="">, <h2>Hello</h2>, and <div class='s'>. Since node() matches any kind of node, including text nodes, the whitespace between those elements is what it additionally matches when you put the whitespace back in. You can see all of this in action if you put print "[[",$superCat->as_XML,"]]\n"; as the first thing in your loop. In other words, your XPath is behaving correctly. If you only want to match the <div class="ccc"> itself, change the expression to //div[@class="ccc"].

    (Also note that your markup is not well-formed XML: the <img> tag isn't closed.)
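A minimal sketch of the difference described above, assuming the markup from the question with its line breaks in place. The helper name match_xml is made up for illustration; new_from_content, findnodes, as_XML and delete are the calls already used in this thread. The question reports six matches for the node() expression and three once the newlines are removed, and the extra matches are the whitespace-only text nodes between the elements.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::LibXML;

# Markup from the question; the line breaks are assumed to sit where shown.
my $html = <<'HTML';
<div class="ccc">
<img src=""><h2>Hello</h2>
<div class='s'>
<a href="/">All</a>
</div>
</div>
HTML

# Hypothetical helper: parse $html, run $xpath, and return the
# serialized form of every node that matched.
sub match_xml {
    my ($html, $xpath) = @_;
    my $tree  = HTML::TreeBuilder::LibXML->new_from_content($html);
    my @found = map { $_->as_XML } $tree->findnodes($xpath);
    $tree->delete;
    return @found;
}

for my $xpath (
    '//div[@class="ccc"]/node()',   # every child, text nodes included
    '//div[@class="ccc"]/*',        # element children only
    '//div[@class="ccc"]',          # the <div> itself
) {
    my @found = match_xml($html, $xpath);
    printf "%s matched %d node(s)\n", $xpath, scalar @found;
    print "  [[$_]]\n" for @found;
}
```

The bracketed output makes the whitespace-only matches visible, which is the same trick as the print statement suggested above.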

      OMG thank you so much!!! As I recall, I copied "/node()" from someone's code where they called it a "hack" to return the HTML, and not just values. So I dragged it through a few of my scripts, and never had a problem until now. I guess I was lucky with how the other HTML was formatted. Unless I lost something while copying the parser from my older script. Thank you thank you thank you!!!
Re: HTML::TreeBuilder::LibXML creates multiple copies of the same result
by Anonymous Monk on Mar 10, 2018 at 22:36 UTC
