http://www.perlmonks.org?node_id=1200606

sumeetgrover has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed Monks,

I am using the HTML::TreeBuilder::XPath module, and when I use text() in an XPath expression, the call fails:

$tree->findnodes_as_string('/html/body//td/text()');

I get the following error:

Can't locate object method "toString" via package "HTML::TreeBuilder::XPath::TextNode" at .../lib/XML/XPathEngine.pm line 125

I have since found out that this is a known bug in this module (see here). My question is:
Can you recommend a reliable CPAN module which I can use to parse HTML and run an XPath query on it?

Thanks a lot.

Replies are listed 'Best First'.
Re: Any Alternative to HTML::TreeBuilder::XPath?
by choroba (Cardinal) on Oct 03, 2017 at 13:06 UTC
    XML::LibXML can parse HTML if told to, but only if your HTML is more or less well formed and valid. Adding the recover option makes it a bit more robust:
    #!/usr/bin/perl
    use warnings;
    use strict;

    use XML::LibXML;

    my $dom = 'XML::LibXML'->load_html(string => << '__HTML__', recover => 1);
    <html><h1>Hello world</h2></html>
    __HTML__
    print $dom->findvalue('/html/body/h1/text()');
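
    Applied to the table cells from the question, the same approach might look like the sketch below. The sample HTML is made up for illustration, and the two suppress_* options are only there to keep the parser quiet on messy input; adjust them to taste.

```perl
#!/usr/bin/perl
use warnings;
use strict;

use XML::LibXML;

# Hypothetical sample input, not the original poster's document.
my $html = '<html><body><table><tr>'
         . '<td>Foo</td><td>Ba<em>r</em></td>'
         . '</tr></table></body></html>';

my $dom = XML::LibXML->load_html(
    string            => $html,
    recover           => 1,    # carry on after recoverable parse errors
    suppress_errors   => 1,    # do not report parse errors
    suppress_warnings => 1,    # do not report parse warnings
);

# text() selects only the direct text children of each td,
# so text inside nested elements such as <em> is not included.
my @texts = map { $_->nodeValue } $dom->findnodes('/html/body//td/text()');
print "$_\n" for @texts;
```

    Note that text() returns only the text directly inside each td. If you want the text of nested elements too, call findvalue or textContent on the td nodes themselves.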

      Thanks! This one works exactly how I need to query XPath. Although the HTML I am parsing is malformed (blame the other developer!), this is a really good solution!

Re: Any Alternative to HTML::TreeBuilder::XPath?
by haukex (Archbishop) on Oct 03, 2017 at 12:40 UTC

    What is your expected output? Perhaps you just need to implement a workaround:

    use warnings;
    use strict;

    use HTML::TreeBuilder::XPath;
    use Data::Dump;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse(<<'ENDHTML');
    <html><body><table>
    <tr> <td>Foo</td> <td>Ba<em>r</em> </td> </tr>
    <tr> <td>Quz</td> <td><b>Ba<i>z</i></b></td> </tr>
    </table></body></html>
    ENDHTML
    $tree->eof;

    dd map {$_->as_text} $tree->findnodes('/html/body//td');
    dd $tree->findnodes_as_strings('/html/body//td');

    __END__
    ("Foo", "Bar", "Quz", "Baz")
    ("Foo", "Bar", "Quz", "Baz")
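
    Both calls above return all the text below each td, including text in nested elements, which is not quite what td/text() would select. If only the direct text children are wanted, one possible workaround is to filter each td's content_list for plain strings. This is a sketch that assumes the usual HTML::Element behaviour, where text children are stored as plain strings and child elements as objects; the sample HTML is made up.

```perl
use warnings;
use strict;

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
# Hypothetical sample input for illustration.
$tree->parse('<html><body><table><tr>'
           . '<td>Foo</td><td>Ba<em>r</em></td>'
           . '</tr></table></body></html>');
$tree->eof;

# content_list returns child elements (references) and text (plain strings);
# keeping only the non-references leaves each td's direct text children.
my @direct_text = map { grep { !ref } $_->content_list }
                  $tree->findnodes('/html/body//td');
print "$_\n" for @direct_text;

$tree->delete;    # free the tree when done
```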
Re: Any Alternative to HTML::TreeBuilder::XPath?
by marto (Cardinal) on Oct 03, 2017 at 12:40 UTC
Re: Any Alternative to HTML::TreeBuilder::XPath?
by LanX (Saint) on Oct 03, 2017 at 12:42 UTC
    I've never used it, but my guess is that you have an installation problem rather than a bug.

    Do all tests pass?

    Update:
    I hadn't seen your remark that it's a known bug, sorry.

    Cheers Rolf