PerlMonks  

Seeking a more robust variant of HTML::TreeBuilder::XPath

by Paradigma (Novice)
on Jul 12, 2019 at 11:58 UTC ( #11102724=perlquestion )

Paradigma has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

When using this module, I am sometimes unable to fully parse complex or broken pages. Although my element is shown in the browser's element viewer, I can't reach it in TreeBuilder's tree.

I would appreciate pointers to any other module, even a commercial one, conceptually similar or identical to TreeBuilder, that parses pages with complex structure more aggressively. Or could the incomplete parsing be resolved by tuning TreeBuilder's initialization parameters?

Replies are listed 'Best First'.
Re: Seeking a more robust variant of HTML::TreeBuilder::XPath
by holli (Monsignor) on Jul 12, 2019 at 12:44 UTC
    Although my element is shown in the browser's element viewer, I can't reach it in TreeBuilder's tree.
    What you see in the browser's developer tools and what you get when you fetch the same page (via curl, LWP, or whatever) are not necessarily identical. The usual causes are JavaScript altering the page after load, or something server-side that looks at the user agent and serves different content based on it. To check whether that is the case, look at the page source, not the current DOM.
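    A minimal sketch of that check, assuming the page source has already been fetched into a string. The HTML below is made up for illustration: the element with id "late" is only created by the script, so it never appears in the tree that HTML::TreeBuilder::XPath builds from the raw source.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical page source, as a non-browser client would receive it.
# The element with id "late" only exists after the script runs in a browser.
my $source = <<'HTML';
<html>
  <head><title>Demo</title></head>
  <body>
    <div id="app"></div>
    <script>
      var d = document.createElement("div");
      d.id = "late";
      document.getElementById("app").appendChild(d);
    </script>
  </body>
</html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($source);

# Count matches for an element present in the source and for one created by JS.
my $static_count = () = $tree->findnodes('//div[@id="app"]');
my $late_count   = () = $tree->findnodes('//div[@id="late"]');

print "in source: $static_count, script-created: $late_count\n";
$tree->delete;
```

    If the XPath for your element matches nothing in the fetched source, the parser is probably not at fault; the element is being added later in the browser.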


    holli

    You can lead your users to water, but alas, you cannot drown them.

      If this is the case, something like WWW::Selenium might be useful, in that it involves a "real" browser in the process, so any JS logic can wreak its evil will upon (that is, manipulate) the contents.

      The cake is a lie.
      The cake is a lie.
      The cake is a lie.

Re: Seeking a more robust variant of HTML::TreeBuilder::XPath
by hippo (Canon) on Jul 12, 2019 at 12:33 UTC
    When using this module, I am sometimes unable to fully parse complex or broken pages.

    Those are two quite separate problems. If the pages are merely complex then HTML::TreeBuilder::XPath should parse them. You would be helping everyone, yourself included, if you could report such bugs to the maintainer (ideally with an SSCCE) so that they can be fixed. Try to ensure that you are posting the bug against the right distribution, since it may be that one of the dependencies is actually at fault.
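    Before filing a bug, one setting worth ruling out is HTML::TreeBuilder's ignore_unknown attribute, which is documented as defaulting to true and causes tags the parser does not recognise to be left out of the tree. The sketch below uses a made-up <widget> tag and a hypothetical helper, count_widgets, to compare the two settings; whether this is the cause on any particular page is an assumption to verify.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up markup containing a tag name that the parser's tag tables do not know.
my $html = '<html><body><div id="main"><widget id="w1">payload</widget></div></body></html>';

# Hypothetical helper: parse $html with the given ignore_unknown setting
# and return how many <widget> elements end up in the tree.
sub count_widgets {
    my ($ignore_unknown) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->ignore_unknown($ignore_unknown);    # set before parsing
    $tree->parse_content($html);
    my $count = () = $tree->findnodes('//widget');
    $tree->delete;
    return $count;
}

my $default_count = count_widgets(1);    # 1 mirrors the documented default
my $relaxed_count = count_widgets(0);    # keep unknown tags as elements

print "ignore_unknown(1): $default_count, ignore_unknown(0): $relaxed_count\n";
```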

    If the pages are broken then it's quite fair for HTML::TreeBuilder::XPath to fail to parse them. Instead you need a way to fix the page before parsing. Have you tried HTML::Valid?

Re: Seeking a more robust variant of HTML::TreeBuilder::XPath
by marto (Archbishop) on Jul 12, 2019 at 12:12 UTC

    Perhaps Mojo::DOM would help; it deals with some insane things, broken HTML, etc. Check it out. Super Search will find you many interesting uses.

    Update: feel free to provide some messed up example data, and what you're trying to parse out of it.
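    As a rough illustration of the Mojo::DOM suggestion, here is a small sketch that parses a made-up fragment with unclosed <li> and <p> tags and queries it with CSS selectors (Mojo::DOM uses these rather than XPath). The fragment and the selectors are invented for the example, not taken from the original poster's pages.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Made-up fragment with unclosed <li> and <p> tags, for illustration only.
my $broken = '<ul id="menu"><li class="item">One<li class="item">Two</ul>'
           . '<p id="note">Unclosed paragraph';

my $dom = Mojo::DOM->new($broken);

# find() returns a Mojo::Collection; map('text') calls ->text on each element.
my @items = $dom->find('ul#menu > li.item')->map('text')->each;

# at() returns the first match, or undef if nothing matches.
my $note = $dom->at('#note');

print "items: @items\n";
print "note: ", $note->text, "\n" if $note;
```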

Re: Seeking a more robust variant of HTML::TreeBuilder::XPath ( htmltreexpather , HTML::TreeBuilder::LibXML )
by Anonymous Monk on Jul 13, 2019 at 01:58 UTC
