PerlMonks  

Re: HTML::TreeBuilder::XPath not loading the complete $page

by tangent (Chaplain)
on Mar 19, 2013 at 19:46 UTC (#1024376)


in reply to HTML::TreeBuilder::XPath not loading the complete $page

Can you show us the HTML content? Is it possible that some of those 19 divs are nested?


Re^2: HTML::TreeBuilder::XPath not loading the complete $page
by Lord Gartlar (Initiate) on Mar 19, 2013 at 20:11 UTC

    I looked line by line at the output HTML file containing $response->content.

    It seems that sometimes there's an internal error printed inside another HTML tag...

    In other words, the $response->content looks like this:

    <html>
    <head> ... ... </head>
    <body>
    ... ... ...
    <DIV></DIV> # Div number 12
    <HTML>
    <HEAD> </HEAD>
    <BODY>
    <p>You have an error blah blah blah</p>
    </BODY>
    </HTML>
    <DIV></DIV> # Div number 13 and so on until number 19
    ... ...
    </body>
    </html>

    The problem is that it's not inside an iframe or anything like that, so the only way out that comes to mind is to use a while loop to look for such blocks and take them out, then regenerate the output and go on...

    Does TreeBuilder have an option to avoid these things? Google is not helping me.

      Are you sure there isn't something else going on? When I use your sample content it still works:
      use strict;
      use warnings;
      use HTML::TreeBuilder::XPath;

      # Sample content with a second HTML document embedded in the body
      my $content = q|
      <html>
      <head> </head>
      <body>
      <DIV>Div number 12</DIV>
      <HTML>
      <HEAD> </HEAD>
      <BODY>
      <p>You have an error blah blah blah</p>
      </BODY>
      </HTML>
      <DIV>Div number 13</DIV>
      </body>
      </html>
      |;

      my $tree = HTML::TreeBuilder::XPath->new;
      $tree->parse_content( $content );

      # Select only the divs that are direct children of <body>
      my @divs = $tree->findnodes( '/html/body/div' );
      for my $div (@divs) {
          print $div->as_HTML . "\n";
      }

      Output:

      <div>Div number 12</div>
      <div>Div number 13</div>

        There are some errors displayed there inside the new HTML...

        As mentioned, I tried deleting them with a simple while loop and it finally worked...
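        The while-loop cleanup described above might look something like the following. This is only a sketch under assumptions, not the code actually used in this thread: the helper name strip_nested_html is hypothetical, it assumes each stray error page is a complete <html>...</html> block sitting inside the outer document, and it relies on a plain regex rather than on any TreeBuilder feature.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): remove complete
# <html>...</html> documents that appear inside the outer document.
sub strip_nested_html {
    my ($content) = @_;

    # Find the outer document's opening <html> tag; if there is none,
    # return the content unchanged.
    return $content unless $content =~ /<html\b[^>]*>/i;
    my $start = $+[0];    # offset just past the outer opening tag

    my $head = substr $content, 0, $start;
    my $rest = substr $content, $start;

    # Keep deleting embedded documents until none are left. The match is
    # non-greedy, so each pass stops at the first closing </html> it meets.
    1 while $rest =~ s{<html\b[^>]*>.*?</html\s*>}{}is;

    return $head . $rest;
}

# Usage: clean the response body before handing it to the parser.
my $dirty = '<html><body><DIV>12</DIV><HTML><BODY><p>error</p></BODY></HTML>'
          . '<DIV>13</DIV></body></html>';
print strip_nested_html($dirty), "\n";
```

        The cleaned string can then be passed to $tree->parse_content as usual. Error blocks that are themselves nested inside one another would need a more careful approach than this.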

      Why don't you post HTML that proves your point? TreeBuilder has options such as ignore_unknown and implicit_tags ...
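      As a rough illustration of how those options are set: ignore_unknown and implicit_tags are HTML::TreeBuilder attributes, and they have to be set before any content is parsed. The helper name count_body_divs is hypothetical, and whether changing either option makes a difference for the page in question is untested; this only shows where the calls go.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: parse $content with explicit parser options and
# return how many <div> elements sit directly under <body>.
sub count_body_divs {
    my ($content) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;

    # Options must be set before parsing starts.
    $tree->ignore_unknown(0);    # keep unknown tags in the tree (default: ignore them)
    $tree->implicit_tags(1);     # let the parser infer missing tags (the default)

    $tree->parse_content($content);

    my @divs  = $tree->findnodes('/html/body/div');
    my $count = scalar @divs;

    $tree->delete;               # release the tree once we are done with it
    return $count;
}
```

      Comparing the count against the number of divs visible in the raw HTML is a quick way to check whether the parser is really dropping anything.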
