HTML::TreeBuilder::XPath not loading the complete $page

Lord Gartlar has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I am trying to parse a web page with HTML::TreeBuilder::XPath.

I fetch the URL using LWP::UserAgent, and the page body is then available as $response->content.

Here's the code that loads that variable to a $tree instance:

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($response->content);
my $t = $tree->findnodes(qq{/html/body/form/div});
print $t->size;

If I print $response->content to a plain HTML file and open it in Firefox, the total number of /html/body/form/div elements is 19.

However, printing $t->size gives only 12.

Why is this happening?

$tree is ignoring many of my divs, so I can't retrieve data from them.

Thanks!


Replies are listed 'Best First'.
Re: HTML::TreeBuilder::XPath not loading the complete $page by Corion (Patriarch) on Mar 19, 2013 at 19:58 UTC
My guess is that the page creates additional DIVs through JavaScript. Have you checked the page with JavaScript disabled in Firefox?
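One rough way to test this guess without a browser is to count the `<div` start tags in the raw text that LWP returned. This is only a sketch: `count_div_tags` is a hypothetical helper, the sample string is made up, and the regex counts every div at any depth (it would also match inside comments or scripts), so it is not equivalent to the XPath /html/body/form/div. If the raw text already contains all the divs that Firefox shows, JavaScript is probably not the cause.

```perl
use strict;
use warnings;

# Hypothetical helper: count <div ...> start tags in a raw HTML string.
sub count_div_tags {
    my ($html) = @_;
    # Assigning to the empty list forces list context, which yields the match count.
    my $count = () = $html =~ /<div\b/gi;
    return $count;
}

# Made-up sample standing in for the fetched page body.
my $sample = '<html><body><form><div>a</div><DIV>b</DIV></form></body></html>';
print count_div_tags($sample), "\n";
```

Comparing this number for the fetched content against what the browser's inspector reports gives a quick first signal before digging into the parser.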
Re: HTML::TreeBuilder::XPath not loading the complete $page by tangent (Parson) on Mar 19, 2013 at 19:46 UTC
Can you show us the HTML content? Is it possible that some of those 19 divs are nested?
Re^2: HTML::TreeBuilder::XPath not loading the complete $page by Lord Gartlar (Initiate) on Mar 19, 2013 at 20:11 UTC
I looked line by line at the output HTML file containing $response->content. It seems that sometimes an internal error page is printed inside another HTML tag. In other words, $response->content looks like this:

<html>
<head>
... ...
</head>
<body>
... ... ...
<DIV></DIV> # Div number 12
<HTML>
<HEAD>
</HEAD>
<BODY>
<p>You have an error blah blah blah</p>
</BODY>
</HTML>
<DIV></DIV> # Div number 13 and so on until number 19
... ...
</body>
</html>

The problem is that the error page is not inside an iframe or anything like that, so the only way out that comes to mind is to loop over the content looking for such embedded documents, strip them out, then regenerate the output and go on.

Does TreeBuilder have an option to deal with this? Google is not helping me.
Re^3: HTML::TreeBuilder::XPath not loading the complete $page by tangent (Parson) on Mar 19, 2013 at 21:00 UTC
Are you sure there isn't something else going on? When I use your sample content it still works:

use HTML::TreeBuilder::XPath;

my $content = q|
<html>
<head>
</head>
<body>
<DIV>Div number 12</DIV>
<HTML>
<HEAD>
</HEAD>
<BODY>
<p>You have an error blah blah blah</p>
</BODY>
</HTML>
<DIV>Div number 13</DIV>
</body>
</html>
|;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content( $content );
my @divs = $tree->findnodes( '/html/body/div' );
for my $div (@divs) {
    print $div->as_HTML . "\n";
}

Output:

<div>Div number 12</div>
<div>Div number 13</div>
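Note that this test calls parse_content, whereas the code in the original post calls parse. According to the HTML::TreeBuilder documentation, parse_content is essentially parse followed by eof, and eof has to be called after the last chunk is fed to parse and before the tree is used. A minimal sketch of the two equivalent forms follows, using a made-up $html string; the thread does not establish whether the missing eof explains the 12 versus 19 count.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up input; in the original post this would be $response->content.
my $html = '<html><body><form><div>1</div><div>2</div></form></body></html>';

# Form 1: feed the text with parse(), then signal end of input with eof().
my $tree_a = HTML::TreeBuilder::XPath->new;
$tree_a->parse($html);
$tree_a->eof;

# Form 2: parse_content() performs both steps in one call.
my $tree_b = HTML::TreeBuilder::XPath->new;
$tree_b->parse_content($html);

for my $tree ($tree_a, $tree_b) {
    # In list context findnodes returns the matching nodes themselves.
    my @divs = $tree->findnodes('/html/body/form/div');
    print scalar(@divs), "\n";
    $tree->delete;    # release the tree explicitly when done
}
```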
Re^4: HTML::TreeBuilder::XPath not loading the complete $page by Lord Gartlar (Initiate) on Mar 19, 2013 at 21:26 UTC
Re^3: HTML::TreeBuilder::XPath not loading the complete $page by Anonymous Monk on Mar 20, 2013 at 06:56 UTC
Why don't you post HTML that proves your point? TreeBuilder has options such as ignore_unknown and implicit_tags ...
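For reference, these options are set on the tree object before any content is parsed. The following is a sketch only, with a made-up $html string; the values shown are just examples (both options default to true), and nothing in this thread confirms that either setting changes the div count for the page in question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up input standing in for $response->content.
my $html = '<html><body><form><div>1</div><div>2</div></form></body></html>';

my $tree = HTML::TreeBuilder::XPath->new;

# Parser options have to be set before parsing starts.
$tree->ignore_unknown(0);   # keep unknown tags as elements instead of dropping them
$tree->implicit_tags(1);    # the default: let the parser supply implied html/head/body

$tree->parse_content($html);

# In scalar context findnodes returns a node set object with a size method.
my $nodes = $tree->findnodes('/html/body/form/div');
print $nodes->size, "\n";

$tree->delete;
```

Toggling one option at a time and re-running the same XPath query makes it easier to see which setting, if any, affects the result.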

