Re: can't extract node with HTML::TreeBuilder::XPath

Per spec, the xpath is wrong. Given the following HTML:

<table>
  <tr>
    <td>Foo</td>
  </tr>
</table>

The correct xpath to select the table cell is along the lines of //table/tbody/tr/td. Yes, there's an invisible <tbody> element in there! A standards-compliant HTML parser will always insert the <tbody> tag for you if you miss it out.
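If you would rather not depend on whether a given parser has added the <tbody>, one option is the descendant axis (//), which matches the cell either way. A minimal sketch, assuming HTML::TreeBuilder::XPath is installed; the variable names are only illustrative, and the second query is there so you can see for yourself what the browser-style path matches in this parser's tree:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $html = '<table><tr><td>Foo</td></tr></table>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);    # parse the whole string and finish the tree

# '//table//td' matches the cell whether or not a <tbody> sits in between.
my @texts = map { $_->as_text } $tree->findnodes('//table//td');
print "tolerant path: @texts\n";

# Count what the browser-style path finds in the tree this parser built.
my @strict = $tree->findnodes('//table/tbody/tr/td');
printf "browser-style path matched %d node(s)\n", scalar @strict;

$tree->delete;    # free the tree explicitly
```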

The /a/h3 part of the xpath is an interesting feature too. In HTML 4.x, <h3> is not a permitted child of <a>. What exactly to do when encountering such an element is undefined. Some parsers may close the <a> element early so that the <h3> ends up as a sibling of it rather than the child of it.

But under HTML 5 rules, <h3> is a permitted child of <a>. What does HTML::TreeBuilder do? Who knows!? HTML::TreeBuilder's documentation is pretty vague.
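Short of reading the source, you can ask the tree itself where the <h3> ended up. A rough sketch, assuming HTML::TreeBuilder::XPath is installed; the sample markup and variable names are made up for illustration, and I'm deliberately not claiming what the printed path will be:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content('<a href="#top"><h3>Heading</h3></a>');

# Locate the heading wherever the parser decided to put it.
my ($h3) = $tree->findnodes('//h3');
my $heading_text = $h3->as_text;

# Build the element path from the root down to the <h3>.
my $path = join '/', reverse( $h3->lineage_tag_names ), $h3->tag;
print "text: $heading_text\n";
print "path: /$path\n";

# Print an indented outline of the whole tree to STDOUT.
$tree->dump;

$tree->delete;
```

If the printed path contains a/h3, the parser kept the heading inside the link; if not, it moved it, and your xpath has to follow suit.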

This is exactly the sort of reason I maintain HTML::HTML5::Parser, which is a fork of a third-party non-CPAN HTML5 parser, ported to run on top of XML::LibXML.
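For comparison, here is roughly how the table example looks with HTML::HTML5::Parser. This is a sketch under a few assumptions: that HTML::HTML5::Parser and XML::LibXML are installed, and that the parser puts elements in the XHTML namespace, so the XPath needs a prefix (the prefix x is my own choice):

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;
use XML::LibXML;
use XML::LibXML::XPathContext;

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string('<table><tr><td>Foo</td></tr></table>');

# Elements live in the XHTML namespace, so register a prefix for XPath.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( x => 'http://www.w3.org/1999/xhtml' );

# The path includes the <tbody> that an HTML5 parser adds to the tree.
my @cells = map { $_->textContent }
    $xpc->findnodes('//x:table/x:tbody/x:tr/x:td');
print "cells: @cells\n";
```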

FWIW, this works for me...

use 5.010;
use PerlX::MethodCallWithBlock;
use Web::Magic -quotelike => 'web';

my @headings = web <http://docstore.mik.ua/orelly/perl4/cook/ch22_07.htm>
    -> assert_success
    -> querySelectorAll("h3")
    -> map { $_->textContent };

say $headings[0];

perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

Re^2: can't extract node with HTML::TreeBuilder::XPath by saunderson (Novice) on Jul 30, 2012 at 11:14 UTC
I was not aware that the HTML 4.x specs are so strict. I thought that Chrome shows me xpaths based on the HTML file as it is, and does not create something new that is not found in the data just to comply with the HTML standards. So my interim conclusion is that HTML::TreeBuilder doesn't care about any specs and just analyses the underlying HTML code. Which is straightforward for me, someone who doesn't care about any specs :) . But I got your point. Specs are essential to have a common basis, so an HTML parser with the specs in mind is always preferable. Thanks for your detailed explanation and for the pointer to HTML::HTML5::Parser.
Re^2: can't extract node with HTML::TreeBuilder::XPath by Anonymous Monk on Jul 30, 2012 at 03:47 UTC
"What does HTML::TreeBuilder do? Who knows!?"
I KNOW! It tells you to read the source, how awful :)
htmltreexpather.pl works rather well to spit out xpaths that TreeBuilder::XPath will like :)
Re^3: can't extract node with HTML::TreeBuilder::XPath by saunderson (Novice) on Jul 30, 2012 at 11:27 UTC
"What does HTML::TreeBuilder do? Who knows!? I KNOW! It tells you to read the source, how awful :)"
I second that. A specs-compatible HTML::TreeBuilder::XPath that works with the xpaths extracted with a common browser would definitely be a simplification....
Re^4: can't extract node with HTML::TreeBuilder::XPath by Anonymous Monk on Aug 01, 2012 at 03:34 UTC
"I second that. A specs-compatible HTML::TreeBuilder::XPath that works with the xpaths extracted with a common browser would definitely be a simplification...."
I was being sarcastic :) HTML::HTML5::Parser isn't documented much better than HTML::TreeBuilder -- you have to read the source just the same.
FYI, HTML::TreeBuilder::XPath just tacks an xpath-1 engine onto a TreeBuilder tree -- common browser addons commonly modify the DOM -- it's usually only @class and @id attributes you're interested in, not absolute paths.
htmltreexpather.pl works with the actual tree that HTML::TreeBuilder builds, no browser required :)
Re^5: can't extract node with HTML::TreeBuilder::XPath by tobyink (Canon) on Aug 01, 2012 at 06:35 UTC
Re^6: can't extract node with HTML::TreeBuilder::XPath by Anonymous Monk on Aug 01, 2012 at 07:15 UTC

