Re: can't extract node with HTML::TreeBuilder::XPath

by tobyink (Canon)
on Jul 29, 2012 at 19:54 UTC ( [id://984312]=note )


in reply to can't extract node with HTML::TreeBuilder::XPath

Per spec, the xpath is wrong. Given the following HTML:

<table>
  <tr>
    <td>Foo</td>
  </tr>
</table>

The correct xpath to select the table cell is along the lines of //table/tbody/tr/td. Yes, there's an invisible <tbody> element in there! A standards-compliant HTML parser will always insert the <tbody> element for you if you leave it out.
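To see how a particular parser treats the missing <tbody>, a minimal sketch along these lines may help. It assumes HTML::TreeBuilder::XPath is installed, and it makes no claim about which of the first two paths will match. The third path uses the descendant axis, so it should find the cell whether or not a <tbody> was inserted.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $html = '<table> <tr> <td>Foo</td> </tr> </table>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Try the path with and without the <tbody> step, plus a descendant-axis
# path that does not depend on whether the parser inserted <tbody>.
for my $xpath ('//table/tr/td', '//table/tbody/tr/td', '//table//td') {
    my @matches = $tree->findnodes($xpath);
    printf "%-22s matched %d node(s)\n", $xpath, scalar @matches;
}

# The descendant-axis form is the safe choice when the tree shape is unknown.
my ($cell) = $tree->findnodes('//table//td');
my $text   = $cell->as_text;
print "cell text: $text\n";

$tree->delete;
```

Running this against the parser you actually use shows directly which form of the path it expects.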

The /a/h3 part of the xpath is an interesting feature too. In HTML 4.x, <h3> is not a permitted child of <a>. What exactly to do when encountering such an element is undefined. Some parsers may close the <a> element early so that the <h3> ends up as a sibling of it rather than a child of it.

But under HTML 5 rules, <h3> is a permitted child of <a>. What does HTML::TreeBuilder do? Who knows!? HTML::TreeBuilder's documentation is pretty vague.
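One way to find out without reading the parser source is to ask for the tree it built. The following is a rough sketch, assuming HTML::TreeBuilder is installed; the HTML fragment is made up for illustration, and the result may differ between module versions.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical fragment with a heading nested inside a link.
my $html = '<a href="#x"><h3>Title</h3></a>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# Print an indented outline of the tree the parser actually built.
$tree->dump;

# Report which element ended up as the parent of the <h3>.
my $h3         = $tree->look_down(_tag => 'h3');
my $h3_text    = $h3->as_text;
my $parent_tag = $h3->parent->tag;
print "h3 parent: <$parent_tag>\n";

$tree->delete;
```

If the reported parent is not "a", then an xpath containing /a/h3 cannot match in that tree, whatever the browser shows.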

This is exactly the sort of reason I maintain HTML::HTML5::Parser, which is a fork of a third-party non-CPAN HTML5 parser, ported to run on top of XML::LibXML.
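For comparison, here is a hedged sketch of the same table parsed with HTML::HTML5::Parser and queried through XML::LibXML. It assumes both modules are installed and that the resulting elements sit in the XHTML namespace, which is why a prefix (the arbitrary "x") is registered before querying.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;
use XML::LibXML;

my $html = '<table> <tr> <td>Foo</td> </tr> </table>';

# parse_string returns an XML::LibXML::Document.
my $doc = HTML::HTML5::Parser->new->parse_string($html);

# Assumption: elements are in the XHTML namespace, so XPath 1.0 needs a
# registered prefix to address them.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(x => 'http://www.w3.org/1999/xhtml');

# Under HTML5 parsing rules the <tbody> step is expected to be present.
my @cells = $xpc->findnodes('//x:table/x:tbody/x:tr/x:td');
print scalar(@cells), " cell(s) matched\n";
print $_->textContent, "\n" for @cells;
```

Forgetting the namespace prefix is a common reason for an otherwise correct xpath to match nothing against this kind of DOM.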

FWIW, this works for me...

use 5.010;
use PerlX::MethodCallWithBlock;
use Web::Magic -quotelike => 'web';

my @headings = web <http://docstore.mik.ua/orelly/perl4/cook/ch22_07.htm>
    -> assert_success
    -> querySelectorAll("h3")
    -> map { $_->textContent };

say $headings[0];

perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

Replies are listed 'Best First'.
Re^2: can't extract node with HTML::TreeBuilder::XPath
by saunderson (Novice) on Jul 30, 2012 at 11:14 UTC

    I was not aware that the HTML 4.x specs are so strict. I thought that Chrome shows me XPaths based on the HTML file as it is, and does not add something that is not in the data just to be compliant with the HTML standards.

    So my interim conclusion is that HTML::TreeBuilder doesn't care about any specs and just analyses the underlying HTML code. That is straightforward for me, someone who doesn't care about any specs :) . But I got your point. Specs are essential to have a common basis, so an HTML parser with the specs in mind is always preferable.

    Thanks for your detailed explanation and for the pointer to HTML::HTML5::Parser.

Re^2: can't extract node with HTML::TreeBuilder::XPath
by Anonymous Monk on Jul 30, 2012 at 03:47 UTC

    What does HTML::TreeBuilder do? Who knows!?

    I KNOW! It tells you to read the source, how awful :)

    htmltreexpather.pl works rather well to spit out xpaths that TreeBuilder::XPath will like :)

      What does HTML::TreeBuilder do? Who knows!?

      I KNOW! It tells you to read the source, how awful :)

      I second that. A spec-compliant HTML::TreeBuilder::XPath that works with the xpaths extracted with a common browser would definitely be a simplification....

        I second that. A spec-compliant HTML::TreeBuilder::XPath that works with the xpaths extracted with a common browser would definitely be a simplification....

        I was being sarcastic :) HTML::HTML5::Parser isn't documented much better than HTML::TreeBuilder -- you have to read the source just the same

        FYI, HTML::TreeBuilder::XPath just tacks an XPath 1.0 engine onto a TreeBuilder tree -- common browser addons commonly modify the DOM -- it's usually only @class and @id attributes you're interested in, not absolute paths

        htmltreexpather.pl works with the actual tree that HTML::TreeBuilder builds, no browser required :)
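The advice above about anchoring on @class and @id rather than absolute paths can be sketched as follows. This assumes HTML::TreeBuilder::XPath is installed; the page content, the id "content", and the class "title" are all hypothetical and exist only to illustrate the idea.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical page; the id and class names are made up for illustration.
my $html = <<'HTML';
<html><body>
  <div id="nav"><h3>Menu</h3></div>
  <div id="content">
    <div class="entry"><h3 class="title">First heading</h3></div>
    <div class="entry"><h3 class="title">Second heading</h3></div>
  </div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Anchor on attributes instead of spelling out every step from the root,
# so implied elements and browser-side DOM changes do not matter.
my @titles = map { $_->as_text }
    $tree->findnodes('//div[@id="content"]//h3[@class="title"]');

print "$_\n" for @titles;

$tree->delete;
```

A path like this keeps working when intermediate wrappers are added or removed, which is where absolute paths copied from a browser tend to break.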
