
•Re: HTML::Tree(Builder) in 6 minutes

by merlyn (Sage)
on Aug 03, 2003 at 20:58 UTC (#280510)


in reply to HTML::Tree(Builder) in 6 minutes

Also consider XML::LibXML, which, despite its name, can be coaxed into reading HTML, and then provides DOM and XPath interfaces into your HTML tree. It's also far faster than HTML::Tree, because it keeps the tree in C space and converts to Perl scalars only when necessary.
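
As a rough illustration of that approach, here is a minimal sketch that feeds an HTML string to XML::LibXML and queries it with XPath. The sample markup and variable names are invented for the example, and the `recover` option is an assumption about how lenient you want the parser to be; this is not the program from the column.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample page, used only for illustration.
my $html = <<'HTML';
<html><head><title>Example</title></head>
<body>
<p>See <a href="http://example.com/one">one</a> and
<a href="http://example.com/two">two</a>.</p>
</body></html>
HTML

# load_html uses libxml2's HTML parser instead of the XML parser;
# recover => 1 asks it to keep going when it meets malformed markup.
my $doc = XML::LibXML->load_html( string => $html, recover => 1 );

# XPath queries run against the resulting DOM.
my $title = $doc->findvalue('//title');
my @hrefs = map { $_->value } $doc->findnodes('//a/@href');

print "Title: $title\n";
print "Link:  $_\n" for @hrefs;
```

For a page fetched from the web, the same calls should apply once the content is in a string; how well `recover` copes will depend on how broken the markup is.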

I wrote a column about using it to extract data from a web page.

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.


Re: HTML::Tree(Builder) in 6 minutes
by Anonymous Monk on Nov 30, 2004 at 19:16 UTC
    XML::LibXML is very fast, but it can barely parse 1% of the web pages one finds on the Internet, because it expects overly strict HTML. That's why the 8-line Perl program at the end of your column doesn't work. HTML::TreeBuilder is very slow and provides neither DOM nor XPath. I think there is nothing in Perl that can parse real web pages while being fast and giving access to DOM or XPath. fred

      A little late to the party... but for future reference, HTML::TreeBuilder::XPath gives you XPath on an HTML::Tree object.

      And I agree that XML::LibXML is not great at dealing with "real" HTML.
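
For reference, a minimal sketch of the HTML::TreeBuilder::XPath route mentioned above. The tag-soup input (unclosed `<p>` and `<li>` elements) is made up for the example, and the variable names are hypothetical; the point is only that XPath queries run directly on the HTML::Tree object.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up tag soup: the <p> and <li> elements are never closed.
my $html = '<html><body><p>Some links<ul>'
         . '<li><a href="/one">one</a>'
         . '<li><a href="/two">two</a>'
         . '</ul></body></html>';

# HTML::TreeBuilder::XPath is an HTML::TreeBuilder subclass,
# so parsing works the same way as for a plain HTML::Tree.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# findvalues returns the string value of each matching node.
my @hrefs = $tree->findvalues('//a/@href');
print "$_\n" for @hrefs;

# HTML::Tree objects should be freed explicitly when done.
$tree->delete;
```

This trades speed for leniency: the tree lives in Perl space, so it will likely be slower than XML::LibXML on large pages.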
