PerlMonks  

HTML::Tree(Builder) in 6 minutes

by trs80 (Priest)
on Aug 03, 2003 at 17:18 UTC

There are numerous posts regarding parsing HTML, and many seem to skip over HTML::Tree(Builder), due in part to its name, I believe. This is a lightning-fast intro to HTML::Tree and what it can (and can't) do for you.

The "tree" is a way to represent the structure of a document written in a semi-structured markup language such as HTML. A tree's validity is directly related to the quality of the HTML; that is, bad markup will get you a bad tree. The parser can overcome some issues, but there are several it cannot. So if you have a problem with the results, validate the source HTML before you curse HTML::Tree.

HTML::TreeBuilder, the class that actually builds the tree, inherits from a couple of other modules (HTML::Parser and HTML::Element), most notably HTML::Element. As HTML::Tree parses your content it converts each of the tags into HTML::Element objects. So when you work with an individual tag you are working with an HTML::Element object stored in your tree. Read the docs for HTML::Element if you really want to find the strength of HTML::Tree.

This sample script uses LWP to retrieve the content of a page to build our "tree" from. You can also read the content from a file; see the docs for more info.
use strict;
use HTML::Tree;
use LWP::Simple;

my $funky   = "http://www.google.com";
my $content = get($funky);
# get() returns undef on failure, so check before parsing
die "Could not fetch $funky\n" unless defined $content;

my $tree = HTML::Tree->new();
$tree->parse($content);
$tree->eof();    # tell the parser the document is complete

print $tree->as_text;
The as_text method is inherited from the HTML::Element module. There is an as_HTML method as well. These methods, when used on the entire tree, simply walk down the tree and expand each HTML::Element object into either the text it contains (as_text) or the HTML code it represents (as_HTML).
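For a quick experiment that does not depend on the network, the same two methods can be tried against a literal string. This is only a minimal sketch, assuming HTML::TreeBuilder is installed; the markup and variable names are made up for illustration. It uses parse_content, which is shorthand for calling parse followed by eof.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up markup, used only for illustration.
my $html = '<html><head><title>Demo</title></head>'
         . '<body><p>Hello <b>world</b></p></body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->parse_content($html);    # parse() followed by eof()

my $text   = $tree->as_text;    # text content only, tags stripped
my $markup = $tree->as_HTML;    # HTML regenerated from the tree

print "$text\n";
print "$markup\n";

$tree->delete;                  # release the tree when finished
```

Note that as_HTML regenerates markup from the tree, so the output is not guaranteed to match the input byte for byte.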

Let's do another quick run-through to show how we get what we want (a single tag in this case) out of the page.
use strict;
use HTML::Tree;
use LWP::Simple;

my $funky   = "http://www.google.com";
my $content = get($funky);
die "Could not fetch $funky\n" unless defined $content;

my $tree = HTML::Tree->new();
$tree->parse($content);
$tree->eof();    # tell the parser the document is complete

# list assignment: take the first element that matches
my ($title) = $tree->look_down( '_tag', 'title' );
print $title->as_text, "\n";
print $title->as_HTML, "\n";
The '_tag' tells the look_down method which 'key' to look at, and 'title' is the value that 'key' should have. Instead of 'title' it could be 'a' for anchor, 'img' for image, etc. If you want to capture all of the tags of a particular type on the page, simply use an array instead of a scalar to collect them, such as:
my @a_tags = $tree->look_down( '_tag' , 'a' );
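The look_down method also accepts more than one criterion, including a code reference that is called with the element as its first argument. The following is a hedged sketch rather than anything from the original post: the markup is made up, and it keeps only the anchors that actually carry an href attribute.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up markup, used only for illustration.
my $html = '<html><body>'
         . '<a href="/one">One</a> <a name="anchor">no href</a> '
         . '<a href="/two">Two</a>'
         . '</body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->parse_content($html);

# In list context look_down returns every matching element.
my @a_tags = $tree->look_down( '_tag', 'a' );

# Criteria are combined: the tag must be 'a' AND the code ref must return true.
my @links = $tree->look_down(
    '_tag', 'a',
    sub { defined $_[0]->attr('href') },
);

# attr() reads an attribute from an HTML::Element object.
my @hrefs = map { $_->attr('href') } @links;
print scalar(@a_tags), " anchors, hrefs: @hrefs\n";

$tree->delete;
```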
Beyond this intro I recommend the documentation and the article the author of HTML::Tree has in The Perl Journal.

One last caveat: use HTML::Tree if you want to parse HTML, not create it. If you want to create HTML, use CGI or HTML::Element (or another module) by itself.
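As a rough illustration of using HTML::Element by itself to create markup, the sketch below builds a small list by hand. The tag names, class name, and list items are invented for the example. The empty hash passed as the third argument to as_HTML asks it not to omit any optional end tags such as the closing li.

```perl
use strict;
use warnings;
use HTML::Element;

# Build <ul class="links"> with two <li> children (names are made up).
my $list = HTML::Element->new( 'ul', class => 'links' );
for my $name (qw(first second)) {
    my $item = HTML::Element->new('li');
    $item->push_content($name);     # add a text child
    $list->push_content($item);     # add the <li> to the <ul>
}

# undef, undef: default entity encoding, no indentation.
# {}: do not omit any optional end tags.
my $out = $list->as_HTML( undef, undef, {} );
print "$out\n";

$list->delete;
```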

I hope you enjoy HTML::Tree.

UPDATE: added readmore tags

Re: HTML::Tree(Builder) in 6 minutes
by jeffa (Chancellor) on Aug 03, 2003 at 17:52 UTC
    Excellent, but ...

    "... if you want to create HTML use CGI or HTML::Element (or other) ..."

    *cough* HTML::Template *hermph*
    *ahem* Template *cough*

    Sorry, but someone had to mention them. ;)

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      Do those create HTML directly or do they rely on other modules to create the HTML tag itself? If you want to do a large scale application then by all means look into HTML::Template, and Template, but they (c|w)ould be overkill for a quick and simple one time "thing" I feel.
        They actually do neither ... they are templating modules and have no responsibility of producing valid HTML - that's up to the HTML coder. As for being overkill, well ... the more you use these tools, the quicker you get at coding with them. You can see an example that i am proud of over at 4Re: How do I extract text from an HTML page? that uses HTML::Template. The template is stored inside DATA - creating a new H::T object that uses the DATA filehandle is a snap:
        my $template = HTML::Template->new(filehandle => \*DATA);
        For the Template-Toolkit quick and simple scripts, check out Inline::TT, it's slow as hell, but when you combine it with Class::DBI you get some amazing results. I am nearly finished with my C::D mini-tut that will demonstrate using C::D with multiple tables, but here is a snippet just to show you the power of the Class::DBI and Template combo. (and by the way, i learned most of this from How to Avoid Writing Code and the poop-group mailing list)

        jeffa

        L-LL-L--L-LL-L--L-LL-L--
        -R--R-RR-R--R-RR-R--R-RR
        B--B--B--B--B--B--B--B--
        H---H---H---H---H---H---
        (the triplet paradiddle with high-hat)
        
•Re: HTML::Tree(Builder) in 6 minutes
by merlyn (Sage) on Aug 03, 2003 at 20:58 UTC
      XML::LibXML is very fast, but it can barely parse 1% of the web pages one can find on the Internet because it expects HTML that is too strict. That's why the 8-line Perl program at the end of your column doesn't work. Tree::Builder is very slow and provides neither DOM nor XPath. I think that there is nothing in Perl that can parse real web pages while being fast and giving access to DOM or XPath. fred

        A little late to the party... but for future reference, HTML::TreeBuilder::XPath gives you XPath on an HTML::Tree object.

        And I agree with XML::LibXML not being great at dealing with "real" HTML.
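For readers who want to try the XPath route mentioned above, here is a minimal sketch, assuming HTML::TreeBuilder::XPath is installed from CPAN. The markup and the XPath expressions are made up for illustration; findvalue and findnodes are the methods that module adds on top of the usual tree.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up markup, used only for illustration.
my $html = '<html><head><title>Demo</title></head>'
         . '<body><a href="/one">One</a><a href="/two">Two</a></body></html>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# findvalue returns the text value of whatever the expression matches.
my $title = $tree->findvalue('/html/head/title');

# findnodes returns the matching elements; the usual HTML::Element
# methods such as attr() still work on them.
my @hrefs = map { $_->attr('href') } $tree->findnodes('//a[@href]');

print "$title\n";
print "$_\n" for @hrefs;

$tree->delete;
```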

Re: HTML::Tree(Builder) in 6 minutes
by ido50 (Scribe) on Aug 04, 2003 at 12:28 UTC
    Thank you very much for the intro, I think I got a little idea from it (And I'll get back here with it if it works out well).

    -------------------------
    Live fat, die young
Re: HTML::Tree(Builder) in 6 minutes
by princepawn (Parson) on Aug 04, 2003 at 17:53 UTC
    If you want to see HTML::TreeBuilder in action, download and read the source code to HTML::Seamstress.

    Carter's compass: I know I'm on the right track when by deleting something, I'm adding functionality

Re: HTML::Tree(Builder) in 6 minutes
by Kanishka.black0 (Beadle) on Nov 07, 2009 at 00:15 UTC
    Thanks for the Tuit.... This definitely helps beginners like me ....
Re: HTML::Tree(Builder) in 6 minutes
by szabgab (Priest) on May 29, 2012 at 20:09 UTC
    The article from the Perl Journal can now be found here and here
