Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
There are numerous posts regarding parsing HTML and many seem to skip over HTML::Tree(Builder), due in part to its name I believe. This is a lightening fast intro to HTML::Tree and what it can (and can't) do for you.

The "tree" is a way to represent the flow of data in a semi structured markup language such as HTML. A trees validity is directly related to the quality of the HTML, that is bad markup will get you a bad tree. It can overcome some issues, but there are several it can not. So if you have a problem with the results, validate the source HTML before you curse HTML::Tree.

HTML::Tree inherits from a couple of other modules, most notably HTML::Element. As HTML::Tree parses your content it converts each of the tags into HTML::Element objects. So when you work with an individual tag you are working with an HTML::Element object stored in your tree. Read the docs for HTML::Element if you really want to find the strength of HTML::Tree.

This sample script uses LWP to retrieve the content of a page to build our "tree" from. You can also call in content from a file, see docs for more info.
use strict; use HTML::Tree; use LWP::Simple; my $funky = "http://www.google.com"; my $content = get($funky); my $tree = HTML::Tree->new(); $tree->parse($content); print $tree->as_text;
The as_text method is inherited from the HTML::Element module. There is an as_HTML method as well. These methods, when used on the entire tree, simple walk down the tree and expand each HTML::Element object into either the text it contains (as_text) or the HTML code it represents (as_HTML).

Lets do another quick run through to show how we get what we want (a single tag in this case) out of the page.
use strict; use HTML::Tree; use LWP::Simple; my $funky = "http://www.google.com"; my $content = get($funky); my $tree = HTML::Tree->new(); $tree->parse($content); my ($title) = $tree->look_down( '_tag' , 'title' ); print $title->as_text , "\n"; print $title->as_HTML , "\n";
The '_tag' tells HTML::Tree's look_down method what 'key' to look at and the title is the value that 'key' should have. Title could be 'a' for anchor or 'img' for image, etc. If you want to capture all of a particular tags for the page you would simple use an array instead of a scalar to collect them, such as:
my @a_tags = $tree->look_down( '_tag' , 'a' );
Beyond this intro I recommend the documentation and the article the author of HTML::Tree has in The Perl Journal.

One last caveat, use HTML::Tree if you want to parse HTML not create it, if you want to create HTML use CGI or HTML::Element (or other) by itself.

I hope you enjoy HTML::Tree.

UPDATE: added readmore tags

In reply to HTML::Tree(Builder) in 6 minutes by trs80

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others avoiding work at the Monastery: (7)
    As of 2014-07-31 07:34 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      My favorite superfluous repetitious redundant duplicative phrase is:









      Results (245 votes), past polls