There are numerous posts about parsing HTML, and many of them skip over HTML::Tree(Builder), due in part to its name, I believe. This is a lightning-fast intro to HTML::Tree and what it can (and can't) do for you.
The "tree" is a way to represent the structure of data in a semi-structured markup language such as HTML. A tree's validity is directly related to the quality of the HTML: bad markup will get you a bad tree. The parser can overcome some problems, but there are several it cannot. So if you have a problem with the results, validate the source HTML before you curse HTML::Tree.
HTML::TreeBuilder, the class you actually instantiate, inherits from a couple of other modules (HTML::Parser and, most notably, HTML::Element). As it parses your content, it converts each tag into an HTML::Element object. So when you work with an individual tag, you are working with an HTML::Element object stored in your tree. Read the docs for HTML::Element if you really want to find the strength of HTML::Tree.
This sample script uses LWP to retrieve the content of a page to build our "tree" from. You can also build the tree from a file with parse_file; see the docs for more info.
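Here is a minimal sketch of such a script, not necessarily the one originally posted. It assumes LWP::Simple and HTML::TreeBuilder are installed. The URL comes from the command line; the inline fallback page is made up purely so the script also runs without a network connection.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TreeBuilder;

# Optional first argument: the page to fetch.
my $url = shift @ARGV;

# Made-up fallback content, used only when no URL is given.
my $html = '<html><head><title>Sample</title></head>'
         . '<body><p>Hello, tree.</p></body></html>';

if (defined $url) {
    $html = get($url);    # returns undef on failure
    die "Could not retrieve $url\n" unless defined $html;
}

# Build the tree: feed the content to the parser, then signal end of input.
my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# Dump just the text of the whole page.
my $text = $tree->as_text;
print $text, "\n";

# Trees contain circular references, so free them explicitly.
$tree->delete;
```

Run it with a URL as the only argument to fetch a real page, or with no arguments to parse the built-in sample.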
The as_text method is inherited from the HTML::Element module. There is an as_HTML method as well. These methods, when used on the entire tree, simply walk down the tree and expand each HTML::Element object into either the text it contains (as_text) or the HTML code it represents (as_HTML).
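To see the difference between the two methods side by side, here is a small sketch that parses a made-up inline page (the markup is only an illustration) instead of fetching one:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample page for illustration.
my $html = '<html><head><title>Test Page</title></head>'
         . '<body><p>Hello <b>world</b></p></body></html>';

# new_from_content parses the string and finishes the tree in one step.
my $tree = HTML::TreeBuilder->new_from_content($html);

my $text   = $tree->as_text;    # only the text, tags dropped
my $markup = $tree->as_HTML;    # markup regenerated from the tree

print "as_text: $text\n";
print "as_HTML: $markup\n";

$tree->delete;
```

Note that as_HTML regenerates the markup from the tree, so its output is not guaranteed to match the source byte for byte.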
Let's do another quick run-through to show how we get what we want (a single tag, in this case) out of the page.
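A sketch of that run-through, again on a made-up inline page so it runs without a network connection, pulling out just the title tag with look_down:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample page for illustration.
my $html = '<html><head><title>Test Page</title></head>'
         . '<body><p>Some body text.</p></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first matching element,
# or undef if nothing matches.
my $title = $tree->look_down('_tag', 'title');

my $title_text = $title ? $title->as_text : '';
print "Title: $title_text\n";

$tree->delete;
```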
The '_tag' tells HTML::Tree's look_down method which 'key' to look at, and 'title' is the value that 'key' should have. Instead of 'title' it could be 'a' for anchor, 'img' for image, etc. In scalar context look_down returns only the first match. If you want to capture all tags of a particular kind on the page, you would simply use an array instead of a scalar to collect them, such as:
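The original snippet is not reproduced here, so this is a stand-in sketch. The sample page and its example.com links are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample page with two links.
my $html = '<html><head><title>Links</title></head><body>'
         . '<a href="http://example.com/one">One</a> '
         . '<a href="http://example.com/two">Two</a>'
         . '</body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# In list context look_down returns every matching element.
my @links = $tree->look_down('_tag', 'a');

# Each match is an HTML::Element, so attr() reads its attributes.
my @hrefs = map { $_->attr('href') } @links;

print scalar(@links), " links found\n";
print "$_\n" for @hrefs;

$tree->delete;
```

The matches come back in document order, so @hrefs lists the links in the order they appear on the page.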
One last caveat: use HTML::Tree if you want to parse HTML, not create it. If you want to create HTML, use CGI or HTML::Element (or another module) by itself.
I hope you enjoy HTML::Tree.
UPDATE: added readmore tags