I've been happily using this module for a few months. If you dislike code that (ab)uses regular expressions to parse HTML, this module could be what you're looking for!

TreeBuilder uses HTML::Parser under the hood, and at the moment is fairly tightly coupled to HTML::Element, since it builds a tree of those objects if the parse is successful. (The author spoke recently on the libwww mailing list about making the module capable of building a tree of, say, subclassed HTML::Elements.)

The killer feature of this module is that it tries to parse HTML as a browser would, rather than treating all input as perfectly compliant documents, which most of them are not! This is extremely useful. I have not seen an HTML parser for any other language that does anything like this.
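To make the browser-like parsing concrete, here is a minimal sketch that feeds the parser a deliberately sloppy fragment. The input string and variable names are made up for illustration; only new, parse, eof, look_down, as_text and delete are taken from the module.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sloppy input: no <html> or <body>, unclosed <p> and <b>.
my $messy = '<p>first paragraph<p>second with <b>bold text';

my $tree = HTML::TreeBuilder->new;
$tree->parse($messy);
$tree->eof;    # signal end of input so the tree is finalised

# Each <p> should end up as its own element, even though none was closed.
my @paras   = $tree->look_down(_tag => 'p');
my $n_paras = scalar @paras;

# The parser supplies the implicit <body> that the input left out.
my $body     = $tree->look_down(_tag => 'body');
my $has_body = $body ? 1 : 0;

# The unclosed <b> still becomes a proper element with its text inside.
my $bold      = $tree->look_down(_tag => 'b');
my $bold_text = $bold ? $bold->as_text : '';

print "paragraphs: $n_paras, body added: $has_body, bold: $bold_text\n";

$tree->delete;    # free the tree explicitly when done
```

The point of the sketch is only that the missing structure is filled in for you; the exact tree you get for nastier input depends on the module version.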

Even though you'll use HTML::TreeBuilder, most of the functionality you'll want is in HTML::Element. The look_down() method is very useful: called on an Element, it searches down the tree for Elements that match a list of criteria. A criterion can be a code reference (other forms are supported too); Elements that pass the sub are returned in list context, and in scalar context only the first such Element is returned. Since look_down (and its sister, look_up, among many others) returns Elements, it's easy to search on successively more specific criteria for just what you want, and the code (written correctly) will keep working even if the HTML changes. I've used this pretty successfully to deduce the form contents required to fake an HTTPS login to HotMail. I'd post it here, but there is too much LWP clutter in the way of what should be presented to show how this module shines.
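As a rough illustration of successive look_down calls, the sketch below collects the fields of a login form. The HTML, the field names and the /login action are invented for the example and are not the HotMail code mentioned above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical page; the form and its fields are made up for illustration.
my $html = <<'HTML';
<html><body>
<form action="/login" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="user">
  <input type="password" name="pass">
</form>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# Scalar context: the first <form> whose action matches the regex.
my $form = $tree->look_down(_tag => 'form', action => qr/login/);

# List context, searching only below that form: every <input> for which
# the code reference returns true (here: it has a name attribute).
my @inputs = $form->look_down(
    _tag => 'input',
    sub { defined $_[0]->attr('name') },
);

# Copy what we need into a plain hash before the tree is destroyed.
my %fields;
for my $input (@inputs) {
    my $value = $input->attr('value');
    $fields{ $input->attr('name') } = defined $value ? $value : '';
}
my $n_inputs = scalar @inputs;

print "$_=$fields{$_}\n" for sort keys %fields;

$tree->delete;
```

Narrowing from the whole tree to the form and then to its inputs is what keeps code like this working when unrelated parts of the page change.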

The module also provides Tree cloning, cutting, and splicing functionality, much like you'd expect from a Document Object Model in other languages (or even Perl!). TreeBuilder objects can be converted to and from HTML and XML Element trees using the HTML::DOMbo module, by the same author. (I haven't used this functionality myself...yet.)
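A small sketch of the cloning, cutting and splicing just described, using the HTML::Element methods clone, detach and splice_content. The list markup and variable names are assumptions chosen for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup, kept on one line so no whitespace nodes get in the way.
my $tree = HTML::TreeBuilder->new;
$tree->parse('<ul id="menu"><li>one</li><li>two</li><li>three</li></ul>');
$tree->eof;

my $list  = $tree->look_down(_tag => 'ul');
my @items = $list->look_down(_tag => 'li');

# Clone: a deep copy that is independent of the original element.
my $copy = $items[0]->clone;

# Cut: detach the second item from its parent; it becomes an orphan.
$items[1]->detach;
$items[1]->delete;    # the orphan is no longer needed

# Splice: insert the copy at position 1 of the list's content,
# removing nothing (same argument order as Perl's own splice).
$list->splice_content(1, 0, $copy);

# Collect the resulting item texts as plain strings.
my @texts = map { $_->as_text } $list->look_down(_tag => 'li');
print join(', ', @texts), "\n";

$tree->delete;
```

With this input the list should end up holding the items one, one, three, in that order.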

There are a few slight downsides to the module: at the moment it can't be usefully subclassed (a very minor problem); it's probably not as fast as searching your HTML with a regex; and it may not even be as fast as "grepping" through parsed HTML via HTML::Parser directly. However, I had to work with it quite extensively before I found any of these things even slightly problematic.

The author, Sean M. Burke <sburke@spinn.net>, maintains the code well, and is ready to answer questions on the LWP mailing list.

An excellent module that anyone dealing with HTML should become familiar with.


In reply to HTML::TreeBuilder by Nooks

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others wandering the Monastery: (9)
    As of 2014-04-16 23:10 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      April first is:







      Results (436 votes), past polls