Scraping HTML: orthodoxy and realityby grinder (Bishop)
|on Jul 08, 2003 at 06:51 UTC||Need Help??|
Every so often, some asks a question about extracting HTML. They might have a piece of text like...
... and they wonder why they get such strange results when they try to attack it with something like /<p>(.*)<\/p>/. The standard Perlmonks reply is to point people to HTML::Parser, HTML::TableExtract or even HTML::TreeBuilder, depending on the context. The main thrust of the argument is usually that regexps lead to fragile code. This is what I term The Correct Answer, but alas, In Real Life, things are never so simple, as a recent experience just showed me. You can, with minor care and effort, get perfect results with regexps, with much better performance.
I used HTML::Parser in the past to build the Perlmonk Snippets Index -- the main reason is that I wanted to walk down all the pages, in case an old node was reaped. (I think there's still a dup or two in there). From that I learnt that it's a real bear to ferry information from one callback to another. I hacked it by using global variables to keep track of state. Later on, someone else told me that The Right Way to use H::P is to subclass it, and extend the internal hash object to track your state that way. Fair enough, but this approach, while theoretically correct, is a non-trivial undertaking for a casual user who just wants to chop up some HTML.
More recently I used HTML::TreeBuilder to parse some HTML output from a webified Domino database. Because of the way the HTML was structured in this particular case, it was a snap to just look_down('_tag', 'foo') and get exactly what I wanted. It was a easy to write, and the code was straightforward.
Then last week, I got tired of keeping an eye on our farm of HP 4600 colour printers, to see how their supplies were lasting (they use four cartridges, C, M, Y & K, and two kits, the transfer and fuser). It turns out that this model has a embedded web server (I could have also done it with SNMP, but that's another story). Point your browser at it, and it will produce a status page that shows you how many pages can be printed out on what's left in the consumables.
So I brought HTML::TreeBuilder to bear on the task. It wasn't quite as easy. It was no simple matter to find a reliable part in the tree from whence to direct my search. The HTML contains deeply nested tables, with a high degree of repetition for each kit and cartridge. The various pieces of information in scattered in different elements and collecting and collating it made for some pretty ugly code.
After I'd managed to wrestle the data I wanted out of the web page, I set about stepping through the code in the debugger, to better understand the data structures and see what shortcuts I could by way of method chaining and array slicing in an attempt to tidy up the code. To my suprise, I saw that just building the H::TB object (by calling the parse() with the HTML in a scalar) required about a second to execute, and this, on some fairly recent high-end hardware.
Until then I wasn't really concerned about performance, because I figured the parse time would be dwarfed by the time it took to get the request's results back from the network. In the master plan, however, I intend to use LWP::ParallelAgent to probe all of the printers in parallel, rather than looping though them one at a time, and factor out much of the waiting. In a perfect world it would be as fast as the single slowest printer.
Given the less than stellar performance of the code at this point, however, it was clear that the cost of parsing the HTML would consume the bulk of the overall run-time. Maybe I might be able to traverse a partially-fetched page, but at this point the architecture would start to become unwieldy. Madness.
So, after having tried the orthodox approach, I started again and broke the rule about parsing HTML with regexps and wrote the following:
A single regexp (albeit with a /g modifier) pulls out all I want. Actually it's not quite perfect, since it the resulting array also fills up with a pile of undefs, the unfilled parens on the opposite side of the | alternation to the match. But even so there's a pattern to the offsets into the array where the defined values are found, so a few magic numbers wrapped up as constants tidy that up.
Is the code fragile? Not really. For a start, the HTML has errors in it, such as <td valign= op"> (although H::TB coped just fine with this too).
The generated HTML is stored in the printer's onboard firmware, so unless I upgrade the BIOS, the HTML isn't going to change; it's written in stone, bugs and all. Here's the main point: when the HP 4650 or 4700 model is released, it will probably have completely different HTML anyway, perhaps with style sheets instead of tables. Either way, the HTML will have to be inspected anew, in order to tweak the regexp, or to pull something else out of TreeBuilder's parse tree.
Thus neither approach is maintenance free. But the regexp far less code, and 17 times faster. Now the extraction cost is negligeable compared to the page fetch, as it should be. And as a final bonus the regexp approach requires no non-core modules. Case closed.
I intend to make this a module available on CPAN, so about the best I can do will to be to put a BUGS section in the POD, aaking for contributions of HTML page dumps from newer models, and I'll patch it to take account of the variations.
Now all I need is a suggestion of how to name the module...