in reply to Re: Scraping HTML: orthodoxy and reality
in thread Scraping HTML: orthodoxy and reality

Well, having never used it, I'd be very interested in seeing how you'd do this with HTML::TableExtract. Here's an example page:

There are 6 printers today, and we'll probably be adding another 4 or so in the future.

As a general rule I really don't care about performance, but this is a rare case where I have to do something about it. The reason being is that I want to be able to call this from mod_perl, so every tenth of a second is vital (in terms of human perception noticing lag in loading/rendering a page). It's for a small population of users (5 or so), and mod_perl is reverse proxied through lightweight Apache processes, so I'm not worried about machine resources.

I can't do anything about the time the printer takes to respond, but I do need the extraction to be as fast as possible to make up lost ground. There is always Plan B, which would be to cache the results via cron once or twice an hour; it's not as if the users drain one cartridge per day. I already do this for other status pages where the information is very expensive to calculate. People know the data aren't always fresh up to the minute but they can deal with that (especially since I always label the age of the information being presented).

I'll be very interested in seeing what you come up with. And if someone wants to show what a sub-classed HTML::Parser solution looks like, I think we'd have a really good real-life tutorial.

update: here's the proof-of-concept code as it stands today, as a yardstick to go by. The end result is a hash of hashes, C M Y and K are the colour cartridges and X and F are the transfer and fuser kits, respectively. These will mutate into something like HP::4600::Kit and HP::4600::Cartridge.

This code implements jeffa's observation of grepping the array for definedness, which indeed simplifies the problem considerably. Thanks jeffa!

#! /usr/bin/perl -w use strict; use LWP::UserAgent; my @cartridge = qw/ name part percent remain coverage low serial print +ed /; my @kit = qw/ name part percent remain /; for my $host( @ARGV ) { my $url = qq{http://$host/hp/device/this.LCDispatcher?dispatch=htm +l&cat=0&pos=2}; my $response = LWP::UserAgent->new->request( HTTP::Request->new( G +ET => $url )); if( !$response->is_success ) { warn "$host: couldn't get $url: ", $response->status_line, "\n +"; next; } $_ = $response->content; my (@s) = grep { defined $_ } m{ (?: > # closing tag ([^<]+) # text (name of part, e.g. q/BLACK CARTRIDGE/) <br> ([^<]+) # part number (e.g. q/HP Part Number: HP C97 +24A/) </font>\s+</td>\s*<td[^>]+><font[^>]+> (\d+) # percent remaining ) | (?: (?: (?: Pages\sRemaining # different text values | Low\sReached | Serial\sNumber | Pages\sprinted\swith\sthis\ssupply ) : \s*</font></p>\s*</td>\s*<td[^>]*>\s*<p[^>]*><font +[^>]*>\s* # separated by this | Based\son\shistorical\s\S+\spage\scoverage\sof\s # or +just this, within a <td> ) (\w+) # and the value we want ) }gx; my %res; @{$res{K}}{@cartridge} = @s[ 0.. 7]; @{$res{X}}{@kit} = @s[ 8..11]; @{$res{C}}{@cartridge} = @s[12..19]; @{$res{F}}{@kit} = @s[20..23]; @{$res{M}}{@cartridge} = @s[24..31]; @{$res{Y}}{@cartridge} = @s[32..39]; print <<END_STATS; $host Xfer $res{X}{percent}% $res{X}{remain} Fuse $res{F}{percent}% $res{F}{remain} C $res{C}{percent}% cover=$res{C}{coverage}% left=$res{C}{remain} +printed=$res{C}{printed} M $res{M}{percent}% cover=$res{M}{coverage}% left=$res{M}{remain} +printed=$res{M}{printed} Y $res{Y}{percent}% cover=$res{Y}{coverage}% left=$res{Y}{remain} +printed=$res{Y}{printed} K $res{K}{percent}% cover=$res{K}{coverage}% left=$res{K}{remain} +printed=$res{K}{printed} END_STATS }

Come to YAPC::Europe 2003 in Paris, 23-25 July 2003.

Replies are listed 'Best First'.
Re: Re:x2 Scraping HTML: orthodoxy and reality
by mojotoad (Monsignor) on Jul 08, 2003 at 15:54 UTC
    Here's a quick example, just to give you an idea. I apologize for the crufty code.

    This solution is still vulnerable to layout changes from the printer manufacturer. I really don't like having to use depth and count with HTML::TableExtract because of this reason -- if the HTML tables had some nice, labeled columns it would be another story entirely. With that in mind you may well be better off with your solution in the long run, though I daresay the regexp solution might be more difficult to maintain.

    HTML::TableExtract is a subclass of HTML::Parser, in case you were unaware.

    I'm pretty sure HTML::Parser slows things down compared to your solution, but I'm curious to what degree.


    #!/usr/bin/perl -w use strict; my $depth = 0; my $count = 0; my $ddepth = 3; use LWP::Simple; my $html = get(''); my %Device; use Data::Dumper; use HTML::TableExtract; my $te = HTML::TableExtract->new; $te->parse($html); foreach my $ts ($te->table_states) { &process_detail($ts) if ($ts->depth == $ddepth); &process_main($ts) if ($ts->depth == $depth && $ts->count == $coun +t); } # Clean up the empty spots @{$Device{stats}} = grep(defined, @{$Device{stats}}); print Dumper(\%Device); exit; sub process_main { my $ts = shift; my($host, $model) = _scrub(($ts->rows)[1]); $Device{host} = $host; $Device{model} = $model; } sub process_detail { $_[0]->count % 2 ? _proc_detail_stats(@_) : _proc_detail_name(@_); } sub _proc_detail_name { my $ts = shift; my($name, $part, $pct) = _scrub(($ts->rows)[0]); $part =~ s/.*:\s+//; $Device{stats}[$ts->count] = { name => $name, part => $part, pct => $pct }; } sub _proc_detail_stats { my $ts = shift; my @stats = map(_scrub($_), $ts->rows); my $i = $ts->count - 1; @{$Device{stats}[$i]}{qw(pages_left hist low serial_num pages_printe +d)} = (map(_scrub($_), $ts->rows))[1,2,4,6,8]; } sub _scrub { grep(!/^\s*$/s, map(split(/(^M|\n)+/,$_), @{shift()})); }