Well, having never used it, I'd be very interested in seeing how you'd do this with HTML::TableExtract. Here's an example page: http://grinder.perlmonk.org/hp4600/.

There are 6 printers today, and we'll probably be adding another 4 or so in the future.

As a general rule I really don't care about performance, but this is a rare case where I have to do something about it. The reason being is that I want to be able to call this from mod_perl, so every tenth of a second is vital (in terms of human perception noticing lag in loading/rendering a page). It's for a small population of users (5 or so), and mod_perl is reverse proxied through lightweight Apache processes, so I'm not worried about machine resources.

I can't do anything about the time the printer takes to respond, but I do need the extraction to be as fast as possible to make up lost ground. There is always Plan B, which would be to cache the results via cron once or twice an hour; it's not as if the users drain one cartridge per day. I already do this for other status pages where the information is very expensive to calculate. People know the data aren't always fresh up to the minute but they can deal with that (especially since I always label the age of the information being presented).

I'll be very interested in seeing what you come up with. And if someone wants to show what a sub-classed HTML::Parser solution looks like, I think we'd have a really good real-life tutorial.

update: here's the proof-of-concept code as it stands today, as a yardstick to go by. The end result is a hash of hashes, C M Y and K are the colour cartridges and X and F are the transfer and fuser kits, respectively. These will mutate into something like HP::4600::Kit and HP::4600::Cartridge.

This code implements jeffa's observation of grepping the array for definedness, which indeed simplifies the problem considerably. Thanks jeffa!

#! /usr/bin/perl -w use strict; use LWP::UserAgent; my @cartridge = qw/ name part percent remain coverage low serial print +ed /; my @kit = qw/ name part percent remain /; for my $host( @ARGV ) { my $url = qq{http://$host/hp/device/this.LCDispatcher?dispatch=htm +l&cat=0&pos=2}; my $response = LWP::UserAgent->new->request( HTTP::Request->new( G +ET => $url )); if( !$response->is_success ) { warn "$host: couldn't get $url: ", $response->status_line, "\n +"; next; } $_ = $response->content; my (@s) = grep { defined $_ } m{ (?: > # closing tag ([^<]+) # text (name of part, e.g. q/BLACK CARTRIDGE/) <br> ([^<]+) # part number (e.g. q/HP Part Number: HP C97 +24A/) </font>\s+</td>\s*<td[^>]+><font[^>]+> (\d+) # percent remaining ) | (?: (?: (?: Pages\sRemaining # different text values | Low\sReached | Serial\sNumber | Pages\sprinted\swith\sthis\ssupply ) : \s*</font></p>\s*</td>\s*<td[^>]*>\s*<p[^>]*><font +[^>]*>\s* # separated by this | Based\son\shistorical\s\S+\spage\scoverage\sof\s # or +just this, within a <td> ) (\w+) # and the value we want ) }gx; my %res; @{$res{K}}{@cartridge} = @s[ 0.. 7]; @{$res{X}}{@kit} = @s[ 8..11]; @{$res{C}}{@cartridge} = @s[12..19]; @{$res{F}}{@kit} = @s[20..23]; @{$res{M}}{@cartridge} = @s[24..31]; @{$res{Y}}{@cartridge} = @s[32..39]; print <<END_STATS; $host Xfer $res{X}{percent}% $res{X}{remain} Fuse $res{F}{percent}% $res{F}{remain} C $res{C}{percent}% cover=$res{C}{coverage}% left=$res{C}{remain} +printed=$res{C}{printed} M $res{M}{percent}% cover=$res{M}{coverage}% left=$res{M}{remain} +printed=$res{M}{printed} Y $res{Y}{percent}% cover=$res{Y}{coverage}% left=$res{Y}{remain} +printed=$res{Y}{printed} K $res{K}{percent}% cover=$res{K}{coverage}% left=$res{K}{remain} +printed=$res{K}{printed} END_STATS }

Come to YAPC::Europe 2003 in Paris, 23-25 July 2003.

In reply to Re:x2 Scraping HTML: orthodoxy and reality by grinder
in thread Scraping HTML: orthodoxy and reality by grinder

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.