Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

Well, having never used it, I'd be very interested in seeing how you'd do this with HTML::TableExtract. Here's an example page: http://grinder.perlmonk.org/hp4600/.

There are 6 printers today, and we'll probably be adding another 4 or so in the future.

As a general rule I really don't care about performance, but this is a rare case where I have to do something about it. The reason being is that I want to be able to call this from mod_perl, so every tenth of a second is vital (in terms of human perception noticing lag in loading/rendering a page). It's for a small population of users (5 or so), and mod_perl is reverse proxied through lightweight Apache processes, so I'm not worried about machine resources.

I can't do anything about the time the printer takes to respond, but I do need the extraction to be as fast as possible to make up lost ground. There is always Plan B, which would be to cache the results via cron once or twice an hour; it's not as if the users drain one cartridge per day. I already do this for other status pages where the information is very expensive to calculate. People know the data aren't always fresh up to the minute but they can deal with that (especially since I always label the age of the information being presented).

I'll be very interested in seeing what you come up with. And if someone wants to show what a sub-classed HTML::Parser solution looks like, I think we'd have a really good real-life tutorial.

update: here's the proof-of-concept code as it stands today, as a yardstick to go by. The end result is a hash of hashes, C M Y and K are the colour cartridges and X and F are the transfer and fuser kits, respectively. These will mutate into something like HP::4600::Kit and HP::4600::Cartridge.

This code implements jeffa's observation of grepping the array for definedness, which indeed simplifies the problem considerably. Thanks jeffa!

#! /usr/bin/perl -w use strict; use LWP::UserAgent; my @cartridge = qw/ name part percent remain coverage low serial print +ed /; my @kit = qw/ name part percent remain /; for my $host( @ARGV ) { my $url = qq{http://$host/hp/device/this.LCDispatcher?dispatch=htm +l&cat=0&pos=2}; my $response = LWP::UserAgent->new->request( HTTP::Request->new( G +ET => $url )); if( !$response->is_success ) { warn "$host: couldn't get $url: ", $response->status_line, "\n +"; next; } $_ = $response->content; my (@s) = grep { defined $_ } m{ (?: > # closing tag ([^<]+) # text (name of part, e.g. q/BLACK CARTRIDGE/) <br> ([^<]+) # part number (e.g. q/HP Part Number: HP C97 +24A/) </font>\s+</td>\s*<td[^>]+><font[^>]+> (\d+) # percent remaining ) | (?: (?: (?: Pages\sRemaining # different text values | Low\sReached | Serial\sNumber | Pages\sprinted\swith\sthis\ssupply ) : \s*</font></p>\s*</td>\s*<td[^>]*>\s*<p[^>]*><font +[^>]*>\s* # separated by this | Based\son\shistorical\s\S+\spage\scoverage\sof\s # or +just this, within a <td> ) (\w+) # and the value we want ) }gx; my %res; @{$res{K}}{@cartridge} = @s[ 0.. 7]; @{$res{X}}{@kit} = @s[ 8..11]; @{$res{C}}{@cartridge} = @s[12..19]; @{$res{F}}{@kit} = @s[20..23]; @{$res{M}}{@cartridge} = @s[24..31]; @{$res{Y}}{@cartridge} = @s[32..39]; print <<END_STATS; $host Xfer $res{X}{percent}% $res{X}{remain} Fuse $res{F}{percent}% $res{F}{remain} C $res{C}{percent}% cover=$res{C}{coverage}% left=$res{C}{remain} +printed=$res{C}{printed} M $res{M}{percent}% cover=$res{M}{coverage}% left=$res{M}{remain} +printed=$res{M}{printed} Y $res{Y}{percent}% cover=$res{Y}{coverage}% left=$res{Y}{remain} +printed=$res{Y}{printed} K $res{K}{percent}% cover=$res{K}{coverage}% left=$res{K}{remain} +printed=$res{K}{printed} END_STATS }

_____________________________________________
Come to YAPC::Europe 2003 in Paris, 23-25 July 2003.


In reply to Re:x2 Scraping HTML: orthodoxy and reality by grinder
in thread Scraping HTML: orthodoxy and reality by grinder

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others rifling through the Monastery: (4)
    As of 2014-09-15 03:34 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      My favorite cookbook is:










      Results (145 votes), past polls