Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
Here's a quick example, just to give you an idea. I apologize for the crufty code.

This solution is still vulnerable to layout changes from the printer manufacturer. I really don't like having to use depth and count with HTML::TableExtract because of this reason -- if the HTML tables had some nice, labeled columns it would be another story entirely. With that in mind you may well be better off with your solution in the long run, though I daresay the regexp solution might be more difficult to maintain.

HTML::TableExtract is a subclass of HTML::Parser, in case you were unaware.

I'm pretty sure HTML::Parser slows things down compared to your solution, but I'm curious to what degree.

Enjoy,
Matt

#!/usr/bin/perl -w use strict; my $depth = 0; my $count = 0; my $ddepth = 3; use LWP::Simple; my $html = get('http://grinder.perlmonk.org/hp4600/'); my %Device; use Data::Dumper; use HTML::TableExtract; my $te = HTML::TableExtract->new; $te->parse($html); foreach my $ts ($te->table_states) { &process_detail($ts) if ($ts->depth == $ddepth); &process_main($ts) if ($ts->depth == $depth && $ts->count == $coun +t); } # Clean up the empty spots @{$Device{stats}} = grep(defined, @{$Device{stats}}); print Dumper(\%Device); exit; sub process_main { my $ts = shift; my($host, $model) = _scrub(($ts->rows)[1]); $Device{host} = $host; $Device{model} = $model; } sub process_detail { $_[0]->count % 2 ? _proc_detail_stats(@_) : _proc_detail_name(@_); } sub _proc_detail_name { my $ts = shift; my($name, $part, $pct) = _scrub(($ts->rows)[0]); $part =~ s/.*:\s+//; $Device{stats}[$ts->count] = { name => $name, part => $part, pct => $pct }; } sub _proc_detail_stats { my $ts = shift; my @stats = map(_scrub($_), $ts->rows); my $i = $ts->count - 1; @{$Device{stats}[$i]}{qw(pages_left hist low serial_num pages_printe +d)} = (map(_scrub($_), $ts->rows))[1,2,4,6,8]; } sub _scrub { grep(!/^\s*$/s, map(split(/(^M|\n)+/,$_), @{shift()})); }

In reply to Re: Re:x2 Scraping HTML: orthodoxy and reality by mojotoad
in thread Scraping HTML: orthodoxy and reality by grinder

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others browsing the Monastery: (7)
    As of 2014-10-26 07:17 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      For retirement, I am banking on:










      Results (152 votes), past polls