Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot

Re: Re:x2 Scraping HTML: orthodoxy and reality

by mojotoad (Monsignor)
on Jul 08, 2003 at 19:54 UTC ( #272426=note: print w/ replies, xml ) Need Help??

in reply to Re:x2 Scraping HTML: orthodoxy and reality
in thread Scraping HTML: orthodoxy and reality

Here's a quick example, just to give you an idea. I apologize for the crufty code.

This solution is still vulnerable to layout changes from the printer manufacturer. I really don't like having to use depth and count with HTML::TableExtract because of this reason -- if the HTML tables had some nice, labeled columns it would be another story entirely. With that in mind you may well be better off with your solution in the long run, though I daresay the regexp solution might be more difficult to maintain.

HTML::TableExtract is a subclass of HTML::Parser, in case you were unaware.

I'm pretty sure HTML::Parser slows things down compared to your solution, but I'm curious to what degree.


#!/usr/bin/perl -w use strict; my $depth = 0; my $count = 0; my $ddepth = 3; use LWP::Simple; my $html = get(''); my %Device; use Data::Dumper; use HTML::TableExtract; my $te = HTML::TableExtract->new; $te->parse($html); foreach my $ts ($te->table_states) { &process_detail($ts) if ($ts->depth == $ddepth); &process_main($ts) if ($ts->depth == $depth && $ts->count == $coun +t); } # Clean up the empty spots @{$Device{stats}} = grep(defined, @{$Device{stats}}); print Dumper(\%Device); exit; sub process_main { my $ts = shift; my($host, $model) = _scrub(($ts->rows)[1]); $Device{host} = $host; $Device{model} = $model; } sub process_detail { $_[0]->count % 2 ? _proc_detail_stats(@_) : _proc_detail_name(@_); } sub _proc_detail_name { my $ts = shift; my($name, $part, $pct) = _scrub(($ts->rows)[0]); $part =~ s/.*:\s+//; $Device{stats}[$ts->count] = { name => $name, part => $part, pct => $pct }; } sub _proc_detail_stats { my $ts = shift; my @stats = map(_scrub($_), $ts->rows); my $i = $ts->count - 1; @{$Device{stats}[$i]}{qw(pages_left hist low serial_num pages_printe +d)} = (map(_scrub($_), $ts->rows))[1,2,4,6,8]; } sub _scrub { grep(!/^\s*$/s, map(split(/(^M|\n)+/,$_), @{shift()})); }

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://272426]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (6)
As of 2016-07-24 21:08 GMT
Find Nodes?
    Voting Booth?
    What is your favorite alternate name for a (specific) keyboard key?

    Results (221 votes). Check out past polls.