
Re: Re:x2 Scraping HTML: orthodoxy and reality

by mojotoad (Monsignor)
on Jul 08, 2003 at 19:54 UTC


in reply to Re:x2 Scraping HTML: orthodoxy and reality
in thread Scraping HTML: orthodoxy and reality

Here's a quick example, just to give you an idea. I apologize for the crufty code.

This solution is still vulnerable to layout changes from the printer manufacturer. I really don't like having to use depth and count with HTML::TableExtract for exactly this reason; if the HTML tables had some nice, labeled columns it would be another story entirely. With that in mind, you may well be better off with your solution in the long run, though I daresay the regexp solution might be more difficult to maintain.
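To make the "labeled columns" point concrete, here is a minimal sketch of header-based extraction. The markup, the column names, the part codes, and the extract_supplies helper are all hypothetical; the real printer page has no header row like this, which is why the script in this post falls back on depth and count.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical markup for illustration only; not the real printer page.
my $html = <<'HTML';
<table>
<tr><th>Supply</th><th>Part</th><th>Remaining</th></tr>
<tr><td>Black</td><td>PART-K</td><td>41%</td></tr>
<tr><td>Cyan</td><td>PART-C</td><td>87%</td></tr>
</table>
HTML

# Hypothetical helper: pick columns by header text instead of by position.
sub extract_supplies {
    my ($page) = @_;
    my $te = HTML::TableExtract->new(headers => [qw(Supply Remaining)]);
    $te->parse($page);
    $te->eof;
    my @supplies;
    # table_states is the method name used in this post; newer releases
    # also offer it under the name tables.
    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {
            # Cells come back in the order the headers were requested.
            my ($name, $pct) = map { defined $_ ? $_ : '' } @$row;
            s/^\s+|\s+$//g for $name, $pct;
            push @supplies, { name => $name, pct => $pct };
        }
    }
    return @supplies;
}

printf "%s: %s\n", $_->{name}, $_->{pct} for extract_supplies($html);
```

With headers, the extraction keeps working if the manufacturer wraps the table in another layer of layout or reorders the columns, as long as the header text survives.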

HTML::TableExtract is a subclass of HTML::Parser, in case you were unaware.

I'm pretty sure HTML::Parser slows things down compared to your solution, but I'm curious to what degree.
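One way to put a number on it would be the core Benchmark module. The following is only a sketch under stated assumptions: the sample page is made up, and the regex is a stand-in written for that sample, not your actual code. The sub names by_regex and by_table_extract are likewise hypothetical.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use HTML::TableExtract;

# Hypothetical sample page: one simple table with 200 identical rows.
my $row  = "<tr><td>Black</td><td>41%</td></tr>\n";
my $html = "<html><body><table>\n" . ($row x 200) . "</table></body></html>\n";

# Regex approach: a stand-in pattern that only fits the sample above.
sub by_regex {
    my @pairs;
    while ($html =~ m{<tr><td>([^<]*)</td><td>([^<]*)</td></tr>}g) {
        push @pairs, [$1, $2];
    }
    return \@pairs;
}

# Parser approach: take every row of the first table at depth 0.
sub by_table_extract {
    my $te = HTML::TableExtract->new(depth => 0, count => 0);
    $te->parse($html);
    $te->eof;
    my ($ts) = $te->table_states;
    return [ $ts->rows ];
}

# Run each approach for at least one CPU second and print a comparison table.
cmpthese(-1, {
    regex        => \&by_regex,
    tableextract => \&by_table_extract,
});
```

The resulting rates depend heavily on the page and the pattern, so treat any figure from a toy page like this one as a rough indication only.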

Enjoy,
Matt

#!/usr/bin/perl -w

use strict;
use LWP::Simple;
use Data::Dumper;
use HTML::TableExtract;

# Depth/count coordinates of the tables of interest
my $depth  = 0;   # main table (host and model)
my $count  = 0;
my $ddepth = 3;   # detail tables (name/stats pairs)

my $html = get('http://grinder.perlmonk.org/hp4600/');
die "Could not fetch page\n" unless defined $html;

my %Device;

my $te = HTML::TableExtract->new;
$te->parse($html);
foreach my $ts ($te->table_states) {
    process_detail($ts) if $ts->depth == $ddepth;
    process_main($ts)   if $ts->depth == $depth && $ts->count == $count;
}

# Clean up the empty spots
@{$Device{stats}} = grep(defined, @{$Device{stats}});

print Dumper(\%Device);

exit;

sub process_main {
    my $ts = shift;
    my($host, $model) = _scrub(($ts->rows)[1]);
    $Device{host}  = $host;
    $Device{model} = $model;
}

# Detail tables alternate: even counts hold the name, odd counts the stats
sub process_detail {
    $_[0]->count % 2 ? _proc_detail_stats(@_) : _proc_detail_name(@_);
}

sub _proc_detail_name {
    my $ts = shift;
    my($name, $part, $pct) = _scrub(($ts->rows)[0]);
    $part =~ s/.*:\s+//;
    $Device{stats}[$ts->count] = { name => $name, part => $part, pct => $pct };
}

sub _proc_detail_stats {
    my $ts = shift;
    my @stats = map(_scrub($_), $ts->rows);
    my $i = $ts->count - 1;   # slot of the preceding name table
    @{$Device{stats}[$i]}{qw(pages_left hist low serial_num pages_printed)}
        = @stats[1,2,4,6,8];
}

# Split each cell on line breaks and drop whitespace-only fragments
sub _scrub {
    grep(!/^\s*$/s, map(split(/(\r|\n)+/, $_), @{shift()}));
}

