Re: Re:x2 Scraping HTML: orthodoxy and reality

by mojotoad (Monsignor)
on Jul 08, 2003 at 19:54 UTC (#272426)

in reply to Re:x2 Scraping HTML: orthodoxy and reality
in thread Scraping HTML: orthodoxy and reality

Here's a quick example, just to give you an idea. I apologize for the crufty code.

This solution is still vulnerable to layout changes from the printer manufacturer. That is exactly why I really don't like having to use depth and count with HTML::TableExtract: both refer to where a table sits on the page, so any reshuffling of the layout breaks them. If the HTML tables had some nice, labeled columns it would be another story entirely. With that in mind, you may well be better off with your solution in the long run, though I daresay the regexp solution might be more difficult to maintain.
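For contrast, here is a minimal sketch of what the "labeled columns" case could look like. The page, the column labels (Supply, Part, Remaining) and the part numbers are all invented for illustration; the real printer page has no such headers, which is the whole problem. With the headers option, HTML::TableExtract picks tables by their column labels rather than by position:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical status page. The labels and values are made up; they are
# not taken from any real printer.
my $html = <<'HTML';
<html><body>
<table><tr><td>Printer Status</td></tr></table>
<table>
  <tr><th>Supply</th><th>Part</th><th>Remaining</th></tr>
  <tr><td>Black Toner</td><td>TN-001</td><td>42%</td></tr>
  <tr><td>Fuser Kit</td><td>FK-002</td><td>87%</td></tr>
</table>
</body></html>
HTML

# Select tables by column label instead of by depth and count.
my $te = HTML::TableExtract->new(headers => ['Supply', 'Part', 'Remaining']);
$te->parse($html);

my @supplies;
for my $ts ($te->table_states) {
    for my $row ($ts->rows) {
        my ($name, $part, $pct) = @$row;
        push @supplies, { name => $name, part => $part, pct => $pct };
    }
}

printf "%-12s %-8s %s\n", @{$_}{qw(name part pct)} for @supplies;
```

Adding or removing layout tables around the labeled one should not affect this version, since nothing in it refers to the table's position.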

HTML::TableExtract is a subclass of HTML::Parser, in case you were unaware.

I'm pretty sure HTML::Parser slows things down compared to your solution, but I'm curious to what degree.
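One way to put a number on that would be the core Benchmark module. The sketch below is only a rough harness under stated assumptions: the table is synthetic, the regex is a stand-in I made up (not the regexp solution from the parent node), and the helper names via_regex and via_te are hypothetical. Timings will depend heavily on the real page and the real regex.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use HTML::TableExtract;

# Synthetic 200-row table; an assumption, not the printer's actual page.
my $html = join '',
    '<table><tr><th>Name</th><th>Pct</th></tr>',
    (map { "<tr><td>item$_</td><td>$_%</td></tr>" } 1 .. 200),
    '</table>';

# Stand-in regex extraction (hypothetical, for comparison only).
sub via_regex {
    my @rows;
    while ($html =~ m{<tr><td>(.*?)</td><td>(.*?)</td></tr>}g) {
        push @rows, [$1, $2];
    }
    return scalar @rows;
}

# Parser-based extraction through HTML::TableExtract.
sub via_te {
    my $te = HTML::TableExtract->new(headers => ['Name', 'Pct']);
    $te->parse($html);
    my @rows = map { $_->rows } $te->table_states;
    return scalar @rows;
}

# Run each approach for at least one CPU second and print a comparison.
cmpthese(-1, {
    regex        => \&via_regex,
    tableextract => \&via_te,
});
```

Both subs return the number of data rows they found, which makes it easy to check that they agree before trusting the timings.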


#!/usr/bin/perl -w

use strict;

use LWP::Simple;
use Data::Dumper;
use HTML::TableExtract;

# Position of the main table (host and model) and the nesting depth of
# the per-supply detail tables. These depend on the page layout.
my $depth  = 0;
my $count  = 0;
my $ddepth = 3;

my $html = get('');
die "Could not fetch the status page\n" unless defined $html;

my %Device;

my $te = HTML::TableExtract->new;
$te->parse($html);
foreach my $ts ($te->table_states) {
  process_detail($ts) if $ts->depth == $ddepth;
  process_main($ts)   if $ts->depth == $depth && $ts->count == $count;
}

# Clean up the empty spots
@{$Device{stats}} = grep(defined, @{$Device{stats}});

print Dumper(\%Device);

exit;

sub process_main {
  my $ts = shift;
  my($host, $model) = _scrub(($ts->rows)[1]);
  $Device{host}  = $host;
  $Device{model} = $model;
}

sub process_detail {
  # Detail tables alternate: even counts hold the name, odd counts the stats.
  $_[0]->count % 2 ? _proc_detail_stats(@_) : _proc_detail_name(@_);
}

sub _proc_detail_name {
  my $ts = shift;
  my($name, $part, $pct) = _scrub(($ts->rows)[0]);
  $part =~ s/.*:\s+//;
  $Device{stats}[$ts->count] = { name => $name, part => $part, pct => $pct };
}

sub _proc_detail_stats {
  my $ts = shift;
  my @stats = map(_scrub($_), $ts->rows);
  # Merge into the hash stored by the preceding name table.
  my $i = $ts->count - 1;
  @{$Device{stats}[$i]}{qw(pages_left hist low serial_num pages_printed)}
    = @stats[1,2,4,6,8];
}

sub _scrub {
  # Split each cell on line breaks and drop the blank pieces.
  grep(!/^\s*$/s, map(split(/(\r|\n)+/, $_), @{shift()}));
}