PerlMonks  

WWW::Mechanize::Polite ?

by jeffa (Bishop)
on Feb 22, 2004 at 02:58 UTC


in reply to Re: Re: Using URI::URL
in thread Using URI::URL

Kudos to you for wanting polite bots. The problem with getting LWP::RobotUA to play nice with WWW::Mechanize is that they are both subclasses of LWP::UserAgent. By itself, WWW::Mechanize does not consult the /robots.txt file, but you can check it yourself with WWW::RobotRules. Here is a working example that tries to grab two files from my server:
    use strict;
    use warnings;
    use WWW::Mechanize;
    use WWW::RobotRules;
    use LWP::Simple;

    my $SITE = 'http://www.unlocalhost.com';

    my $rules = WWW::RobotRules->new('bot/1.0');
    my $robot_url = "$SITE/robots.txt";
    my $robot_data = LWP::Simple::get($robot_url);
    $rules->parse($robot_url, $robot_data) if $robot_data;

    for ('disallow.txt', 'allow.txt') {
        my $url = "$SITE/$_";
        if ($rules->allowed($url)) {
            my $mech = WWW::Mechanize->new;
            $mech->get($url);
            print "$url:\n", $mech->content;
        }
        else {
            print "$url:\ndenied\n";
        }
    }
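If you want to see what WWW::RobotRules decides without hitting the network, you can hand parse() a string instead of a fetched file. This is only a minimal sketch: the robots.txt body below is made up for the demo (I am assuming the server disallows /disallow.txt for every agent), and verdict() is a hypothetical helper of mine, not part of the module.

```perl
use strict;
use warnings;
use WWW::RobotRules;

my $SITE = 'http://www.unlocalhost.com';

# Assumption: an invented robots.txt body, standing in for whatever
# the real server would return.
my $robots_txt = <<'END_ROBOTS';
User-agent: *
Disallow: /disallow.txt
END_ROBOTS

my $rules = WWW::RobotRules->new('bot/1.0');

# parse() takes the URL the rules came from plus the file content;
# the URL tells the object which host the rules belong to.
$rules->parse("$SITE/robots.txt", $robots_txt);

# Hypothetical helper: turn the true/false answer from allowed()
# into a printable word.
sub verdict {
    my ($url) = @_;
    return $rules->allowed($url) ? 'allowed' : 'denied';
}

print "$_: ", verdict("$SITE/$_"), "\n" for 'disallow.txt', 'allow.txt';
```

One thing worth checking against your installed version: as far as I know, allowed() answers with a true value for a host whose robots.txt was never parsed, so skipping the parse step means every URL gets through.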
There might be a better way though ... ahh, how about "WWW::Mechanize::Polite"?
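    package WWW::Mechanize::Polite;
    use strict;
    use warnings;
    use base 'WWW::Mechanize';
    use WWW::RobotRules;

    # Build a normal Mechanize object, then attach a rules object
    # keyed on this bot's own User-Agent string.
    sub new {
        my $self = shift->SUPER::new(@_);
        $self->{robo_rules} = WWW::RobotRules->new($self->agent());
        return $self;
    }

    # Fetch a robots.txt URL and feed its content to the rules object.
    sub parse_robots {
        my ($self, $url) = @_;
        $self->get($url);
        $self->{robo_rules}->parse($url, $self->content);
    }

    # Fetch $url only if the parsed rules allow it; otherwise wipe the
    # previous page so the caller sees no content.
    sub polite_get {
        my ($self, $url) = @_;
        if ($self->{robo_rules}->allowed($url)) {
            $self->get($url);
        }
        else {
            undef $self->{content};
        }
    }

    1;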
And a client script:
    use WWW::Mechanize::Polite;

    my $SITE = 'http://www.unlocalhost.com';

    my $mech = WWW::Mechanize::Polite->new;
    $mech->parse_robots("$SITE/robots.txt");

    for ('allow.txt', 'disallow.txt') {
        my $url = "$SITE/$_";
        $mech->polite_get($url);
        print "$url:\n", $mech->content ? $mech->content : "denied\n";
    }
And if I didn't just reinvent a wheel, you might be seeing this on the CPAN. ;)

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

Replies are listed 'Best First'.
Re: WWW::Mechanize::Polite ?
by mkurtis (Scribe) on Feb 22, 2004 at 03:14 UTC
    thanks jeffa, you rock!
