PerlMonks  

WWW::Mechanize::Polite ?

by jeffa (Bishop)
on Feb 22, 2004 at 02:58 UTC ( [id://330872]=note )


in reply to Re: Re: Using URI::URL
in thread Using URI::URL

Kudos to you for wanting polite bots. The problem with getting LWP::RobotUA to play nice with WWW::Mechanize is that both are subclasses of LWP::UserAgent, so there is no obvious way to plug one into the other. By itself, WWW::Mechanize does not consult the /robots.txt file, but you can use WWW::RobotRules to do that check yourself. Here is a working example that tries to grab two files from my server:
    use strict;
    use warnings;
    use WWW::Mechanize;
    use WWW::RobotRules;
    use LWP::Simple;

    my $SITE = 'http://www.unlocalhost.com';

    # fetch the site's robots.txt and load its rules
    my $rules      = WWW::RobotRules->new('bot/1.0');
    my $robot_url  = "$SITE/robots.txt";
    my $robot_data = LWP::Simple::get($robot_url);
    $rules->parse($robot_url, $robot_data) if $robot_data;

    # only fetch the URLs that the rules allow
    for ('disallow.txt', 'allow.txt') {
        my $url = "$SITE/$_";
        if ($rules->allowed($url)) {
            my $mech = WWW::Mechanize->new;
            $mech->get($url);
            print "$url:\n", $mech->content;
        }
        else {
            print "$url:\ndenied\n";
        }
    }
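To see what the allowed() check does without touching the network, here is a minimal sketch that feeds WWW::RobotRules a robots.txt body from a string. The example.com hosts and the robots.txt contents are assumptions made up for illustration; they are not the actual file on the server above.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical host and robots.txt body, for illustration only
my $SITE       = 'http://www.example.com';
my $robots_txt = <<'END';
User-agent: *
Disallow: /disallow.txt
END

my $rules = WWW::RobotRules->new('bot/1.0');
$rules->parse("$SITE/robots.txt", $robots_txt);

# record what allowed() says for each URL; a true value permits the fetch
my %verdict;
for my $url ("$SITE/allow.txt", "$SITE/disallow.txt",
             'http://other.example.com/x.txt') {
    $verdict{$url} = $rules->allowed($url);
    print "$url: ", ($verdict{$url} ? "allowed" : "denied"), "\n";
}
```

One thing to watch: as far as I can tell from the module source, allowed() returns -1 for a host whose robots.txt has not been parsed, and -1 is still a true value in Perl. So a plain boolean test lets the request through when robots.txt could not be fetched, which may or may not be what you want.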
There might be a better way though ... ahh, how about "WWW::Mechanize::Polite"?
    package WWW::Mechanize::Polite;
    use strict;
    use warnings;
    use base 'WWW::Mechanize';
    use WWW::RobotRules;

    sub new {
        my $self = shift->SUPER::new(@_);
        # match robots.txt records against this agent's name
        $self->{robo_rules} = WWW::RobotRules->new($self->agent());
        return $self;
    }

    # fetch robots.txt from $url and load its rules
    sub parse_robots {
        my ($self, $url) = @_;
        $self->get($url);
        $self->{robo_rules}->parse($url, $self->content);
    }

    # get $url only if the loaded rules allow it; otherwise clear the content
    sub polite_get {
        my ($self, $url) = @_;
        if ($self->{robo_rules}->allowed($url)) {
            $self->get($url);
        }
        else {
            undef $self->{content};
        }
    }

    1;
And a client script:
    use strict;
    use warnings;
    use WWW::Mechanize::Polite;

    my $SITE = 'http://www.unlocalhost.com';

    my $mech = WWW::Mechanize::Polite->new;
    $mech->parse_robots("$SITE/robots.txt");

    for ('allow.txt', 'disallow.txt') {
        my $url = "$SITE/$_";
        $mech->polite_get($url);
        # polite_get leaves the content undefined when the URL is disallowed
        print "$url:\n", $mech->content ? $mech->content : "denied\n";
    }
And if I didn't just reinvent a wheel, you might be seeing this on the CPAN. ;)
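For comparison, when the extra features of WWW::Mechanize aren't needed, LWP::RobotUA does the robots.txt check by itself. This is only a minimal sketch: the contact address is hypothetical, and the URLs are taken from the command line rather than hard-coded.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# LWP::RobotUA wants an agent name and a contact address;
# the address here is a hypothetical placeholder
my $ua = LWP::RobotUA->new(
    agent => 'bot/1.0',
    from  => 'webmaster@example.com',
);
$ua->delay(1/60);    # delay is given in minutes, so this is one second

# URLs come from the command line; with none given, nothing is fetched
for my $url (@ARGV) {
    my $res = $ua->get($url);    # robots.txt is fetched and consulted for us
    print "$url:\n",
        ($res->is_success ? $res->decoded_content : $res->status_line . "\n");
}
```

A URL that robots.txt disallows should come back as an error response instead of being fetched, so the status line is printed in that case.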

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

Re: WWW::Mechanize::Polite ?
by mkurtis (Scribe) on Feb 22, 2004 at 03:14 UTC
    thanks jeffa, you rock!
