Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
Kudos to you for wanting polite bots. The problem with getting LWP::RobotUA to play nice with WWW::Mechanize is that they both are subclasses of LWP::UserAgent. By itself, WWW::Mechanize does not consult the /robots.txt file, but you can instead use WWW::RobotRules. Here is a working example that tries to grab two files from my server:
use strict; use warnings; use WWW::Mechanize; use WWW::RobotRules; use LWP::Simple; my $SITE = ''; my $rules = WWW::RobotRules->new('bot/1.0'); my $robot_url = "$SITE/robots.txt"; my $robot_data = LWP::Simple::get($robot_url); $rules->parse($robot_url, $robot_data) if $robot_data; for ('disallow.txt', 'allow.txt') { my $url = "$SITE/$_"; if($rules->allowed($url)) { my $mech = WWW::Mechanize->new; $mech->get($url); print "$url:\n", $mech->content; } else { print "$url:\ndenied\n"; } }
There might be a better way though ... ahh, how about "WWW::Mechanize::Polite"?
package WWW::Mechanize::Polite; use base 'WWW::Mechanize'; use WWW::RobotRules; sub new { my $self = shift->SUPER::new(@_); $self->{robo_rules} = WWW::RobotRules->new($self->agent()); return $self; } sub parse_robots { my ($self,$url) = @_; $self->get($url); $self->{robo_rules}->parse($url, $self->content); } sub polite_get { my ($self,$url) = @_; if ($self->{robo_rules}->allowed($url)) { $self->get($url); } else { undef $self->{content}; } } 1;
And a client script:
use WWW::Mechanize::Polite; my $SITE = ''; my $mech = WWW::Mechanize::Polite->new; $mech->parse_robots("$SITE/robots.txt"); for ('allow.txt', 'disallow.txt') { my $url = "$SITE/$_"; $mech->polite_get($url); print "$url:\n", $mech->content ? $mech->content : "denied\n"; }
And if i didn't just reinvent a wheel, you might be seeing this on the CPAN. ;)


(the triplet paradiddle with high-hat)

In reply to WWW::Mechanize::Polite ? by jeffa
in thread Using URI::URL by mkurtis

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others rifling through the Monastery: (5)
    As of 2020-05-25 17:20 GMT
    Find Nodes?
      Voting Booth?
      If programming languages were movie genres, Perl would be:

      Results (146 votes). Check out past polls.