PerlMonks  

Passing timeout params through WWW::RobotRules

by perlpreben (Beadle)
on Apr 17, 2010 at 14:11 UTC ( #835254=perlquestion )

perlpreben has asked for the wisdom of the Perl Monks concerning the following question:

Hi
I've always had a hard time figuring out how to pass values to a module when the module you're using depends on others that you need to tinker with.

An example is now, when I'm playing around with WWW::RobotRules.
This module uses LWP::Simple for its requests (I think), and I want to pass a timeout => '2' to LWP::Simple, but all I have in my code is things revolving around the main module, RobotRules.

Example;
require WWW::RobotRules;

my $user_agent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)";
my $robotsrules = WWW::RobotRules->new($user_agent);  # not '$user_agent' -- quoting it would pass the literal string

.....

$robotsrules->parse($url, $robots_txt);
if ($robotsrules->allowed($url)) {
}

Just to make it clear, I know how to read the docs for LWP::Simple ;) ... it's just a matter of how I access this layer, since RobotRules has precedence.

Replies are listed 'Best First'.
Re: Passing timeout params through WWW::RobotRules
by almut (Canon) on Apr 17, 2010 at 15:49 UTC

    What is fetching your $robots_txt file?  As far as I understand, WWW::RobotRules does not fetch files itself, but rather expects you to use some user agent like LWP::Simple or LWP::UserAgent for that...  And with the latter, you could then just call the timeout method on the user agent object, e.g.

    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    $ua->timeout(2);
    ...

    or, when you're using LWP::Simple, you have to import $ua to get access to its internal LWP::UserAgent object:

    use LWP::Simple qw(get $ua);

    $ua->timeout(2);
    ...

    Or better yet, simply use LWP::RobotUA in the first place, which should provide the same timeout method, as it's a sub-class of LWP::UserAgent.  Something like this (if I'm reading the docs correctly — i.e. untested):

    use LWP::RobotUA;
    use WWW::RobotRules;

    my $ua    = LWP::RobotUA->new(...);
    my $rules = WWW::RobotRules->new(...);
    $ua->rules($rules);   # optional - defaults used otherwise
    $ua->timeout(2);

    my $response = $ua->get(...);
    ...
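    Filling in those placeholders, a concrete configuration could look like the sketch below (untested against a live site; the agent name and contact address are made-up values you would replace with your own):

    ```perl
    use strict;
    use warnings;
    use LWP::RobotUA;
    use WWW::RobotRules;

    # Agent name and "from" address are illustrative placeholders
    my $ua = LWP::RobotUA->new('my-crawler/0.1', 'me@example.com');
    $ua->timeout(2);   # seconds before a request gives up
    $ua->delay(1);     # minutes to wait between requests to the same host

    # Optionally supply your own rules object; otherwise LWP::RobotUA
    # fetches and parses robots.txt itself on the first request
    my $rules = WWW::RobotRules->new('my-crawler/0.1');
    $ua->rules($rules);

    # A subsequent $ua->get($url) would now honor robots.txt automatically
    ```

    The appeal of this approach is that the robots.txt handling and the timeout live on the same object, so there is no second layer to reach through.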
Re: Passing timeout params through WWW::RobotRules
by trwww (Priest) on Apr 17, 2010 at 16:12 UTC

    EDIT: almut demonstrates above that there is a UA subclass that respects robots.txt files. That is probably what you're looking for.

    I've always had a hard time figuring out how to pass values to a module when the module you're using depends on others that you need to tinker with.

    An example is now, when I'm playing around with WWW::RobotRules.

    The latter is not an example of the former in this case. WWW::RobotRules does not depend on another module at all. The synopsis just shows one way that a person could fetch the source of a robots.txt file to feed to WWW::RobotRules. To be clear: there is no dependency there. It is an example of one possible way to do it.
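    To illustrate that point, here is a minimal, self-contained sketch (the robots.txt text and URLs are invented for demonstration): WWW::RobotRules happily parses any string you hand it, with no fetching module involved at all.

    ```perl
    use strict;
    use warnings;
    use WWW::RobotRules;

    # Any string source works -- here a hardcoded robots.txt for illustration
    my $robots_txt = <<'END';
    User-agent: *
    Disallow: /private/
    END

    my $rules = WWW::RobotRules->new('MyBot/1.0');
    $rules->parse('http://example.com/robots.txt', $robots_txt);

    print $rules->allowed('http://example.com/index.html')      ? "allowed\n" : "denied\n";
    print $rules->allowed('http://example.com/private/secret')  ? "allowed\n" : "denied\n";
    ```

    Where the string comes from -- LWP::Simple, a file on disk, a test fixture -- is entirely up to you.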

    This module uses LWP::Simple for its requests (I think)

    I am unable to determine what leads you to think this, as it is not the case at all. The synopsis makes clear that LWP::Simple is used only to fetch data to feed to WWW::RobotRules, and that WWW::RobotRules provides no interface for fetching a network-based resource for you. Your example is also incomplete: you have not shown how you define $robots_txt.

    Just to make it clear, I know how to read the docs for LWP::Simple ;) ... it's just a matter of how I access this layer, since RobotRules has precedence.

    You've really confused yourself about what your problem is, because you think there are relationships in your code where there actually are none. Your real problem is that the synopsis of WWW::RobotRules uses LWP::Simple to fetch the robots.txt file, and you don't know how to rewrite that part to enable the LWP features you need. Here is the code you are looking for:

    use LWP::UserAgent;
    use WWW::RobotRules;

    my $rules = WWW::RobotRules->new('MOMspider/1.0');

    my $ua = LWP::UserAgent->new;
    $ua->timeout(10);

    my $robots_url = 'http://some.place/robots.txt';
    my $response   = $ua->get($robots_url);

    if ($response->is_success) {
        my $robots_txt = $response->decoded_content;
        $rules->parse($robots_url, $robots_txt);

        # $url is the page you want to check against the rules
        if ($rules->allowed($url)) {
            ...
        }
    }
    else {
        die "can't fetch $robots_url: " . $response->status_line;
    }
      Ahh, I truly misunderstood then. But that makes everything very clear. I can use Mechanize then, since it's the one I'm using in the other parts. Thank you so much for making this clear for me :)

Approved by Old_Gray_Bear