http://www.perlmonks.org?node_id=835261


in reply to Passing timeout params through WWW::RobotRules

EDIT: almut demonstrates above that there is a UA subclass that respects robots.txt files. That is probably what you're looking for.
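
For readers who cannot see almut's post, here is a minimal sketch of that approach. It assumes the subclass in question is LWP::RobotUA, which ships with libwww-perl. The agent name, contact address, and the fetch_politely helper are placeholders invented for illustration.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Agent name and contact address are placeholders; LWP::RobotUA requires both.
my $ua = LWP::RobotUA->new('my-robot/0.1', 'me@example.com');
$ua->timeout(10);    # seconds, inherited from LWP::UserAgent
$ua->delay(1/60);    # minimum wait between requests to one host, in minutes

# Hypothetical helper: the user agent consults robots.txt itself before fetching.
sub fetch_politely {
    my ($url) = @_;
    my $response = $ua->get($url);
    warn "skipped $url: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response;
}
```

Calling fetch_politely('http://some.place/index.html') would then return an ordinary HTTP::Response object. Because LWP::RobotUA is a subclass of LWP::UserAgent, the timeout is set the same way as on a plain user agent.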

I've always had a hard time figuring out how you pass values to a module when you're using a module that depends on others that you need to tinker with.

An example is now, when I'm playing around with WWW::RobotRules.

The latter is not an example of the former in this case. WWW::RobotRules does not depend on any module that makes requests. The synopsis just shows one way that a person could fetch the source of a robots.txt file to feed to WWW::RobotRules. To be clear: there is no dependency there. It is an example of one possible way to do it.

This module uses LWP::Simple for its requests (I think).

I am unable to determine what leads you to think this, as it is not the case at all. The synopsis makes it clear that LWP::Simple is used only to fetch data to feed to WWW::RobotRules, and that WWW::RobotRules provides no interface for fetching a network-based resource for you. Your example is also incomplete: you have not shown how you define $robots_txt.
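
To illustrate that WWW::RobotRules only parses text it is handed, here is a small sketch that makes no network request at all. The robots.txt content and the URLs are made up for the example.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical robots.txt content, supplied as a plain string.
my $robots_txt = <<'END';
User-agent: *
Disallow: /private/
END

my $rules = WWW::RobotRules->new('MOMspider/1.0');

# parse() takes the URL the text came from plus the text itself;
# nothing is fetched here.
$rules->parse('http://some.place/robots.txt', $robots_txt);

for my $url ('http://some.place/index.html', 'http://some.place/private/a.html') {
    printf "%s: %s\n", $url, $rules->allowed($url) ? 'allowed' : 'disallowed';
}
```

Where the string comes from is entirely up to you: LWP::Simple, LWP::UserAgent, a file on disk, or a literal as above.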

Just to make it clear, I know how to read the docs for LWP::Simple ;), ... it's just a matter of how I access this layer, since robotrules has precedence.

You've really confused yourself about exactly what your problem is because you think there are relationships in your code when there actually are none. Your real problem is that the synopsis of WWW::RobotRules uses LWP::Simple to fetch the robots.txt file, and you don't know how to rewrite that part to enable the LWP features you need (here, a timeout). Here is the code that you are looking for:

use strict;
use warnings;
use LWP::UserAgent;
use WWW::RobotRules;

my $rules = WWW::RobotRules->new('MOMspider/1.0');

# Use a full user agent instead of LWP::Simple so the timeout can be set.
my $ua = LWP::UserAgent->new;
$ua->timeout(10);    # seconds

my $robots_url = 'http://some.place/robots.txt';
my $url        = 'http://some.place/index.html';    # placeholder: the page you want to fetch

my $response = $ua->get($robots_url);
if ($response->is_success) {
    my $robots_txt = $response->decoded_content;
    $rules->parse($robots_url, $robots_txt);
    if ($rules->allowed($url)) {
        ...
    }
}
else {
    die "can't fetch $robots_url: " . $response->status_line;
}