EDIT: almut demonstrates above that there is a UA subclass that respects robots.txt files. That is probably what you're looking for.

I've always had a hard time figuring out how you pass values to a module, when your using a module that depend on others that you need to tincker with.

An example is now when im playing around with WWW::RobotRules.

The latter is not an example of the former in this case. WWW::RobotRules does not depend on another module at all. The synopsis just shows one way that a person could fetch the source of a robots.txt file to feed to WWW::RobotRules. To be clear: there is no dependency there. It is an example of one possible way to do it.

This module uses LWP::Simple for its requests (i think)

I can't tell what leads you to think this, because it is not the case at all. The synopsis makes it clear that LWP::Simple is used only to fetch data to feed to WWW::RobotRules, and that WWW::RobotRules provides no interface for fetching a network-based resource itself. Your example is also incomplete: you have not shown how you define $robots_txt.
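To see that WWW::RobotRules only parses text it is handed, here is a minimal sketch that never touches the network. The robots.txt content and the example.com URLs are made up for illustration; only the 'MOMspider/1.0' agent name comes from the module's synopsis.

```perl
use strict;
use warnings;
use WWW::RobotRules;

my $rules = WWW::RobotRules->new('MOMspider/1.0');

# Hypothetical robots.txt content, supplied as a plain string.
# No HTTP request is made anywhere in this script.
my $robots_url = 'http://example.com/robots.txt';
my $robots_txt = "User-agent: *\nDisallow: /private/\n";

# parse() needs the URL only to know which host the rules belong to.
$rules->parse($robots_url, $robots_txt);

for my $url ('http://example.com/index.html',
             'http://example.com/private/data.html') {
    my $verdict = $rules->allowed($url) ? 'allowed' : 'denied';
    print "$url: $verdict\n";
}
# http://example.com/index.html: allowed
# http://example.com/private/data.html: denied
```

How $robots_txt gets filled in is entirely up to you, which is why the choice of HTTP client is yours to make.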

Just to make it clear, I know how to read the docs for LWP::Simple ;), ... its jut a matter of how to I access this layer, since robotrules has precidence.

You've really confused yourself about exactly what your problem is, because you think there are relationships in your code when there actually are none. Your real problem is that the synopsis of WWW::RobotRules uses LWP::Simple to fetch the robots.txt file, and you don't know how to rewrite that part to enable the LWP features that you need. Here is the code that you are looking for:

    use LWP::UserAgent;
    use WWW::RobotRules;

    my $rules = WWW::RobotRules->new('MOMspider/1.0');

    # Use a full user agent object so options such as the timeout can be set.
    my $ua = LWP::UserAgent->new;
    $ua->timeout(10);    # seconds

    my $robots_url = '';    # URL of the robots.txt file to fetch
    my $response   = $ua->get($robots_url);

    if ($response->is_success) {
        my $robots_txt = $response->decoded_content;
        $rules->parse($robots_url, $robots_txt);

        # $url is the page you want to check; define it before this point.
        if ($rules->allowed($url)) {
            ...
        }
    }
    else {
        die "can't fetch $robots_url: " . $response->status_line;
    }
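For completeness, here is a rough sketch of the subclass route mentioned in the EDIT note at the top, assuming the subclass meant there is LWP::RobotUA (I have not confirmed that this is what almut posted). It fetches and honours robots.txt by itself and inherits timeout() from LWP::UserAgent. The contact address and the command-line handling are placeholders of mine.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# LWP::RobotUA requires an agent name and a contact address.
# The address here is a placeholder.
my $ua = LWP::RobotUA->new('MOMspider/1.0', 'webmaster@example.com');
$ua->timeout(10);     # seconds, inherited from LWP::UserAgent
$ua->delay(1 / 60);   # pause between requests to one host, in minutes

# Take the page to fetch from the command line; do nothing without one.
my $url = shift @ARGV;
if (defined $url) {
    my $response = $ua->get($url);
    if ($response->is_success) {
        print $response->decoded_content;
    }
    else {
        # A page disallowed by robots.txt comes back as an error response.
        warn "can't fetch $url: " . $response->status_line . "\n";
    }
}
```

With this approach you never call WWW::RobotRules yourself; the user agent consults it internally before each request.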

In reply to Re: Passing timeout params through WWW::RobotRules by trwww
in thread Passing timeout params through WWW::RobotRules by perlpreben
