PerlMonks  

(OT) Robots disallow

by Anonymous Monk
on Apr 01, 2009 at 14:58 UTC ( [id://754725]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

How do I disallow bots or crawlers from accessing the server?
I have already added the following to the robots.txt file:

User-agent: *
Disallow: /

2009-04-01 Retitled by GrandFather, as per Monastery guidelines

Replies are listed 'Best First'.
Re: (OT) Robots disallow
by marto (Cardinal) on Apr 01, 2009 at 15:05 UTC
Re: (OT) Robots disallow
by kennethk (Abbot) on Apr 01, 2009 at 15:03 UTC
    You can't. There is no possible way for a server to tell the difference between a robot and a human user. The Robots_exclusion_standard is purely voluntary for robot scripters, and you have already followed the protocol.

      If there's a specific signature of the bot that you're having problems with (a specific domain, or an identifiable user-agent), look into configuring your webserver to send them alternative content.
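As a sketch of what "alternative content" might look like, here is a hypothetical Apache mod_rewrite fragment; the user-agent string "BadBot" and the /decoy.html target are made-up placeholders, not anything from the thread:

```apache
# Hypothetical example: match a crawler that identifies itself
# as "BadBot" (case-insensitive) and serve it a decoy page.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule ^ /decoy.html [L]
```

To refuse such requests outright instead, `RewriteRule ^ - [F]` would return a 403 Forbidden.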

      Back when it was easy to identify e-mail harvesters, I'd send them to a CGI that slowly feeds them bogus e-mail addresses and the abuse address from the netblock they're coming from. These days, most harvesters are coming from botnets, so the abuse one isn't so useful. (and yes, it _is_ a Perl script)
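The original script isn't shown, but a minimal sketch of a CGI that drips bogus addresses could look like this; the address format, the delay, and all names here are assumptions:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return one bogus address: eight random lowercase letters at
# example.com, so no real mailbox is ever handed out.
sub bogus_address {
    my $user = join '', map { ('a'..'z')[int rand 26] } 1 .. 8;
    return $user . '@example.com';
}

# Only drip addresses when actually invoked by a web server as a
# CGI (GATEWAY_INTERFACE is set by the server per the CGI spec).
if ($ENV{GATEWAY_INTERFACE}) {
    $| = 1;    # unbuffer so each address is sent as it is printed
    print "Content-type: text/html\n\n<html><body>\n";
    for (1 .. 20) {
        printf "<p>%s</p>\n", bogus_address();
        sleep 2;    # assumed delay: make the harvester wait
    }
    print "</body></html>\n";
}
```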

Re: (OT) Robots disallow
by CountZero (Bishop) on Apr 01, 2009 at 15:06 UTC
    The robots or crawlers are free to fully disregard the robots.txt directives. Certainly that is not nice, but the world is full of less than nice people (and robots and crawlers and ...)

    I would not care much about this. If you do not want the world to know about the info on your web-site, then don't publish it where everyone can see it, or put it behind password protection.

    Some more info can be found at The Web Robots pages.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      And conversely, there are less than nice web sites blocking robots for no reason.

Re: (OT) Robots disallow
by kyle (Abbot) on Apr 01, 2009 at 15:24 UTC

    As others have noted, you're already doing the Right Thing to rid yourself of robots that play by the rules. You might be able to discourage badly behaved robots by creating a tar pit for them to wander into. The problems with this are many, and I'm not inclined to discuss them in any detail, but if you're in desperate times and looking for desperate measures, it's an idea worth consideration and probably rejection.
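For what such a tar pit might look like, here is a rough sketch, with the caveats above still applying; the /trap/ path, the link count, and the delay are all invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build $n links that lead one level deeper into the trap.
sub trap_links {
    my ($depth, $n) = @_;
    return map { sprintf '<a href="/trap/%d/%d">page %d</a>', $depth + 1, $_, $_ } 1 .. $n;
}

# Only respond when actually invoked by a web server as a CGI.
if ($ENV{GATEWAY_INTERFACE}) {
    # Count slashes in PATH_INFO to see how deep the robot already is.
    my $depth = () = ($ENV{PATH_INFO} // '') =~ m{/}g;
    sleep 1;    # assumed delay: every page costs the robot a second
    print "Content-type: text/html\n\n<html><body>\n";
    print "$_<br>\n" for trap_links($depth, 10);
    print "</body></html>\n";
}
```

Listing the trap path under Disallow in robots.txt means well-behaved robots never see it; only those ignoring the file wander in.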

Re: (OT) Robots disallow
by VinsWorldcom (Prior) on Apr 01, 2009 at 15:16 UTC
    <sarcasm on>

    Your code was a bit off. I updated to this:

    #!/usr/bin/perl
    use strict;
    open (OUT, ">robots.txt") || die "Cannot open robots.txt\n";
    print OUT "User-agent: *\n";
    print OUT "Disallow: /\n";
    close OUT;
    <sarcasm off>

Node Type: perlquestion [id://754725]
Approved by kennethk