PerlMonks  

(OT) Robots disallow

by Anonymous Monk
on Apr 01, 2009 at 14:58 UTC ( [id://754725]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

How do I disallow bots or crawlers from accessing the server?
I have already added the following to the robots.txt file:

User-agent: *
Disallow: /

2009-04-01 Retitled by GrandFather, as per Monastery guidelines

Replies are listed 'Best First'.
Re: (OT) Robots disallow
by marto (Cardinal) on Apr 01, 2009 at 15:05 UTC
Re: (OT) Robots disallow
by kennethk (Abbot) on Apr 01, 2009 at 15:03 UTC
    You can't. There is no possible way for a server to tell the difference between a robot and a human user. The Robots_exclusion_standard is purely voluntary for robot scripters, and you have already followed the protocol.

      If there's a specific signature of the bot that you're having problems with (a specific domain, or an identifiable user-agent), look into configuring your webserver to send them alternative content.
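As a sketch of what "alternative content" might look like, here is a hypothetical Apache mod_rewrite fragment; the user-agent string "BadBot" and the /decoy.html target are made-up placeholders, not anything from the thread:

```apache
# Hypothetical example: match a crawler that identifies itself
# as "BadBot" (case-insensitive) and serve it a decoy page.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule ^ /decoy.html [L]
```

To refuse such requests outright instead, `RewriteRule ^ - [F]` would return a 403 Forbidden.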

      Back when it was easy to identify e-mail harvesters, I'd send them to a CGI that slowly feeds them bogus e-mail addresses and the abuse address from the netblock they're coming from. These days, most harvesters are coming from botnets, so the abuse one isn't so useful. (and yes, it _is_ a Perl script)
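The original script isn't shown, but a minimal sketch of a CGI that drips bogus addresses could look like this; the address format, the delay, and all names here are assumptions:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return one bogus address: eight random lowercase letters at
# example.com, so no real mailbox is ever handed out.
sub bogus_address {
    my $user = join '', map { ('a'..'z')[int rand 26] } 1 .. 8;
    return $user . '@example.com';
}

# Only drip addresses when actually invoked by a web server as a
# CGI (GATEWAY_INTERFACE is set by the server per the CGI spec).
if ($ENV{GATEWAY_INTERFACE}) {
    $| = 1;    # unbuffer so each address is sent as it is printed
    print "Content-type: text/html\n\n<html><body>\n";
    for (1 .. 20) {
        printf "<p>%s</p>\n", bogus_address();
        sleep 2;    # assumed delay: make the harvester wait
    }
    print "</body></html>\n";
}
```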

Re: (OT) Robots disallow
by CountZero (Bishop) on Apr 01, 2009 at 15:06 UTC
    The robots or crawlers are free to fully disregard the robots.txt directives. Certainly that is not nice, but the world is full of less than nice people (and robots and crawlers and ...)

    I would not care much about this. If you do not want the world to know about the info on your web-site, then don't publish it where everyone can see it, or put it behind password protection.

    Some more info can be found at The Web Robots pages.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      And conversely, there are less than nice web sites blocking robots for no reason.

Re: (OT) Robots disallow
by kyle (Abbot) on Apr 01, 2009 at 15:24 UTC

    As others have noted, you're already doing the Right Thing to rid yourself of robots that play by the rules. You might be able to discourage badly behaved robots by creating a tar pit for them to wander into. The problems with this are many, and I'm not inclined to discuss them in any detail, but if you're in desperate times and looking for desperate measures, it's an idea worth consideration and probably rejection.
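For what such a tar pit might look like, here is a rough sketch, with the caveats above still applying; the /trap/ path, the link count, and the delay are all invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build $n links that lead one level deeper into the trap.
sub trap_links {
    my ($depth, $n) = @_;
    return map { sprintf '<a href="/trap/%d/%d">page %d</a>', $depth + 1, $_, $_ } 1 .. $n;
}

# Only respond when actually invoked by a web server as a CGI.
if ($ENV{GATEWAY_INTERFACE}) {
    # Count slashes in PATH_INFO to see how deep the robot already is.
    my $depth = () = ($ENV{PATH_INFO} // '') =~ m{/}g;
    sleep 1;    # assumed delay: every page costs the robot a second
    print "Content-type: text/html\n\n<html><body>\n";
    print "$_<br>\n" for trap_links($depth, 10);
    print "</body></html>\n";
}
```

Listing the trap path under Disallow in robots.txt means well-behaved robots never see it; only those ignoring the file wander in.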

Re: (OT) Robots disallow
by VinsWorldcom (Prior) on Apr 01, 2009 at 15:16 UTC
    <sarcasm on>

    Your code was a bit off. I updated to this:

    #!/usr/bin/perl
    use strict;
    open (OUT, ">robots.txt") || die "Cannot open robots.txt\n";
    print OUT "User-agent: *\n";
    print OUT "Disallow: /\n";
    close OUT;
    <sarcasm off>

Node Type: perlquestion [id://754725]
Approved by kennethk