As suggested by Abigail-II in Re: Web Robot,
a polite robot should:
- Obey robots.txt.
- Not flood a site.
- Not republish content, especially not anything that might be copyrighted.
- Abide by the site's terms of service.
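For the first two points, Perl already does most of the work for you: LWP::RobotUA is a drop-in replacement for LWP::UserAgent that fetches and honors robots.txt and rate-limits requests per host. A minimal sketch (the bot name, contact address, and URL are made up for illustration):

```perl
use strict;
use warnings;
use LWP::RobotUA;

# LWP::RobotUA automatically fetches robots.txt for each host,
# refuses requests the rules disallow, and sleeps between requests.
my $ua = LWP::RobotUA->new(
    agent => 'ExampleBot/0.1',     # hypothetical bot name
    from  => 'me@example.com',     # contact address (an assumption)
);
$ua->delay(1);                     # wait at least 1 minute between requests

my $res = $ua->get('http://example.com/page.html');
if ($res->is_success) {
    print $res->decoded_content;
}
else {
    # A fetch disallowed by robots.txt comes back as a 403 error response.
    warn 'Not fetched: ', $res->status_line, "\n";
}
```

The other two points (republishing, terms of service) are policy questions no module can answer for you.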
I was further interested to read in
Chip Salzenberg's letter at geeksunite:
"Federal courts have upheld that web spiders must obey
the established robots.txt mechanism by which web site owners
limit automated access, and that a failure to obey robots.txt [...]"
However, I'm confused about who robots.txt is intended for.
I understand that robots.txt applies to heavy-duty web spiders and
indexers, such as a Google robot. But does it also apply to
little screen-scraping tools written by private individuals?
For example, suppose I write a little tool using plain LWP::UserAgent
(rather than LWP::RobotUA or WWW::Mechanize::Polite?, say) that simply
collects a number of web pages for me while I sleep.
Is it illegal or unethical for such a scraper to ignore robots.txt?
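Whatever the answer, respecting robots.txt from a hand-rolled scraper takes only a few extra lines via WWW::RobotRules (which LWP::RobotUA uses internally). A sketch, with a hypothetical bot name and URLs:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use WWW::RobotRules;

my $agent = 'LittleScraper/0.1';           # hypothetical bot name
my $ua    = LWP::UserAgent->new(agent => $agent);

# Fetch and parse the site's robots.txt once up front.
my $rules      = WWW::RobotRules->new($agent);
my $robots_url = 'http://example.com/robots.txt';
my $robots_res = $ua->get($robots_url);
$rules->parse($robots_url, $robots_res->decoded_content)
    if $robots_res->is_success;

# Consult the parsed rules before each fetch.
my $url = 'http://example.com/some/page.html';
if ($rules->allowed($url)) {
    my $res = $ua->get($url);
    print $res->decoded_content if $res->is_success;
}
else {
    warn "robots.txt disallows $url; skipping\n";
}
```

So the cost of compliance is small; the question is whether a private, one-user tool is obliged to pay it at all.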
If a commercial company sells a tool that allows non-programmer
end users to write little screen-scraping robots, is it unethical
or illegal for such a product not to provide a mechanism for
its end users to respect robots.txt?