- Obey robots.txt.
- Don't flood a site.
- Don't republish, especially not anything that might be copyrighted.
- Abide by the site terms of service.
I was further interested to learn from Chip Salzenberg's letter at geeksunite that "Federal courts have upheld that web spiders must obey the established robots.txt mechanism by which web site owners limit automated access and that a failure to obey robots.txt constitutes trespass".
However, I'm confused about who robots.txt is intended for. I understand that robots.txt applies to heavy-duty web spiders and indexers, such as Google's robot. But does it also apply to little screen-scraping tools written by private individuals? For example, suppose I write a little tool using LWP::UserAgent or WWW::Mechanize (rather than LWP::RobotUA or, say, WWW::Mechanize::Polite?) that simply collects a number of web pages for me while I sleep. Is it illegal or unethical for such a scraper to ignore robots.txt?
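To make the question concrete, here is a minimal sketch of what "respecting robots.txt" could look like in a small personal scraper. It is written in Python with the standard `urllib.robotparser` module rather than with LWP::RobotUA, purely for illustration. The agent name, the sample robots.txt rules, the example.com URLs, and the helpers `build_parser`, `allowed_urls`, and `crawl` are all hypothetical. It says nothing about the legal side of the question.

```python
import time
import urllib.robotparser

# Hypothetical robots.txt content; a real tool would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

AGENT = "my-night-scraper/0.1"  # hypothetical user-agent name


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser (no network access)."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, agent, urls):
    """Keep only the URLs that robots.txt permits this agent to fetch."""
    return [url for url in urls if parser.can_fetch(agent, url)]


def crawl(parser, agent, urls, fetch, default_delay=1.0, sleep=time.sleep):
    """Fetch permitted URLs one at a time, pausing between requests.

    `fetch` is any callable that takes a URL and returns its content.
    The pause is the site's Crawl-delay if given, else `default_delay`.
    """
    delay = parser.crawl_delay(agent) or default_delay
    pages = {}
    for i, url in enumerate(allowed_urls(parser, agent, urls)):
        if i > 0:
            sleep(delay)  # don't flood the site
        pages[url] = fetch(url)
    return pages
```

With these sample rules, a URL under `/private/` would be skipped and the rest fetched two seconds apart. Passing `fetch` in as a parameter is only a convenience for trying the logic without touching a real site.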
If a commercial company sells a tool that allows non-programmer end users to write little screen-scraping robots, is it unethical or illegal for such a product not to provide a mechanism that lets its end users respect robots.txt?