in reply to Re^2: [OT] Ethical and Legal Screen Scraping
in thread [OT] Ethical and Legal Screen Scraping

The more I read your rebuttal to wazoox's point, the more I agree with you. So I'm only going to write about the points where I still disagree.

First, you've introduced a new term. We started with ethical and legal questions. You've introduced courtesy. I think this is yet another question, quite separate from the original two. That said, it's still a very relevant question.

Legal question. Disclaimer: I am not a lawyer. I also do not play one on TV. However, it's quite likely that the legal question only matters for commercial entities. Of course, with OSS "corporations", the line between private, not-for-profit, and commercial starts getting quite blurred. Most likely, if you were to scrape someone's site for purely personal purposes, and the company objected, the courts would laugh the case right into the "dismissed" bucket, and would likely get the website owner to pay your legal bills, if you asked nicely enough.

Ethical question: since ethics seem to be relativist these days, arguably nothing is unethical if you "sincerely" believe it to be ethical. Subscribing to a higher standard, the question becomes one of intent. Is your intent to profit from this scraping in ways that the site owner did not intend for you to profit? For example, scraping a news site so that you can send it to your PDA so you can read it on the train to work is profiting precisely the way the site owner desires you to profit from their site. If the site provides a PDA version of their site and you use that URL, all the better. Scraping the pay-only portions of the site to redistribute, however, would not be ethical. (And, in a cruel twist of fate, probably illegal, too ;-})

Finally, courtesy. If you're hitting a site for abnormal amounts of data, e.g., 50KB where average pages are 10KB, or 1MB where average pages are 100KB, you may want to send them a heads-up. But I'm not sure it's really required until you get into excess. For example, if you were downloading 1-10MB in short spans of time (e.g., full-speed on high-speed connections) multiple times per day, then it might be warranted. (I wouldn't really count "two" as multiple here - e.g., if you were grabbing the news on the way to work, and again right before you left work to go home.) Otherwise, I'm not really sure that most sites would care. You would also need to take into account whether the site owners are expecting that type of traffic (e.g., CNN) or not (e.g., PerlMonks ;-}).
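
For what it's worth, the courtesy rule of thumb above could be sketched as a small check. This is only an illustration in Python: the function name, the parameters, and the exact cut-offs (1MB per fetch, more than two fetches per day) are my own assumptions read off the examples above, not any kind of standard, and treating "the site expects heavy traffic" as an automatic pass is a simplification.

```python
MB = 1024 * 1024  # bytes in one megabyte (binary convention, an assumption)


def heads_up_warranted(bytes_per_fetch, fetches_per_day,
                       site_expects_heavy_traffic=False,
                       excess_bytes=1 * MB, casual_fetches=2):
    """Rough guess at whether to warn the site owner before scraping.

    All thresholds are hypothetical defaults taken from the examples
    in the post: "excess" starts around 1MB per fetch, and two fetches
    a day (morning and evening) do not count as "multiple".
    """
    # A site built for heavy traffic (a big news outlet, say) is
    # assumed not to care; this is a simplification.
    if site_expects_heavy_traffic:
        return False
    # Both conditions must hold: large downloads AND frequent ones.
    is_large = bytes_per_fetch >= excess_bytes
    is_frequent = fetches_per_day > casual_fetches
    return is_large and is_frequent


# Usage: 5MB pulled six times a day from a small community site.
print(heads_up_warranted(5 * MB, 6))
```

Grabbing 50KB of news twice a day comes out as False under these assumptions, which matches the intuition above; adjust the defaults to taste.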