The more I read your rebuttal to wazoox's point, the more I agree with you. So I'm only going to write about the points where I still disagree.
First, you've introduced a new term. We started with ethical and legal questions. You've introduced courtesy. I think this is yet another question, quite separate from the original two. That said, it's still a very relevant question.
Legal question: Disclaimer: I am not a lawyer. I also do not play one on TV. However, it's quite likely that the legal question is only a question for commercial entities. Of course, with OSS "corporations", the line between private, not-for-profit, and commercial starts getting quite blurred. Most likely, if you were to scrape someone's site for purely personal purposes, and the company objected, the courts would laugh the case right into the "dismissed" bucket, and likely get the website owner to pay your legal bills, if you asked nicely enough.
Ethical question: since ethics seem to be relativist these days, arguably nothing is unethical if you "sincerely" believe it to be ethical. Subscribing to a higher standard, the question becomes one of intent. Is your intent to profit from this scraping in ways that the site owner did not intend for you to profit? For example, scraping a news site so that you can send it to your PDA so you can read it on the train to work is profiting precisely the way the site owner desires you to profit from their site. If the site provides a PDA version of their site and you use that URL, all the better. Scraping the pay-only portions of the site to redistribute, however, would not be ethical. (And, in a cruel twist of fate, probably illegal, too ;-})
Finally, courtesy. If you're hitting a site for abnormal amounts of data, e.g., 50KB where average pages are 10KB, or 1MB where average pages are 100KB, you may want to send them a heads-up. But I'm not sure it's really required until you get into excess. For example, if you were downloading 1-10MB in short spans of time (e.g., full-speed on high-speed connections) multiple times per day, then it might be warranted. (I wouldn't count "two" as multiple here, e.g., grabbing the news on the way to work and again right before you leave work to go home.) Otherwise, I'm not sure that most sites would care. You would also need to take into account whether the site owners are expecting that type of traffic (e.g., CNN) or not (e.g., PerlMonks ;-}).
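To make that rule of thumb a bit more concrete, here's a rough sketch in Python. It's only my reading of the numbers above, not a standard: the function name `heads_up_warranted`, the 1MB-per-run cutoff, the "more than two runs a day" cutoff, and the idea of skipping the check entirely for sites that expect heavy traffic are all assumptions made for illustration.

```python
MB = 1024 * 1024  # bytes in a megabyte

def heads_up_warranted(bytes_per_run, runs_per_day, site_expects_heavy_traffic=False):
    """Rough courtesy check; every threshold here is an illustrative assumption."""
    # Assumption: a site built for heavy traffic (think CNN) won't notice you.
    if site_expects_heavy_traffic:
        return False
    # Assumption: "excess" starts at the low end of the 1-10MB range.
    heavy_run = bytes_per_run >= 1 * MB
    # Assumption: twice a day (to work and back) doesn't count as "multiple".
    frequent = runs_per_day > 2
    return heavy_run and frequent

# Grabbing the news twice a day, about 200KB each time.
print(heads_up_warranted(200 * 1024, 2))
# Pulling 5MB six times a day from a small site.
print(heads_up_warranted(5 * MB, 6))
# The same load against a site that expects that kind of traffic.
print(heads_up_warranted(5 * MB, 6, site_expects_heavy_traffic=True))
```

Adjust the cutoffs to taste; the point is only that both the size of each run and how often you run it matter, and that the answer depends on the site.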