
Re^3: [OT] Ethical and Legal Screen Scraping (and courtesy)

by Tanktalus (Canon)
on Jul 25, 2005 at 22:38 UTC

in reply to Re^2: [OT] Ethical and Legal Screen Scraping
in thread [OT] Ethical and Legal Screen Scraping

The more I read your rebuttal to wazoox's point, the more I agree with you. So I'm only going to write about the points where I still disagree.

First, you've introduced a new term. We started with ethical and legal questions. You've introduced courtesy. I think this is yet another question, quite separate from the original two. That said, it's still a very relevant question.

Legal question: Disclaimer: I am not a lawyer. I also do not play one on TV. However, it's quite likely that the legal question is only a question for commercial entities. Of course, with OSS "corporations", the line between private, not-for-profit, and commercial starts getting quite blurred. Most likely, if you were to scrape someone's site for purely personal purposes, and the company objected, the courts would laugh the case right into the "dismissed" bucket. And likely get the website owner to pay your legal bills, if you asked nicely enough.

Ethical question: since ethics seem to be relativist these days, arguably nothing is unethical if you "sincerely" believe it to be ethical. Subscribing to a higher standard, the question becomes one of intent. Is your intent to profit from this scraping in ways that the site owner did not intend for you to profit? For example, scraping a news site so that you can send it to your PDA so you can read it on the train to work is profiting precisely the way the site owner desires you to profit from their site. If the site provides a PDA version of their site and you use that URL, all the better. Scraping the pay-only portions of the site to redistribute, however, would not be ethical. (And, in a cruel twist of fate, probably illegal, too ;-})

Finally, courtesy. If you're hitting a site for abnormal amounts of data, e.g., 50KB where average pages are 10KB, or 1MB where average pages are 100KB, you may want to send them a heads-up. But I'm not sure it's really required until you get into excess. For example, if you were downloading 1-10MB in short spans of time (e.g., full-speed on high-speed connections) multiple times per day, then it might be warranted. (I wouldn't really count "two" as multiple here, e.g., if you were grabbing the news on the way to work, and again right before you left work to go home.) Otherwise, I'm not really sure that most sites would care. You would also need to take into account whether the site owners are expecting that type of traffic (e.g., CNN) or not (e.g., PerlMonks ;-}).

Re^4: [OT] Ethical and Legal Screen Scraping (and courtesy)
by tilly (Archbishop) on Jul 25, 2005 at 23:01 UTC
    Your flip comment about relativist ethics hits a sore point for me. Moral relativism does not absolve you from having a system of morals. And if you really believe it does, then depending on details you're a psychopath or a sociopath and I'd prefer that you be a long ways away from me (preferably in jail). For more on that read this post by me on another site explaining my views in more detail.

    That said, I admit that different people will have different views of what is or is not ethical. And, as your examples illustrate, there are plenty of uses of an automated agent that most (including me) would agree justify ignoring robots.txt. But you have to think about what you're doing and why.

    However at least one of your examples is questionable. Suppose that you're downloading 1-10 MB in short spans of time, and the machine that you're hitting is a public webserver hosted on someone's personal machine. A machine whose bandwidth is no better than your own. While you might not think that that's an issue, the webserver operator may not agree. Nor may other users. Nor may other people who are hosted on that machine. This applies for both personal webservers and also many small businesses. You may not know enough to determine whether this is an issue - but the website owner's opinion is right there in robots.txt.

    Suppose, to take another example of yours, that you are scraping a pay-only portion of a site for personal use. Even though you believe that your use is ethical, the website owner has no way of knowing that. The website owner may or may not have had you click through an agreement that you won't use automated bots. Now what you're doing may be illegal (you are violating a contract) and is of questionable morality despite your justifications (you are breaking your word). Furthermore you run a real risk of having the website owner notice what you're doing and block you. (Whether or not there is an agreement.)

    This doesn't just happen at pay sites. I know of multiple people who have found themselves shut out of Google after testing a bot there.

    Now I'll agree that most websites don't monitor things that closely. As a practical matter, plenty of people ignore robots.txt and don't get caught. But I still think that if you are writing a bot, then you should either pay attention to robots.txt or have a good reason not to.
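    For anyone who wants to act on that, here is a minimal sketch of checking robots.txt before fetching a page. It uses Python's standard urllib.robotparser module. The helper name may_fetch, the bot name "example-bot", the example.com URLs and the sample rules are all made up for illustration; a real bot would download the site's own robots.txt rather than parse a hard-coded string.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url.

    Hypothetical helper: it parses robots.txt content that the caller has
    already obtained, so the check itself needs no network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules: every bot is asked to stay out of /members/.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /members/
"""

if __name__ == "__main__":
    for page in ("http://example.com/news/today.html",
                 "http://example.com/members/archive.html"):
        verdict = "allowed" if may_fetch(SAMPLE_ROBOTS, "example-bot", page) else "disallowed"
        print(page, "is", verdict)
```

    With these sample rules the news page is allowed and the members page is not. If the check says no and you still think you have a good reason to fetch, that is the point at which to stop and think, or to ask the site owner.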

      My "flip" comment on relativist ethics was, I think, completely misunderstood because you seem to have a very large sore spot surrounding the issue. The point is that if you have "relative" ethics (everyone is entitled to their own ethics and are not more or less "good" than the ethics of others), there is no common measuring stick that you and I can both apply to the situation at hand. Since we are even having a common discussion, I have to assume absolutist ethics (one thing may be considered "more ethical" than another, regardless of who does it - this doesn't mean that we necessarily agree on what is more ethical than something else, just the concept that there is such a measuring stick), otherwise there is no discussion. I merely pointed out this assumption because if you wanted to disagree on which framework to surround the discussion, I would show that I've actually thought about the differences and we could have a frank, but brief, discussion on that point.

      I do not believe I have mischaracterised ethical relativism in any way. I pointed out the one requirement is that your belief is "sincere" (i.e., you're not just making it up to suit the situation). Of course, as with the rest of relativistic philosophy, the definition of "sincere" is still a bit blurry, thus the quotes.

      Back to the question at hand. You continue your misunderstanding by questioning my "courtesy" example with an example which I have explicitly taken care of with the last sentence in the post you are replying to. If that machine is a public webserver hosted on someone's personal machine, I would argue that this is the scenario where you "need to take into account whether the site owners are expecting that type of traffic (e.g., CNN) or not (e.g., PerlMonks ;-})."

      As to the website owner who decides to block you from access that you're paying for: I want to point out that I treated the three questions, legal, ethical, and courtesy, independently. Just because something may be ethical doesn't mean it's legal, or vice versa. Just because something may be done out of courtesy does not mean that ethics demands it, nor that the law does. So I may believe it's ethical to scrape the website for the data you've paid for, but the legal contract may say otherwise. And that's a completely valid description of reality (whether or not I agree that the contract is fair, i.e., ethical).

      Service providers provide their service under certain conditions which they are legally entitled to enforce. However, if what they are blocking would fall under "fair use" style common law, they may find their customers leaving for providers who provide more lenient access.

      Google gets away with what they're doing by having a free registration system for API access where they can ensure a quality of service to everyone simultaneously. It seems eminently fair, and far-sighted, to me. However, if CNN blocked me from minor amounts of scraping (where I define minor to be in the "ethical" example above), I would simply find another news provider. Say or or maybe something on Google. The choice to serve is equal to the choice to find a new provider of service ;-)

      You're introducing another concept here, morality, which is definitely not ethics.

      Without any intention of starting a flame war, I feel I must tell you that I was struck by this seemingly casual remark:

      preferably in jail
      Are you saying that you would prefer a person to be in jail only for not having a moral system? If so, this undermines the principle that one is innocent until proven guilty - something that someone from Brazil would not appreciate too much these days, if only we could still talk to him. And he's only one drop in the ocean.

      If I misread you (and I really hope I did), please accept my apologies in advance.

      perl -ple'$_=reverse' <<<ti.xittelop@oivalf

      Don't fool yourself.
        No. I would prefer that people be in jail for good cause.

        However if you actually have no moral system, then odds are good that you both have done things which you could be put in jail for, and routinely hurt (emotionally or physically) those around you. Which is why I'd prefer that you actually be caught and put in jail.

        The old term for someone with no moral system was "psychopath". When I say "no moral system" I mean a person who could lie, cheat, steal, rape, injure, maim or kill with no provocation or excuse, and who would feel no sense of guilt about having done any of those things. The only restraint on people like this is fear of being caught and punished. That restraint is important: most with this disorder do not actually go out and rape, injure, maim or kill random people. (Note that this list is shorter than the previous one...) But the deeds of some with the disorder have fixed a popular image of the psychopath as a crazy killer. Therefore the term "sociopath" is being used these days instead. The full-blown disorder is thankfully rare, but one could always wish that it was rarer still.
