PerlMonks  

Re^2: [OT] Ethical and Legal Screen Scraping

by tilly (Archbishop)
on Jul 25, 2005 at 21:33 UTC (#477983)


in reply to Re: [OT] Ethical and Legal Screen Scraping
in thread [OT] Ethical and Legal Screen Scraping

You seem to be confused about ethics.

Behaving ethically is not defined by anyone else's ability to prove that you did or did not behave ethically. If you do something unethical, it is unethical whether or not you're caught. If you find yourself having to make excuses for your behaviour, the odds are very good that you're behaving unethically.

If you write a program to scrape another website, even if it is just for personal use, courtesy and ethics say that you should pay attention to robots.txt. Sure, you can do by hand anything that you automate. But when you automate you're likely to do a lot more of it than when you do it by hand. And you're likely to do it a lot faster. This has implications for the website that you're visiting, and it makes sense that website operators would ask you to be particularly polite to them.
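
As a rough illustration of what "paying attention to robots.txt" can look like in a script, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the URLs, and the agent name `example-personal-spider` are all made up for the example; a real spider would fetch the file from the site it intends to visit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

AGENT = "example-personal-spider"  # assumed name, not a real user agent
parser = build_parser(ROBOTS_TXT)

# Ask before fetching: is this path open to our agent?
allowed = parser.can_fetch(AGENT, "http://example.com/news/today.html")
blocked = parser.can_fetch(AGENT, "http://example.com/private/archive.html")

# The operator's requested pause between requests, if any (None when absent).
delay = parser.crawl_delay(AGENT)
```

Whether to obey the answer is still a judgment call, but at least the script has asked the question before hitting the site.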

If you say, "Oh, this is just for personal use" and turn a poorly written spider loose on a site, you're being rude and unethical. The website operator may well choose to repay your rudeness in kind and block you. They don't even have to go to court to do it either - they just notice that you're a bandwidth hog and lock you out.
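
One way to keep a personal spider from becoming a bandwidth hog is to enforce a minimum pause between requests. The sketch below is only one possible approach; the class name `PoliteThrottle` and the five-second interval are assumptions chosen for illustration, not anything the site operator has specified.

```python
import time

class PoliteThrottle:
    """Enforce a minimum delay between successive requests (illustrative helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds to leave between requests
        self._clock = clock               # injectable so the logic can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Sleep until min_interval has passed since the last call; return seconds slept."""
        pause = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            pause = max(0.0, self.min_interval - elapsed)
            if pause > 0:
                self._sleep(pause)
        self._last = self._clock()
        return pause

# Assumed interval; prefer whatever delay the site itself asks for.
throttle = PoliteThrottle(min_interval=5.0)
```

Calling `throttle.wait()` before each fetch keeps the request rate closer to what a person clicking by hand would produce. Sending a User-Agent string that honestly identifies the script also lets the operator see what is hitting the site.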

But you asked several hypothetical questions. Here are some not-so-hypothetical answers. Someone might recognize that you didn't just use your browser because of the speed with which you hit the site, because of your user agent, or because they get access to your computer and find the program that you used to do it. There are other things that might strike them as suspicious, but the above is a good starting list.

Re^3: [OT] Ethical and Legal Screen Scraping (and courtesy)
by Tanktalus (Canon) on Jul 25, 2005 at 22:38 UTC

    The more I read your rebuttal to wazoox's point, the more I agree with you. So I'm only going to write about the points where I still disagree.

    First, you've introduced a new term. We started with ethical and legal questions. You've introduced courtesy. I think this is yet another question, quite separate from the original two. That said, it's still a very relevant question.

    Legal question: Disclaimer: I am not a lawyer. I also do not play one on TV. However, it's quite likely that the legal question is only a question for commercial entities. Of course, with OSS "corporations", the line between private, not-for-profit, and commercial starts getting quite blurred. Most likely, if you were to scrape someone's site for purely personal purposes, and the company objected, the courts would laugh the case right into the "dismissed" bucket. And likely get the website owner to pay your legal bills, if you asked nicely enough.

    Ethical question: since ethics seem to be relativist these days, arguably nothing is unethical if you "sincerely" believe it to be ethical. Subscribing to a higher standard, the question becomes one of intent. Is your intent to profit from this scraping in ways that the site owner did not intend for you to profit? For example, scraping a news site so that you can send it to your PDA so you can read it on the train to work is profiting precisely the way the site owner desires you to profit from their site. If the site provides a PDA version of their site and you use that URL, all the better. Scraping the pay-only portions of the site to redistribute, however, would not be ethical. (And, in a cruel twist of fate, probably illegal, too ;-})

    Finally, courtesy. If you're hitting a site for abnormal amounts of data, e.g., 50KB where average pages are 10KB, or 1MB where average pages are 100KB, you may want to send them a heads-up. But I'm not sure it's really required until you get into excess. For example, if you were downloading 1-10MB in short spans of time (e.g., full-speed on high-speed connections) multiple times per day, then it might be warranted. (I wouldn't really count "two" as multiple here, e.g., if you were grabbing the news on the way to work, and again right before you left work to go home.) Otherwise, I'm not really sure that most sites would care. You would also need to take into account whether the site owners are expecting that type of traffic (e.g., CNN) or not (e.g., PerlMonks ;-}).

      Your flip comment about relativist ethics hits a sore point for me. Moral relativism does not absolve you from having a system of morals. And if you really believe it does, then depending on details you're a psychopath or a sociopath and I'd prefer that you be a long ways away from me (preferably in jail). For more on that read this post by me on another site explaining my views in more detail.

      That said, I admit that different people will have different views of what is or is not ethical. And, as your examples illustrate, there are plenty of uses of an automated agent that most (including me) would agree justify ignoring robots.txt. But you have to think about what you're doing and why.

      However at least one of your examples is questionable. Suppose that you're downloading 1-10 MB in short spans of time, and the machine that you're hitting is a public webserver hosted on someone's personal machine. A machine whose bandwidth is no better than your own. While you might not think that that's an issue, the webserver operator may not agree. Nor may other users. Nor may other people who are hosted on that machine. This applies for both personal webservers and also many small businesses. You may not know enough to determine whether this is an issue - but the website owner's opinion is right there in robots.txt.

      Suppose, to take another example of yours, that you are scraping a pay-only portion of a site for personal use. Even though you believe that your use is ethical, the website owner has no way of knowing that. The website owner may or may not have had you click through an agreement that you won't use automated bots. Now what you're doing may be illegal (you are violating a contract) and is of questionable morality despite your justifications (you are breaking your word). Furthermore you run a real risk of having the website owner notice what you're doing and block you. (Whether or not there is an agreement.)

      This doesn't just happen at pay sites. I know of multiple people who have found themselves shut out of Google after testing a bot there.

      Now I'll agree that most websites don't monitor things that closely. As a practical matter, plenty of people ignore robots.txt and don't get caught. But I still think that if you are writing a bot, then you should either pay attention to robots.txt or have a good reason not to.

        My "flip" comment on relativist ethics was, I think, completely misunderstood because you seem to have a very large sore spot surrounding the issue. The point is that if you have "relative" ethics (everyone is entitled to their own ethics, which are no more or less "good" than the ethics of others), there is no common measuring stick that you and I can both apply to the situation at hand. Since we are having a common discussion at all, I have to assume absolutist ethics (one thing may be considered "more ethical" than another, regardless of who does it; this doesn't mean that we necessarily agree on what is more ethical than something else, just that such a measuring stick exists), otherwise there is no discussion. I merely pointed out this assumption so that, if you wanted to disagree on which framework should surround the discussion, I could show that I've actually thought about the differences and we could have a frank, but brief, discussion on that point.

        I do not believe I have mischaracterised ethical relativism in any way. I pointed out that the one requirement is that your belief is "sincere" (i.e., you're not just making it up to suit the situation). Of course, as with the rest of relativistic philosophy, the definition of "sincere" is still a bit blurry, thus the quotes.

        Back to the question at hand. You continue your misunderstanding by questioning my "courtesy" example with an example which I have explicitly taken care of with the last sentence in the post you are replying to. If that machine is a public webserver hosted on someone's personal machine, I would argue that this is the scenario where you "need to take into account whether the site owners are expecting that type of traffic (e.g., CNN) or not (e.g., PerlMonks ;-})."

        As to the website owner who decides to block you from access that you're paying for, I want to point out that I treated the three questions, legal, ethical, and courtesy, independently. Just because something may be ethical doesn't mean it's legal, or vice versa. Just because something may be done out of courtesy does not mean that ethics demands it, nor that the law does. So I may believe it's ethical to scrape the website for the data you've paid for, but the legal contract may say otherwise. And that's a completely valid description of reality (whether or not I agree that the contract is fair, i.e., ethical).

        Service providers provide their service under certain conditions which they are legally entitled to enforce. However, if what they are blocking would fall under "fair use" style common law, they may find their customers leaving for providers who provide more lenient access.

        Google gets away with what they're doing by having a free registration system for API access where they can ensure a quality of service to everyone simultaneously. It seems eminently fair, and far-sighted, to me. However, if CNN blocked me from minor amounts of scraping (where I define minor to be in the "ethical" example above), I would simply find another news provider. Say CBC.ca or news.yahoo.com or maybe something on Google. The choice to serve is equal to the choice to find a new provider of service ;-)

        You're introducing another concept here, morality, which is definitely not ethics.

        Without any intention of starting a flame war, I feel compelled to tell you that I was struck by this seemingly casual remark:

        preferably in jail
        Are you saying that you would prefer a person to be in jail only for not having a moral system? If so, this undermines the principle that one is innocent until proven guilty, something that someone from Brazil would appreciate these days, if only we could still talk to him. And he's only one drop in the ocean.

        If I misread you (and I really hope I did), please accept my apologies in advance.

        Flavio
        perl -ple'$_=reverse' <<<ti.xittelop@oivalf

        Don't fool yourself.
Re^3: [OT] Ethical and Legal Screen Scraping
by wazoox (Prior) on Jul 26, 2005 at 09:46 UTC
    You seem to be confused about ethics. Behaving ethically is not defined by anyone else's ability to prove that you did or did not behave ethically.

    Looks like you didn't understand me. You're confusing the "ethical" aspect of my answer with the "legal" aspect. I meant "legally enforcing compliance with robots.txt is abusive and unethical". I also meant "I can't imagine that you'd be in legal trouble for web-spidering, for personal use, without heeding robots.txt, except in China or Iran". I never meant "Go suck all websites you want to and don't bother", ever.

    I also think the generalisation "if you use a web-spidering program without taking notice of robots.txt, you're ethically wrong" is questionable. Ethics aren't that simple. Perhaps the person running the spider is a political refugee trying to extract important information from some hostile hidden website. Is he "right" or "wrong"?

      I had not understood that you were saying that the law itself is unethical. I think that distinction was probably lost on many others as well.

      However, as I think the subsequent discussion clarified, my views are not as simple as saying that thou shalt always do what robots.txt says. (Paying attention to it does not mean that you necessarily agree or do what it says...)

        Well, I probably wasn't clear enough myself. And fortunately it's not yet exactly a law but case law (jurisprudence), AFAIK, which is somewhat better (if a law comes someday to draw the line...)
