Re: [OT] Ethical and Legal Screen Scraping

by wazoox (Prior)
on Jul 25, 2005 at 14:33 UTC


in reply to [OT] Ethical and Legal Screen Scraping

Is it illegal or unethical for such a scraper to ignore robots.txt?

First, I'd affirm strongly that illegal is definitely not equivalent to unethical. Actually, the new federal ruling which demands that web spiders obey robots.txt may be legal, but it seems unethical to me.
robots.txt, as I understand it, isn't in any manner an access-control system. Declaring it one in law and enforcing it as such is plain nonsense from a justice system gone mad.
The RFC defining the robots.txt standard (the robots.txt RFC) states it very clearly:

It is solely up to the visiting robot to consult this information and act accordingly. Blocking parts of the Web site regardless of a robot's compliance with this method are outside the scope of this memo.
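
For reference, the mechanism itself could hardly be simpler; a robots.txt file is just a plain-text advisory notice at the root of the site. A hypothetical example:

    # A hypothetical robots.txt: asks all robots to stay out of
    # /private/, and asks one particular robot to stay away entirely.
    User-agent: *
    Disallow: /private/

    User-agent: BadBot
    Disallow: /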

Regarding your own personal web spider, I'd say: who would ever know that you sucked up a site with it? How can someone prove that you didn't hit "Ctrl+S" in your browser while visiting the site? How can someone forbid you to save a personal backup copy of a publicly available document? This doesn't make sense. Republishing content, as Google's cache or archive.org do, may be questionable, but you're definitely allowed to store an unmodified copy of a web site for your personal use, unless you're living in Iran or China.


Replies are listed 'Best First'.
Re^2: [OT] Ethical and Legal Screen Scraping
by tilly (Archbishop) on Jul 25, 2005 at 21:33 UTC
    You seem to be confused about ethics.

    Behaving ethically is not defined by anyone else's ability to prove that you did or did not behave ethically. If you do something unethical, it is unethical whether or not you're caught. If you find yourself having to make excuses for your behaviour, the odds are very good that you're behaving unethically.

    If you write a program to scrape another website, even if it is just for personal use, courtesy and ethics say that you should pay attention to robots.txt. Sure, you can do by hand anything that you automate. But when you automate, you're likely to do a lot more of it than when you do it by hand, and you're likely to do it a lot faster. This has implications for the website that you're visiting, and it makes sense that website operators would ask you to be particularly polite to them.
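
    In Perl, being that polite is nearly free. Here's a minimal sketch (the URL and contact address are made up) using LWP::RobotUA, which fetches robots.txt for you, honors it, and rate-limits requests to the same host:

        use strict;
        use warnings;
        use LWP::RobotUA;

        # LWP::RobotUA is a drop-in LWP::UserAgent replacement that
        # consults robots.txt before each request and enforces a
        # minimum delay between hits on the same host.
        my $ua = LWP::RobotUA->new(
            agent => 'MyPersonalBot/0.1',
            from  => 'me@example.com',   # contact address, required
        );
        $ua->delay(1);                   # at least 1 minute between requests

        my $res = $ua->get('http://www.example.com/page.html');
        if ($res->is_success) {
            print $res->decoded_content;
        }
        else {
            # a robots.txt refusal comes back as "403 Forbidden by robots.txt"
            warn 'Fetch failed: ', $res->status_line, "\n";
        }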

    If you say, "Oh, this is just for personal use" and turn a poorly written spider loose on a site, you're being rude and unethical. The website operator may well choose to repay your rudeness in kind and block you. They don't even have to go to court to do it either - they just notice that you're a bandwidth hog and lock you out.

    But you asked several hypothetical questions. Here are some not-so-hypothetical answers. Someone might recognize that you didn't just use your browser because of the speed with which you hit the site, because of your user agent, or because they get access to your computer and find the program that you used to do it. There are other things that might strike them as suspicious, but the above is a good starting list.
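
    For the curious, spotting a scraper doesn't take much on the operator's side either. A rough sketch (the log path and threshold are made up) that counts requests per client IP in an Apache-style access log:

        use strict;
        use warnings;

        # Tally requests per client IP; anything well above the pack
        # is either a busy proxy or a bot worth a closer look.
        my %hits;
        open my $log, '<', '/var/log/apache/access.log' or die "open: $!";
        while (<$log>) {
            $hits{$1}++ if /^(\S+)/;    # first field is the client IP
        }
        close $log;

        for my $ip (sort { $hits{$b} <=> $hits{$a} } keys %hits) {
            last if $hits{$ip} < 1000;  # arbitrary "suspicious" cutoff
            print "$ip\t$hits{$ip} requests\n";
        }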

      The more I read your rebuttal to wazoox's point, the more I agree with you. So I'm only going to write about the points where I still disagree.

      First, you've introduced a new term. We started with ethical and legal questions. You've introduced courtesy. I think this is yet another question, quite separate from the original two. That said, it's still a very relevant question.

      Legal question: Disclaimer: I am not a lawyer. I also do not play one on TV. However, it's quite likely that the legal question is only a question for commercial entities. Of course, with OSS "corporations", the line between private, not-for-profit, and commercial starts getting quite blurred. Most likely, if you were to scrape someone's site for purely personal purposes, and the company objected, the courts would laugh the case right into the "dismissed" bucket. And likely get the website owner to pay your legal bills, if you asked nicely enough.

      Ethical question: since ethics seem to be relativist these days, arguably nothing is unethical if you "sincerely" believe it to be ethical. Subscribing to a higher standard, the question becomes one of intent. Is your intent to profit from this scraping in ways that the site owner did not intend for you to profit? For example, scraping a news site so that you can send it to your PDA so you can read it on the train to work is profiting precisely the way the site owner desires you to profit from their site. If the site provides a PDA version of their site and you use that URL, all the better. Scraping the pay-only portions of the site to redistribute, however, would not be ethical. (And, in a cruel twist of fate, probably illegal, too ;-})

      Finally, courtesy. If you're hitting a site for abnormal amounts of data, e.g., 50KB where average pages are 10KB, or 1MB where average pages are 100KB, you may want to send them a heads-up. But I'm not sure it's required until you get into real excess: say, downloading 1-10MB in short spans of time (e.g., at full speed on a high-speed connection) multiple times per day. (And I wouldn't count "two" as multiple here, e.g., grabbing the news on the way to work and again right before leaving for home.) Then a heads-up might be warranted. Otherwise, I'm not sure most sites would care. You would also need to take into account whether the site owners are expecting that type of traffic (e.g., CNN) or not (e.g., PerlMonks ;-}).

        Your flip comment about relativist ethics hits a sore point for me. Moral relativism does not absolve you from having a system of morals. And if you really believe it does, then depending on details you're a psychopath or a sociopath and I'd prefer that you be a long ways away from me (preferably in jail). For more on that read this post by me on another site explaining my views in more detail.

        That said, I admit that different people will have different views of what is or is not ethical. And, as your examples illustrate, there are plenty of uses of an automated agent that most (including me) would agree justify ignoring robots.txt. But you have to think about what you're doing and why.

        However at least one of your examples is questionable. Suppose that you're downloading 1-10 MB in short spans of time, and the machine that you're hitting is a public webserver hosted on someone's personal machine. A machine whose bandwidth is no better than your own. While you might not think that that's an issue, the webserver operator may not agree. Nor may other users. Nor may other people who are hosted on that machine. This applies for both personal webservers and also many small businesses. You may not know enough to determine whether this is an issue - but the website owner's opinion is right there in robots.txt.

        Suppose, to take another example of yours, that you are scraping a pay-only portion of a site for personal use. Even though you believe that your use is ethical, the website owner has no way of knowing that. The website owner may or may not have had you click through an agreement that you won't use automated bots. Now what you're doing may be illegal (you are violating a contract) and is of questionable morality despite your justifications (you are breaking your word). Furthermore you run a real risk of having the website owner notice what you're doing and block you. (Whether or not there is an agreement.)

        This doesn't just happen at pay sites. I know of multiple people who have found themselves shut out of Google after testing a bot there.

        Now I'll agree that most websites don't monitor things that closely. As a practical matter, plenty of people ignore robots.txt and don't get caught. But I still think that if you are writing a bot, then you should either pay attention to robots.txt or have a good reason not to.
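
        Paying attention need not mean blind obedience, either; you can read the rules and then decide. A sketch (the URLs are hypothetical) using WWW::RobotRules to consult a site's robots.txt before fetching:

            use strict;
            use warnings;
            use LWP::Simple qw(get);
            use WWW::RobotRules;

            # Fetch and parse the site's robots.txt, then ask it
            # whether our bot is welcome at a given URL.
            my $rules = WWW::RobotRules->new('MyPersonalBot/0.1');

            my $robots_url = 'http://www.example.com/robots.txt';
            my $robots_txt = get($robots_url);
            $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

            my $url = 'http://www.example.com/archive/report.html';
            if ($rules->allowed($url)) {
                print "robots.txt permits fetching $url\n";
            }
            else {
                print "robots.txt asks bots to skip $url\n";
            }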

      You seem to be confused about ethics. Behaving ethically is not defined by anyone else's ability to prove that you did or did not behave ethically.

      Looks like you didn't understand me. You're confusing the "ethical" aspect of my answer with the "legal" aspect. I meant "legally enforcing compliance with robots.txt is abusive and unethical." I also meant "I can't imagine that you'd be in legal trouble for web-spidering for personal use without taking notice of robots.txt, except in China or Iran." I never meant "go suck up all the websites you want and don't bother", ever.

      I also think the generalisation "if you use a web-spidering program without taking notice of robots.txt, you're ethically wrong" is questionable. Ethics aren't that simple. Perhaps the user is a political refugee trying to extract important information from some hostile, hidden website. Is he "right" or "wrong"?

        I hadn't understood that you were saying that the law itself is unethical. I think that distinction was probably lost on many others as well.

        However, as I think the subsequent discussion clarified, my views are not as simple as saying that thou shalt always do what robots.txt says. (Paying attention to it does not mean that you necessarily agree or do what it says...)

Re^2: [OT] Ethical and Legal Screen Scraping
by Your Mother (Bishop) on Jul 25, 2005 at 16:07 UTC
    How can someone forbid you to save a personal backup copy of a publicly available document?

    This whole thing is a can of worms, and it may not be sorted out clearly and legally for decades. However, copyright (in the US, anyway) means the rights to all copies, which includes digital ones. So if a site's terms of service and copyright notice say you can't save a copy of the site for personal use, then you can't do it legally.

    Who's to stop you? Probably no one. But just because something is technically possible doesn't suddenly make it ethical. It would be technically trivial for me to shoot out my neighbor's windows at 3am with a pellet gun from 250 yards and be all but assured of getting away with it. Technology, ability, doesn't change morality.

    The more respect--and peer pressure for continued mutual respect--we have for these sorts of things, the more open it can all be and remain. The more the windows are broken, the more the government cops have excuses to climb in and "protect" us.

      Caching has traditionally been considered acceptable, legal, and ethical. Most web browsers do it on a small scale, and Google does it on a large scale. If the purpose of a scraper is to enable, for example, an "offline reader", there is no ethical dilemma.

      Even long-term caches that are not shared with others are not an ethical dilemma so long as the material on the web site being cached remains publicly available. Beyond that point, there is a dilemma, and one has to consider whether the good of continued access to that data outweighs the good of complying with the copyright-holder's wishes. I'd take that on a case-by-case basis.
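
      On the small scale, polite caching is built right into LWP. A sketch (the URL and filename are invented) using mirror(), which sends If-Modified-Since and only downloads the page when it has changed since the saved copy:

          use strict;
          use warnings;
          use LWP::UserAgent;

          # mirror() does a conditional GET against the saved file's
          # timestamp, so an unchanged page costs almost no bandwidth.
          my $ua  = LWP::UserAgent->new(agent => 'OfflineReader/0.1');
          my $res = $ua->mirror('http://www.example.com/news.html', 'news.html');

          if    ($res->code == 304) { print "Local copy is still current\n" }
          elsif ($res->is_success)  { print "Updated local copy\n" }
          else  { warn 'Mirror failed: ', $res->status_line, "\n" }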

        Magnetic or optical caches on disk are just a backup for my memory. It is not a lapse to remember published information past the point at which its publisher would rather you forgot it, and it is not a lapse to do that remembering with the aid of a recording, either.

        Just because I kept a copy on my hard drive, on CD, in a printout, or in some handwritten notes doesn't mean I'm obligated under any ethical system I'm aware of to destroy it just because my original source stopped publishing. The copyright holder's wishes are irrelevant.

        Yep, I agree. Google in particular does exactly the right thing, I think. They allow you to control how they handle your content almost completely. You can exclude your pages from their caches while still including them in their search results.
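
        For anyone wanting that control: the usual mechanism, which Google honors, is a robots meta tag in each page. This one asks crawlers to index the page but not to serve a cached copy of it:

            <meta name="robots" content="noarchive">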

      This whole thing is a can of worms, and it may not be sorted out clearly and legally for decades. However, copyright (in the US, anyway) means the rights to all copies, which includes digital ones. So if a site's terms of service and copyright notice say you can't save a copy of the site for personal use, then you can't do it legally.

      Not really true. Unless you explicitly agree to a TOS or EULA (thus entering into a contract for the use of the material), a TOS cannot overturn fair use. Single copies made by individuals, for individual use, are well within fair use.

      The ruling regarding robots.txt is a bad ruling made by a judge who is trying to make the best decision for the one instance before the court, not establish good law for all cases.

        I think you have been misled on the term, though it is a very slippery one that can be interpreted widely. Single copies made by individuals are not fair use. Fair use covers quotes or excerpts that are brief in relation to the whole, aren't for direct gain, and don't hurt the copyright holder.

        Somehow, since Napster, stealing has gained an air of common-wisdom legality as long as it only benefits one person and not a company. If you buy something and make a copy for personal use, that's a different matter, depending on the mood of the court and the medium involved.
