PerlMonks
[OT] Ethical and Legal Screen Scraping

by eyepopslikeamosquito (Canon)
on Jul 25, 2005 at 13:59 UTC

As suggested by Abigail-II in Re: Web Robot, a polite robot should:

  • Obey robots.txt.
  • Don't flood a site.
  • Don't republish, especially not anything that might be copyrighted.
  • Abide by the site terms of service.
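Those rules map almost directly onto libwww-perl's LWP::RobotUA, which fetches and honours robots.txt on your behalf and enforces a delay between requests. A minimal sketch, assuming libwww-perl is installed; the agent name, contact address, and URL are placeholders:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;

# LWP::RobotUA is a drop-in replacement for LWP::UserAgent that checks
# robots.txt before every request and rate-limits its own fetches.
my $ua = LWP::RobotUA->new(
    agent => 'MyScraper/0.1',      # identify yourself honestly
    from  => 'you@example.com',    # contact address for the site operator
);
$ua->delay(1);                     # wait at least 1 minute between requests

my $res = $ua->get('http://example.com/page.html');
if ($res->is_success) {
    print $res->decoded_content;
} elsif ($res->code == 403 && $res->message =~ /robots\.txt/i) {
    warn "robots.txt asks us to stay out\n";  # RobotUA refuses these itself
} else {
    warn 'Fetch failed: ', $res->status_line, "\n";
}
```

Note that delay() is measured in minutes, not seconds.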

I was further interested to learn in Chip Salzenberg's letter at geeksunite that: "Federal courts have upheld that web spiders must obey the established robots.txt mechanism by which web site owners limit automated access and that a failure to obey robots.txt constitutes trespass".

However, I'm confused about who robots.txt is intended for. I understand robots.txt applies to heavy duty web spiders and indexers, such as a Google robot. But does it also apply to little screen scraping tools written by private individuals? For example, suppose I write a little tool using LWP::UserAgent or WWW::Mechanize (rather than LWP::RobotUA or WWW::Mechanize::Polite ?, say) that simply collects a number of web pages for me while I sleep. Is it illegal or unethical for such a scraper to ignore robots.txt?

If a commercial company sells a tool that allows non-programmer end users to write little screen scraping robots, is it unethical or illegal for such a product to not provide a mechanism to allow their end users to respect robots.txt?

Re: [OT] Ethical and Legal Screen Scraping
by wazoox (Prior) on Jul 25, 2005 at 14:33 UTC
    Is it illegal or unethical for such a scraper to ignore robots.txt?

    First, I'd affirm strongly that illegal is definitely not equivalent to unethical. Actually, the new federal ruling which demands that web spiders obey robots.txt may be legal, but it seems unethical to me.
    robots.txt, as I understand it, isn't in any manner an access control system. Declaring it one in a legal sense and enforcing it as such is plain nonsense from a justice system gone mad.
    The RFC defining the robots.txt standard ( robots.txt RFC ) states it very clearly:

    It is solely up to the visiting robot to consult this information and act accordingly. Blocking parts of the Web site regardless of a robot's compliance with this method are outside the scope of this memo.

    Regarding your own personal web spider, I'd say: who would ever know that you sucked up a site with it? How can someone prove that you didn't hit "Ctrl+S" in your browser while visiting the site? How can someone forbid you to save a personal backup copy of a publicly available document? This doesn't make sense. Republishing content as Google's cache or archive.org do may be questionable, but you're definitely allowed to store an unmodified copy of a web site for your personal use, unless you're living in Iran or China.

      How can someone forbid you to save a personal backup copy of a publicly available document?

      This whole thing is a can of worms, and it may not be sorted out clearly and legally for decades. However, copyright (in the US, anyway) means the right to control all copies, which includes digital ones. So if a site's terms of service and copyright notice say you can't save a copy of the site for personal use, then you can't do it legally.

      Who's to stop you? Probably no one. But just because something is technically possible doesn't suddenly make it ethical. It would be technically trivial for me to shoot out my neighbor's windows at 3am with a pellet gun from 250 yards and be all but assured of getting away with it. Technology, ability, doesn't change morality.

      The more respect--and peer pressure for continued mutual respect--we have for these sorts of things, the more open it can all be and remain. The more the windows are broken, the more the government cops have excuses to climb in and "protect" us.

        Caching has traditionally been considered acceptable, legal, and ethical. Most web browsers do it on a small scale, and Google does it on a large scale. If the purpose of a scraper is to enable, for example, an "offline reader", there is no ethical dilemma.

        Even long-term caches that are not shared with others are not an ethical dilemma so long as the material on the web site being cached remains publicly available. Beyond that point, there is a dilemma, and one has to consider whether the good of continued access to that data outweighs the good of complying with the copyright-holder's wishes. I'd take that on a case-by-case basis.

        <-radiant.matrix->
        Larry Wall is Yoda: there is no try{} (ok, except in Perl6; way to ruin a joke, Larry! ;P)
        The Code that can be seen is not the true Code
        "In any sufficiently large group of people, most are idiots" - Kaa's Law
        This whole thing is a can of worms, and it may not be sorted out clearly and legally for decades. However, copyright (in the US, anyway) means the right to control all copies, which includes digital ones. So if a site's terms of service and copyright notice say you can't save a copy of the site for personal use, then you can't do it legally.

        Not really true. Unless you explicitly agree to a TOS or EULA (thus entering into a contract for the use of the material), a TOS cannot overturn fair use. Single copies made by individuals, for individual use, are well within fair use.

        The ruling regarding robots.txt is a bad ruling made by a judge who is trying to make the best decision for the one instance before the court, not establish good law for all cases.

      You seem to be confused about ethics.

      Behaving ethically is not defined by anyone else's ability to prove that you did or did not behave ethically. If you do something unethical, it is unethical whether or not you're caught. If you find yourself having to make excuses for your behaviour, the odds are very good that you're behaving unethically.

      If you write a program to scrape another website, even if it is just for personal use, courtesy and ethics say that you should pay attention to robots.txt. Sure, you can do by hand anything that you automate. But when you automate, you're likely to do a lot more of it than when you do it by hand, and you're likely to do it a lot faster. This has implications for the website that you're visiting, and it makes sense that website operators would ask you to be particularly polite to them.

      If you say, "Oh, this is just for personal use" and turn a poorly written spider loose on a site, you're being rude and unethical. The website operator may well choose to repay your rudeness in kind and block you. They don't even have to go to court to do it either - they just notice that you're a bandwidth hog and lock you out.

      But you asked several hypothetical questions. Here are some not-so-hypothetical answers. Someone might recognize that you didn't just use your browser because of the speed with which you hit the site, because of your user agent, or because they get access to your computer and find the program that you used to do it. There are other things that might strike them as suspicious, but the above is a good starting list.

        The more I read your rebuttal to wazoox's point, the more I agree with you. So I'm only going to write about the points where I still disagree.

        First, you've introduced a new term. We started with ethical and legal questions. You've introduced courtesy. I think this is yet another question, quite separate from the original two. That said, it's still a very relevant question.

        Legal question: Disclaimer: I am not a lawyer. I also do not play one on TV. However, it's quite likely that the legal question is only a question for commercial entities. Of course, with OSS "corporations", the line between private, not-for-profit, and commercial starts getting quite blurred. Most likely, if you were to scrape someone's site for purely personal purposes, and the company objected, the courts would laugh the case right into the "dismissed" bucket. And likely get the website owner to pay your legal bills, if you asked nicely enough.

        Ethical question: since ethics seem to be relativist these days, arguably nothing is unethical if you "sincerely" believe it to be ethical. Subscribing to a higher standard, the question becomes one of intent. Is your intent to profit from this scraping in ways that the site owner did not intend for you to profit? For example, scraping a news site so that you can send it to your PDA so you can read it on the train to work is profiting precisely the way the site owner desires you to profit from their site. If the site provides a PDA version of their site and you use that URL, all the better. Scraping the pay-only portions of the site to redistribute, however, would not be ethical. (And, in a cruel twist of fate, probably illegal, too ;-})

        Finally, courtesy. If you're hitting a site for abnormal amounts of data, e.g., 50KB where average pages are 10KB, or 1MB where average pages are 100KB, you may want to send them a heads-up. But I'm not really sure it's required until you get into excess: for example, downloading 1-10MB in short spans of time (e.g., at full speed on a high-speed connection) multiple times per day. (And I wouldn't really count "two" as multiple here, e.g., grabbing the news on the way to work, and again right before you leave work to go home.) Then it might be warranted. Otherwise, I'm not really sure that most sites would care. You would also need to take into account whether the site owners are expecting that type of traffic (e.g., CNN) or not (e.g., PerlMonks ;-}).

        You seem to be confused about ethics. Behaving ethically is not defined by anyone else's ability to prove that you did or did not behave ethically.

        Looks like you didn't understand me. You're confusing the "ethical" aspect of my answer with the "legal" aspect. I meant "legally enforcing respect for robots.txt is abusive and unethical." I also meant "I can't imagine that you'd be legally in trouble for web-spidering, for personal use, without taking care of robots.txt, except in China or Iran." I never meant "go suck up all the websites you want and don't bother", ever.

        I also think the generalisation "if you use a web-spidering program without taking notice of robots.txt, you're ethically wrong" is questionable. Ethics aren't that simple. Perhaps the user is a political refugee trying to extract important information from some hostile hidden website. Is he "right" or "wrong"?

Re: [OT] Ethical and Legal Screen Scraping
by Anonymous Monk on Jul 25, 2005 at 15:08 UTC
    If a commercial company sells a tool that allows non-programmer end users to write little screen scraping robots, is it unethical or illegal for such a product to not provide a mechanism to allow their end users to respect robots.txt?
    First of all, I don't think the answer to the question depends on whether the tool is sold or not, or whether it's produced by a company that may, or may not, be commercial.

    As for the legal status: like many other things on the internet, that will remain unclear for a long time. Jurisprudence concerning internet matters grows very slowly; the fact that the internet doesn't care about our legal borders is no small factor in that slow growth.

    As for the ethical question: if the product doesn't provide a mechanism to respect robots.txt, it's certainly a bad product, and I guess selling bad products is unethical. It's certainly unethical (well, IMO) for the end user to use such a tool, though. It's the user of a robot created that way who carries the responsibility, just as you carry the responsibility that your car has headlights and brakes; the judge won't accept a defence of "but it was sold to me this way".

Re: [OT] Ethical and Legal Screen Scraping
by brian_d_foy (Abbot) on Jul 25, 2005 at 17:06 UTC

    The robots.txt file can specify sections that apply to a particular user-agent or to all user-agents. It can be for any user-agent, including one that you write yourself.
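    For example, a hypothetical robots.txt with one section for Google's crawler and a catch-all section for every other robot, your own scraper included (the paths are invented):

```text
# Applies only to Google's crawler
User-agent: Googlebot
Disallow: /drafts/

# Applies to every other robot, including a homegrown scraper
User-agent: *
Disallow: /private/
```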

    You know about robots.txt, and you know that it is a statement of the website's operator about how they want an automated agent to access their website. You have to decide for yourself whether ignoring those instructions violates your ethics. However, I tend to think that if you have to ask the question, you already know there is an ethical problem.

    For the legal questions, you'll have to talk to a lawyer who can handle the various local (or international) laws that may apply. However, I'd much rather you obey the spirit of the mechanism rather than the letter of the law. If things get too out of hand, we'll just get more regulation.

    --
    brian d foy <brian@stonehenge.com>
Re: [OT] Ethical and Legal Screen Scraping
by jhourcle (Prior) on Jul 26, 2005 at 01:11 UTC

    The general concept of robots.txt was to restrict automated processes with no user behind them. Its original syntax makes very little sense, from a security point of view (I mean, it basically tells people 'here's the stuff I don't want you looking at'). The later RFC included an 'Allow' to go along with 'Disallow'.
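    The Allow/Disallow mechanism is just prefix matching on the request path. A pure-Perl sketch of the idea, deliberately ignoring wildcards, %-escapes, and the empty-Disallow case, so this is an illustration rather than a full implementation:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decide whether a path may be fetched, given Allow/Disallow rules from
# the robots.txt group that matched our user-agent. The most specific
# (longest) matching prefix wins; no matching rule means "allowed".
sub allowed {
    my ($path, @rules) = @_;           # each rule: [ 'Allow'|'Disallow', prefix ]
    my ($verdict, $longest) = (1, -1);
    for my $r (@rules) {
        my ($type, $prefix) = @$r;
        next unless index($path, $prefix) == 0;   # simple prefix match
        if (length($prefix) > $longest) {
            $longest = length($prefix);
            $verdict = ($type eq 'Allow') ? 1 : 0;
        }
    }
    return $verdict;
}

my @rules = ( [ 'Disallow', '/private/' ], [ 'Allow', '/private/pub/' ] );
print allowed('/index.html',     @rules) ? "allowed\n" : "blocked\n";  # allowed
print allowed('/private/a.html', @rules) ? "allowed\n" : "blocked\n";  # blocked
print allowed('/private/pub/b',  @rules) ? "allowed\n" : "blocked\n";  # allowed
```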

    If a site really didn't want people visiting their content, they'd use access restrictions (and they can even filter by user agent and path, just as with a robots.txt file). But robots.txt tells the robot not to even bother requesting those files in the first place.

    It is not intended for user-agents, that is to say something that has a user at the helm -- for instance, a web browser that's retrieving files as you request them (even if you go and option-click 50 links on the page, so each one pops up in a new window in your browser). Or some of the more annoying browsers that go and pre-fetch every page that's linked to, just in case you might follow a link.

    I'd make sure to advertise my screen scraper with a unique user-agent string, and I'd look at robots.txt, in case they wanted to politely ask me to go away... but would it be unethical to ignore it? I'd say in your case, yes: you're planning on doing it while you sleep. If you had a user agent that presented the content in a different format (i.e., acting as a screen scraper, but interactive, not automated), I'd say it would be okay.

    Now, if you were going to start looking at robots.txt, I would think it would be unethical to then decide to ignore it-- it's one thing to say 'I am a user agent, not a robot', and not check for it, but it's a bad thing to look to see if they want you to go away, and then ignore the request.
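    Checking is cheap, in any case: WWW::RobotRules (also part of libwww-perl) will parse a robots.txt you've already fetched and answer the "may I?" question for an ordinary LWP::UserAgent. A sketch, with an invented robots.txt and invented URLs:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::RobotRules;

# A hypothetical robots.txt, as it might have been fetched from the site.
my $robots_txt = <<'END';
User-agent: *
Disallow: /private/
END

# Parse it once, then consult it before each request.
my $rules = WWW::RobotRules->new('MyScraper/0.1');
$rules->parse('http://example.com/robots.txt', $robots_txt);

for my $url ('http://example.com/index.html',
             'http://example.com/private/secret.html') {
    if ($rules->allowed($url)) {
        print "OK to fetch $url\n";   # hand it to your LWP::UserAgent here
    } else {
        print "robots.txt asks us to skip $url\n";
    }
}
```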

Re: [OT] Ethical and Legal Screen Scraping
by tlm (Prior) on Jul 26, 2005 at 03:16 UTC

    For example, suppose I write a little tool using LWP::UserAgent or WWW::Mechanize (rather than LWP::RobotUA or WWW::Mechanize::Polite ?, say) that simply collects a number of web pages for me while I sleep. Is it illegal or unethical for such a scraper to ignore robots.txt?

    Whether it is legal or not I won't get into, but AFAIC, I don't have any ethical objection to such a tool as long as it doesn't impose a greater load on the target server(s) than you would if you were to perform the same task manually.

    the lowliest monk

Re: [OT] Ethical and Legal Screen Scraping
by inman (Curate) on Jul 26, 2005 at 11:35 UTC
    Is it illegal? - That depends on the jurisdiction that applies. The legal implications of ignoring robots.txt may only apply under US federal law. If you are in another jurisdiction and broke a US federal law, then the relevant authorities would have to start extradition proceedings against you. The general principle with extradition is that you must be accused of an offence which would also be illegal in your own jurisdiction, so it all depends on your circumstances.

    Is it unethical? - That's up to you and your individual interpretation of right and wrong. If you are acting in a professional capacity or on behalf of a corporation then you will need to adopt a different ethical code that applies to your position. Elements of this may however contradict your personal ethics.

    The important thing to remember about law and ethics is that both are changing all the time. New laws are written, and old ones are discarded or updated. Personal ethics also change as society's views on issues evolve. It was once both legal and ethical to own slaves in the US. Now it is neither.

Re: [OT] Ethical and Legal Screen Scraping
by spiritway (Vicar) on Jul 27, 2005 at 02:38 UTC

    This seems to be a really hot topic...

    First, the article by Mr. Salzenberg alludes to a Federal law requiring spiders to obey the robots.txt file, but he unfortunately fails to cite that law. It isn't even clear whether he is referring to statute or case law, the distinction being crucial.

    I think it's clear that if someone has troubled to create a robots.txt file, he or she intends that spiders follow it. True - unless directories are password protected, people or spiders can access them. However, that doesn't make it legal or ethical to violate the request of the robots.txt file.

    To my way of thinking, "ethics" is more or less applying the Golden Rule, or not doing to people what you would find objectionable. It also includes not doing what *they* find objectionable, within reason. In this case, even if I don't mind spiders sucking up my Website, being ethical would require that I not do this to others, if they ask me not to.

    The question of what a Webmaster intended is simple to resolve - ask him or her. People do things for various reasons. Some might be trying to protect their Websites from Googlebot, but have no objection to you scraping it. Others might object to anyone using bandwidth unless there is a human being doing it - perhaps to view the ads on the site, or for other imponderable reasons. So, to me, the best way to resolve the matter is to ask the Webmaster.

    As for the law, I doubt very much that anything most of us do will ever come to the attention of the authorities, unless someone sucks up a whole commercial Website and presents it as their own. Even then, the most likely result would be a sternly-worded "cease and desist" letter from Boyd, Dewey, Cheetham, & Howe, LLC.

    I think that ethics is (are?) a personal issue where opinions are likely to vary widely. Even if everyone tries to abide by the Golden Rule, people are so widely divergent in their tastes that there is likely to be much disagreement. This is one of the reasons why laws are made - to enforce what is usually an unsatisfactory compromise.

Node Type: perlmeditation [id://477825]
Approved by Tanalis
Front-paged by tomhukins