PerlMonks  

Re^4: [OT] Ethical and Legal Screen Scraping (and courtesy)

by tilly (Archbishop)
on Jul 25, 2005 at 23:01 UTC


in reply to Re^3: [OT] Ethical and Legal Screen Scraping (and courtesy)
in thread [OT] Ethical and Legal Screen Scraping

Your flip comment about relativist ethics hits a sore point for me. Moral relativism does not absolve you from having a system of morals. And if you really believe it does, then depending on details you're a psychopath or a sociopath and I'd prefer that you be a long ways away from me (preferably in jail). For more on that read this post by me on another site explaining my views in more detail.

That said, I admit that different people will have different views of what is or is not ethical. And, as your examples illustrate, there are plenty of uses of an automated agent that most (including me) would agree justify ignoring robots.txt. But you have to think about what you're doing and why.

However, at least one of your examples is questionable. Suppose that you're downloading 1-10 MB in short spans of time, and the machine that you're hitting is a public webserver hosted on someone's personal machine. A machine whose bandwidth is no better than your own. While you might not think that that's an issue, the webserver operator may not agree. Nor may other users. Nor may other people who are hosted on that machine. This applies to both personal webservers and many small businesses. You may not know enough to determine whether this is an issue - but the website owner's opinion is right there in robots.txt.

Suppose, to take another example of yours, that you are scraping a pay-only portion of a site for personal use. Even though you believe that your use is ethical, the website owner has no way of knowing that. The website owner may or may not have had you click through an agreement that you won't use automated bots. Now what you're doing may be illegal (you are violating a contract) and is of questionable morality despite your justifications (you are breaking your word). Furthermore you run a real risk of having the website owner notice what you're doing and block you. (Whether or not there is an agreement.)

This doesn't just happen at pay sites. I know of multiple people who have found themselves shut out of Google after testing a bot there.

Now I'll agree that most websites don't monitor things that closely. As a practical matter, plenty of people ignore robots.txt and don't get caught. But I still think that if you are writing a bot, then you should either pay attention to robots.txt or have a good reason not to.
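To make "pay attention to robots.txt" concrete, here is a minimal sketch of how a bot might consult it before fetching a page. It uses Python's standard urllib.robotparser module. The bot name, the sample rules and the example.com URLs are assumptions made up for illustration, and the rules are parsed from a string so the sketch runs without touching any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical bot name and sample rules, for illustration only.
AGENT = "ExampleBot"
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(rules: RobotFileParser, url: str) -> bool:
    """Return True if the rules allow AGENT to request url."""
    return rules.can_fetch(AGENT, url)


rules = load_rules(SAMPLE_ROBOTS_TXT)
for url in ("http://example.com/index.html",
            "http://example.com/private/report.html"):
    verdict = "allowed" if may_fetch(rules, url) else "disallowed"
    print(url, "->", verdict)

# Seconds the site asks bots to wait between requests, or None if unspecified.
print("crawl delay:", rules.crawl_delay(AGENT))
```

A real bot would fetch the site's own robots.txt first and sleep for the requested delay between requests. Whether to override a "disallowed" answer is exactly the judgment call discussed above.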


Replies are listed 'Best First'.
Re^5: [OT] Ethical and Legal Screen Scraping (and courtesy)
by Tanktalus (Canon) on Jul 26, 2005 at 01:54 UTC

    My "flip" comment on relativist ethics was, I think, completely misunderstood because you seem to have a very large sore spot surrounding the issue. The point is that if you have "relative" ethics (everyone is entitled to their own ethics, and no one's ethics are more or less "good" than anyone else's), there is no common measuring stick that you and I can both apply to the situation at hand. Since we are even having a common discussion, I have to assume absolutist ethics (one thing may be considered "more ethical" than another, regardless of who does it - this doesn't mean that we necessarily agree on what is more ethical than something else, just the concept that there is such a measuring stick), otherwise there is no discussion. I merely pointed out this assumption so that, if you wanted to disagree on which framework to use for the discussion, I could show that I've actually thought about the differences and we could have a frank, but brief, discussion on that point.

    I do not believe I have mischaracterised ethical relativism in any way. I pointed out that the one requirement is that your belief is "sincere" (i.e., you're not just making it up to suit the situation). Of course, as with the rest of relativistic philosophy, the definition of "sincere" is still a bit blurry, thus the quotes.

    Back to the question at hand. You continue your misunderstanding by questioning my "courtesy" example with an example which I have explicitly taken care of with the last sentence in the post you are replying to. If that machine is a public webserver hosted on someone's personal machine, I would argue that this is the scenario where you "need to take into account whether the site owners are expecting that type of traffic (e.g., CNN) or not (e.g., PerlMonks ;-})."

    As to the website owner who decides to block you from access that you're paying for: I want to point out that I treated the three questions, legal, ethical, and courtesy, independently. Just because something may be ethical doesn't mean it's legal, or vice versa. Just because something may be done out of courtesy does not mean that ethics demands it, nor that the law does. So I may believe it's ethical to scrape the website for the data you've paid for, but the legal contract may say otherwise. And that's a completely valid description of reality (whether or not I agree that the contract is fair, i.e., ethical).

    Service providers provide their service under certain conditions which they are legally entitled to enforce. However, if what they are blocking would fall under "fair use" style common law, they may find their customers leaving for providers who provide more lenient access.

    Google gets away with what they're doing by having a free registration system for API access where they can ensure a quality of service to everyone simultaneously. It seems eminently fair, and far-sighted, to me. However, if CNN blocked me from minor amounts of scraping (where I define minor to be in the "ethical" example above), I would simply find another news provider. Say CBC.ca or news.yahoo.com or maybe something on Google. The choice to serve is equal to the choice to find a new provider of service ;-)

Re^5: [OT] Ethical and Legal Screen Scraping (and courtesy)
by polettix (Vicar) on Jul 25, 2005 at 23:41 UTC
    You're introducing another concept here, morality, which is definitely not ethics.

    Without any intention of starting a flame war, I feel the urge to tell you that I was struck by this seemingly casual remark:

    preferably in jail
    Are you saying that you would prefer a person to be in jail only for not having a moral system? If so, this undermines the principle that one is innocent until proven guilty - something that someone from Brazil would not appreciate too much these days, if only we could still talk to him. And he's only one drop in the ocean.

    If I misread you (and I really hope I did), please accept my apologies in advance.

    Flavio
    perl -ple'$_=reverse' <<<ti.xittelop@oivalf

    Don't fool yourself.
      No. I would prefer that people be in jail for good cause.

      However if you actually have no moral system, then odds are good that you both have done things which you could be put in jail for, and routinely hurt (emotionally or physically) those around you. Which is why I'd prefer that you actually be caught and put in jail.

      The old term for someone with no moral system was "psychopath". When I say "no moral system" I mean a person who could lie, cheat, steal, rape, injure, maim or kill with no provocation or excuse, and who would feel no sense of guilt about having done any of those things. The only restraint on people like this is fear of being caught and punished. That restraint is important: most with this disorder do not actually go out and rape, injure, maim or kill random people. (Note that this list is shorter than the previous one...) But the deeds of some with the disorder have fixed a popular image of the psychopath as a crazy killer. Therefore the term "sociopath" is being used these days instead. The full-blown disorder is thankfully rare, but one could always wish that it were rarer still.
