(OT) safe to mechanize?

by Anonymous Monk
on Aug 03, 2006 at 13:39 UTC (#565445)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I want to parse someone's site, but I don't want them to know that I'm parsing it.

Can a site track down who is parsing it? My program will need to log in. Also, will the webmaster notice that someone is parsing the site (i.e. lots of 'hits' to the site)?


Re: (OT) safe to mechanize?
by ptum (Priest) on Aug 03, 2006 at 13:48 UTC

    While I can think of several reasons you would want to do this, I can't think of any that make me want to help you. This isn't really a Perl question, either ... you might want to update your node to demonstrate that (a) you are doing something that is legal and moral, and (b) you are attempting something in Perl.

    No good deed goes unpunished. -- (attributed to) Oscar Wilde
Re: (OT) safe to mechanize?
by Fletch (Chancellor) on Aug 03, 2006 at 13:54 UTC

    Of course they can tell. Webmasters have eerie powers. And legions of flying monkeys which they'll send out over the Intarweb to track you down and give you such a wedgie.

    Not to mention the monkeys' eerie powers. Well, not really eerie; more . . . preternatural. I mean they can fly and give superwedgies, but that's about it. Above and beyond what one would normally expect from monkeys, at least.

    Oh, and if you've ever logged in to the site from anywhere you'd better make sure to check the name on your waistband now (because that's how the monkeys check they've got the right guy; if you don't have your name there, they'll take your wallet to first check your ID and give you the aforementioned wedgie).

Re: (OT) safe to mechanize?
by dorward (Curate) on Aug 03, 2006 at 13:44 UTC

    It sounds like you are planning to do something immoral and/or illegal - just don't. (And the site administrator will probably have the data they need to track you down should they want to.)

      What about privacy concerns? What if he means he's a human rights researcher and wants to avoid arrest by a repressive government?

        What if he's an agent of a repressive government who wants to surreptitiously research human rights organizations in order to find dissidents to arrest?

        If what they're trying to do is above board, they should state the objective up front rather than coming at it obliquely. There've been posters in the past that had questions in a similar vein that turned out to be trying to defraud someone (there've been other script kiddies; that's just the one that sticks in my mind due to the persistence and the astonishing utter ignorance of how networks work (OK, and the presence of one of my all-time highest rep nodes :)).

Re: (OT) safe to mechanize?
by imp (Priest) on Aug 03, 2006 at 13:49 UTC
    If you want a better chance of going undetected you could pause between requests, perhaps by using WWW::Mechanize::Sleepy.

    It would also be wise to set your user agent to something normal.

    This might be sufficient to avoid suspicion on most sites, but some sites will be more paranoid (I'd imagine Safari would be one of these).
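
    The two suggestions above (pausing between requests and setting a common user agent) might look roughly like the sketch below. It uses plain WWW::Mechanize with an explicit sleep rather than WWW::Mechanize::Sleepy, whose interface is not shown in this thread. The helper name jittered_delay, the 5 to 10 second delay range, and the choice of the 'Windows Mozilla' alias are assumptions made for illustration only; URLs are taken from the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(sleep);   # sleep() that accepts fractional seconds
use WWW::Mechanize;

# Hypothetical helper: return a delay in seconds in [$base, $base + $jitter).
sub jittered_delay {
    my ($base, $jitter) = @_;
    return $base + rand($jitter);
}

# URLs come from the command line; nothing is fetched if none are given.
my @urls = @ARGV;

if (@urls) {
    # autocheck => 1 makes failed requests die instead of failing silently.
    my $mech = WWW::Mechanize->new( autocheck => 1 );

    # 'Windows Mozilla' is one of the user-agent aliases that ship with
    # WWW::Mechanize; which alias counts as "normal" is an assumption here.
    $mech->agent_alias('Windows Mozilla');

    for my $url (@urls) {
        $mech->get($url);

        my $title = $mech->title();
        $title = '(no title)' unless defined $title;
        print "$url => $title\n";

        # Pause 5 to 10 seconds before the next request (arbitrary values).
        sleep( jittered_delay(5, 5) );
    }
}
```

    Spacing requests out like this also keeps the load on the server low, which is worth doing whether or not anyone is watching the logs.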

      As I mentioned, a sufficiently paranoid site will likely catch you, by detecting behaviour that is not typical of humans.

      A couple of suspicious activities:

      • Opening every link of a page in sequence
      • Opening links that are not visible. This could be due to style settings, or being in an invisible block
      If it's a commercial site with a monetary interest to protect, you will likely be caught.
Re: (OT) safe to mechanize?
by perlmonkey2 (Beadle) on Aug 03, 2006 at 18:06 UTC
    Would it be wrong/immoral to bring up the legions of free proxy servers on the net? It would take some complex code, but you could take a list of 100 proxy servers and randomly choose one for each link you open. Or, if you don't care whether they know they were parsed and your only concern is that they not know YOU parsed them, you could use just one proxy. Keep in mind that proxies protect you from a mad webmaster; they do nothing to protect you from the law.

      Yeah, that'll really help to hit the web site from a bazillion different anonymous proxies . . . so he can log into the site to spider it.

