spidering, multi-threading and netiquette

by dannoura (Pilgrim)
on Feb 21, 2004 at 12:07 UTC

hi,

I've recently built a web robot which is supposed to spider through an internet forum. The site has no robots.txt file. I estimate that the robot will send about one request per second. Offhand that doesn't seem like much of a burden for the site (it has about 1,000 people online at any given time, so the added traffic wouldn't be substantial), but just to make sure, I want to know whether I'd be violating any rules of etiquette.

I've also thought of downloading the pages in parallel. How would I go about doing this? Multi-threading? If I do implement this, the traffic will increase substantially for a short time. Could this be construed as an attack on the server?


p.s. I realize I could clear up some of this by contacting the website administrator but, at this point, for various reasons, I still don't want to do so, although I may in the future.

•Re: spidering, multi-threading and netiquette
by merlyn (Sage) on Feb 21, 2004 at 12:12 UTC
    p.s. I realize I could clear up some of this by contacting the website administrator but, at this point, for various reasons, I still don't want to do so, although I may in the future.
    I was with you up to here. If what you're doing is a fair and reasonable use of the system, you should have no qualms about asking the owners for guidelines. The fact that you're a bit squeamish here makes me wonder whether you're violating some terms of use.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      The fact that you're a bit squeamish here makes me wonder whether you're violating some terms of use.

      Not really. It's just that this is for a research idea which could easily be copied (in fact, I think the administrator should have thought of it himself), so I'd prefer that what I'm doing not be known until I have the results.

        But you could still address the webmaster with a question like:
        If I write a spidering agent that hits your site, do you have any guidelines, such as time-of-day, rate-of-hits, or areas examined?
        If you were to write me with that, I'd be very happy to tell you.

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.

Re: parallel downloading
by fizbin (Chaplain) on Feb 21, 2004 at 13:48 UTC
    First of all, if you want to reduce your load on the server's bandwidth, an easy way to do that is to use persistent connections. Unfortunately, unless some LWP development has happened that I'm not aware of, LWP doesn't support that. Fortunately, some other libraries (libwhisker, for example) do.
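
    Here is a minimal sketch of persistent connections from the client side, assuming a version of LWP::UserAgent that accepts the keep_alive option (later LWP releases added this via LWP::ConnCache); the agent string is a placeholder:

    use strict;
    use warnings;
    use LWP::UserAgent;

    # keep_alive => 1 keeps a connection per host open and reuses it, so
    # repeated requests to the same forum skip a fresh TCP handshake each time.
    my $ua = LWP::UserAgent->new(
        keep_alive => 1,
        agent      => 'my-research-bot/0.1',   # placeholder identifier
    );

    for my $url (@ARGV) {
        my $res = $ua->get($url);
        print $res->code, " $url\n";
    }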

    That said, although RFC 2616 is talking about persistent connections in this paragraph, I'd take it to heart even with non-persistent connections:

    Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy. A proxy SHOULD use up to 2*N connections to another server or proxy, where N is the number of simultaneously active users. These guidelines are intended to improve HTTP response times and avoid congestion.
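
    To stay under that two-connection cap while still downloading in parallel, one option (a sketch only; Parallel::ForkManager isn't mentioned anywhere in this thread) is to fork workers with the maximum set to 2; @ARGV stands in for whatever URL list the spider has queued:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use Parallel::ForkManager;

    # Cap the number of simultaneous child processes (and therefore the
    # number of simultaneous connections to the server) at 2.
    my $pm = Parallel::ForkManager->new(2);

    URL: for my $url (@ARGV) {
        $pm->start and next URL;              # parent: move on to the next URL
        my $res = LWP::UserAgent->new->get($url);
        print "$url: ", $res->status_line, "\n";
        $pm->finish;                          # child: exit when done
    }
    $pm->wait_all_children;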
    Finally, I'm a bit wary of the line "I estimate that the robot will send about one request per second." If that's your estimate, have some mechanism in place so that when it goes above 90 requests/minute the script is killed. I've seen far too many programs go wrong with a simple misplaced comma to trust that some program I write won't suddenly go wild without doing some testing first.

    The simplest way to do this is to log all requests to the screen and be fast with the Ctrl-C when things go bad.
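
    One minimal way to build that kill-switch (the details below are illustrative, not fizbin's code): count requests in a sliding one-minute window, log each one to the screen, and die if the rate ever goes above 90 per minute:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    my @recent;                         # timestamps of requests in the last minute
    my $MAX_PER_MINUTE = 90;

    sub guarded_get {
        my ($url) = @_;
        my $now = time;
        @recent = grep { $now - $_ < 60 } @recent;    # drop entries older than 60s
        die "Runaway spider: over $MAX_PER_MINUTE requests/minute, bailing out\n"
            if @recent >= $MAX_PER_MINUTE;
        push @recent, $now;
        print scalar(localtime $now), "  GET $url\n"; # log every request to the screen
        return $ua->get($url);
    }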

      The simplest way to do this is to log all requests to the screen and be fast with the Ctrl-C when things go bad.
      To the OP -- If there's any chance of it going out of control, there should be sleep instructions embedded in the code to reduce load. During debugging, these intervals should be fairly long (0.5 - 1 second between requests?). Once you learn that the script is well-behaved, you may be able to shorten them somewhat. As the bot writer, you have the utmost responsibility to limit your scans to the bare minimum possible. Not only does bandwidth cost money, but you could be slowing down access for other users. Also, if it is a simple spider, don't do something evil like running it continuously -- run it from a crontab (with a long interval) or manually.

      For sleeping between requests, check out Time::HiRes.
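
      A minimal sketch of that, with Time::HiRes providing a sleep that accepts fractional seconds; the one-second delay matches the "fairly long" debugging interval suggested above:

      use strict;
      use warnings;
      use Time::HiRes qw(sleep);        # overrides sleep to accept fractions
      use LWP::UserAgent;

      my $delay = 1.0;                  # seconds between requests; shorten later
      my $ua    = LWP::UserAgent->new;

      for my $url (@ARGV) {
          my $res = $ua->get($url);
          warn "Failed: $url (", $res->status_line, ")\n" unless $res->is_success;
          sleep $delay;                 # fractional sleep between requests
      }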

      As to the multithread question, this should be entirely up to the site admin. If he says no, don't spider it at all. If this were my site, I'd consider a multithreaded spider quite abusive, since it would be doing things normal web browsers would not do.

        The best way to ensure you have reasonable delays in your requests is to use a user agent that enforces those delays, e.g. LWP::RobotUA and LWP::Parallel::RobotUA.
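
        For example, a minimal sketch with LWP::RobotUA (the agent name and contact address are placeholders); it fetches and obeys robots.txt on its own and enforces a per-host delay, which is specified in minutes:

        use strict;
        use warnings;
        use LWP::RobotUA;

        my $ua = LWP::RobotUA->new('my-research-bot/0.1', 'you@example.org');
        $ua->delay(1/60);               # delay() is in minutes; 1/60 min = 1 second

        my $res = $ua->get('http://www.example.com/forum/');
        print $res->status_line, "\n";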

        You should also keep in mind that just because the server doesn't have a robots.txt today doesn't mean it won't have one tomorrow ... so make sure your code checks for it each time it's run: WWW::RobotRules.
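
        If you roll your own user agent instead, a sketch of re-checking robots.txt on every run with WWW::RobotRules might look like this (the host and paths are placeholders):

        use strict;
        use warnings;
        use WWW::RobotRules;
        use LWP::Simple qw(get);

        my $agent = 'my-research-bot/0.1';
        my $rules = WWW::RobotRules->new($agent);

        my $robots_url = 'http://www.example.com/robots.txt';
        my $robots_txt = get($robots_url);          # undef if the file still doesn't exist
        $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

        for my $url ('http://www.example.com/forum/', 'http://www.example.com/admin/') {
            my $verdict = $rules->allowed($url) ? 'fetch' : 'skip';
            print "$verdict $url\n";
        }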
