Infinite loop prevention for spider

by Wassercrats
on Nov 09, 2003 at 13:01 UTC
Wassercrats has asked for the wisdom of the Perl Monks concerning the following question:

This isn't really a programming question, but I don't know where else to ask. I want to make sure the bot I've created (in Perl) doesn't get caught in an infinite loop. I'm particularly concerned about encountering a URL containing dynamic PATH_INFO. Doesn't some PATH_INFO look like a regular directory? Is it possible that a dynamically created web page will include a link containing path info with some kind of id# that changes each time the page is loaded, but that always points to the same page? Should I handle that by limiting the number of pages parsed from a single domain to some arbitrary number? Any suggestions on what that number should be?

The script is very customized and essentially complete. The only modules it requires are LWP::UserAgent and HTTP::Request.
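For illustration, here is a minimal sketch of the per-domain cap idea from the question, using LWP::UserAgent. The limit of 500, the use of the URI module for host extraction, and the queue handling are placeholder assumptions, not the actual bot:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI;

    my $MAX_PER_DOMAIN = 500;   # arbitrary cap per host
    my %seen;                   # URLs already fetched
    my %pages_per_host;         # pages fetched per host

    my $ua    = LWP::UserAgent->new( agent => 'MyBot/0.1', timeout => 30 );
    my @queue = ('http://example.com/');

    while ( my $url = shift @queue ) {
        my $uri  = URI->new($url)->canonical;
        my $host = eval { $uri->host } or next;   # skip non-http-ish URLs

        next if $seen{$uri}++;                                 # skip exact repeats
        next if ++$pages_per_host{$host} > $MAX_PER_DOMAIN;    # per-domain cap

        my $resp = $ua->get($uri);
        next unless $resp->is_success;

        # ... extract links from $resp->decoded_content and push them onto @queue ...
    }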

Thanks

Re: Infinite loop prevention for spider
by Abigail-II (Bishop) on Nov 09, 2003 at 14:38 UTC
    This is impossible to determine from the client side. Suppose you are playing a text adventure, and you find yourself in a maze. All rooms have the same description. Just based on the description, you do not know whether you have been there before or not. And even if you remember all the pages, and say "if two pages have the same content, I consider them to be the same, even if the URLs differ", you can have a problem - for instance, the page may contain a counter or a timestamp, which makes the content different each time.

    You might be able to come up with some heuristics, but then you will have to accept that you will have false positives and false negatives. And make sure you check a site's robots.txt - that should prevent a spider from getting into a loop.
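    A quick sketch of that robots.txt check, using WWW::RobotRules together with LWP::Simple (the agent name and site here are placeholders):

        use strict;
        use warnings;
        use LWP::Simple qw(get);
        use WWW::RobotRules;

        # Placeholder agent name; use the bot's real User-Agent string.
        my $rules = WWW::RobotRules->new('MyBot/0.1');

        my $site       = 'http://www.example.com';
        my $robots_url = "$site/robots.txt";
        my $robots_txt = get($robots_url);
        $rules->parse( $robots_url, $robots_txt ) if defined $robots_txt;

        # Before fetching any URL from that site:
        my $url = "$site/some/page.html";
        print "Skipping $url (disallowed by robots.txt)\n"
            unless $rules->allowed($url);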

    Of course, your question has nothing to do with Perl. You'd have to solve the same problems if you'd used any other language.

    Abigail

      Yes, I thought of the possible non-link timestamp issue. My current bot strips out all the URLs before doing its various comparisons, but that might not be enough. I wonder what the typical way of dealing with this is.

      There is a new O'Reilly book out called Spidering Hacks. I hope I can find it in a book store near me (I'm not certain enough that it would be helpful to shell out the money sight unseen). And I hope people put the proper entries in their robots.txt files!

      Thanks

      The solution, then, is to start spidering with a large inventory of items (i.e., a shovel, perhaps some miscellaneous treasure). As you spider each page, drop one of your inventory items into that page. Then when you visit a page again, you can tell which one it is by which inventory item is there.

      Oh, and make sure your spider has a lantern, or else it is likely to be eaten by a grue...

        > As you spider each page, drop one of your inventory items into that page

        Yeah, I wish servers accepted cookies from clients!

•Re: Infinite loop prevention for spider
by merlyn (Sage) on Nov 09, 2003 at 16:04 UTC
    As an experiment, for a while I had a link at http://www.stonehenge.com that consisted solely of -/, and I put a symlink in the web directory linking "-" to ".". That meant you could address any page on my site with an arbitrary number of "/-" throwaways, such as "/-/-/-/-/merlyn/columns.html".

    I did this to see what kind of similar-duplicate rejection algorithms the big indexing spiders use. Most of them recognized rather quickly that the pages were duplicate pages, but NorthernLights had indexed about 20 levels deep of the same pages before I turned the link off. Bleh!

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.
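    One cheap, generic defense against this kind of trap (not something merlyn describes; the threshold of 3 is an arbitrary assumption) is to reject URLs whose paths repeat the same segment over and over:

        use strict;
        use warnings;
        use URI;

        # Reject URLs that repeat any single path segment more than $MAX_REPEATS
        # times -- a crude guard against traps like /-/-/-/-/merlyn/columns.html.
        my $MAX_REPEATS = 3;

        sub looks_like_path_trap {
            my ($url) = @_;
            my %count;
            for my $segment ( grep { length } URI->new($url)->path_segments ) {
                return 1 if ++$count{$segment} > $MAX_REPEATS;
            }
            return 0;
        }

        print "trap\n" if looks_like_path_trap('http://www.stonehenge.com/-/-/-/-/merlyn/columns.html');
        print "ok\n" unless looks_like_path_trap('http://www.stonehenge.com/merlyn/columns.html');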

Re: Infinite loop prevention for spider
by inman (Curate) on Nov 10, 2003 at 09:17 UTC
    This is a common problem and one to which there is no definite answer, only various suggested approaches. It is also the bane of my life, so I have a degree of sympathy for you. I have spent a lot of time indexing content from the web using a commercial search engine.

    Once upon a time the URL of a particular document could be treated as a unique(ish) identifier. Problems arise where you have documents that:

    1. contain the same content but have different URLs (e.g. this page will appear at the .org and .com perlmonks sites).
    2. are generated by a content management system that places some form of additional information into the URL, e.g. a session identifier.
    3. the author decided should tell you the time ('cos none of us have watches!) and therefore change content slightly every time that you load them.

    As I mentioned earlier, there is no easy answer to this question; the best you can do is look for evidence that the documents returned are the same, in an effort to detect loops during indexing. I would look for the following:

    1. The static part of the URL - Typically a document management / session management system will create a URL with a static part that allows it to identify the document and a dynamic part for session tracking. If you can identify the static part and use regexes to remove the dynamic part, then you can create and track a list of pages (see the sketch after this list).
    2. Look for an alternate piece of evidence, such as the title of the document or an internal ID generated as a Meta tag.
    3. Use CRC or a similar technique to discover documents that have the same content. This technique can be extended to discovering documents that are similar but have a tiny difference (e.g. just having a helpful 'the time is...' section).
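    A minimal sketch combining points 1 and 3 above (the session-parameter names and the 'the time is' pattern are only examples, and Digest::MD5 stands in for CRC):

        use strict;
        use warnings;
        use Digest::MD5 qw(md5_hex);

        # Point 1: strip the dynamic part of the URL with regexes.  The parameter
        # names (PHPSESSID, sessionid, sid) are examples; adjust them per site.
        sub normalize_url {
            my ($url) = @_;
            $url =~ s/#.*//;                                           # drop fragment
            $url =~ s/;jsessionid=[^?#]*//i;                           # path-style session id
            $url =~ s/\b(?:PHPSESSID|sessionid|sid)=[^&;]*[&;]?//gi;   # query-string session ids
            $url =~ s/[?&;]+$//;                                       # tidy trailing separators
            return $url;
        }

        # Point 3: fingerprint the content (raw bytes) after removing volatile
        # bits such as a "the time is ..." line.
        sub content_key {
            my ($html) = @_;
            $html =~ s/the time is[^<]*//gi;    # crude example of stripping volatile text
            return md5_hex($html);
        }

        my ( %seen_url, %seen_content );
        # Inside the crawl loop:
        #   next if $seen_url{ normalize_url($url) }++;
        #   next if $seen_content{ content_key($page_content) }++;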

    Of course the most important technique would be to use an existing Spider tool which has all of this built in! The following list of resources culled from my favourites may be of interest:

    Good Luck

    inman

      Thanks, I'll take a look at those sites. It's starting to sound like I had better put a limit on the number of pages that I spider per site, or else expect to wake up one morning to find a dizzy spider and a cease and desist email in my mailbox, if I'm lucky enough to have internet service at all.
Re: Infinite loop prevention for spider
by Corion (Pope) on Nov 10, 2003 at 09:25 UTC

    I saw a talk by the author of String::Trigram, and he mentioned that he used his module for a similar problem: determining whether a webpage had changed or not. If you tune your similarity threshold well enough, this could be another measure of "page similarity", or rather of "these two URLs are the same page".
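    A small sketch of that idea, assuming String::Trigram's functional compare() interface and an arbitrary similarity threshold of 0.9:

        use strict;
        use warnings;
        use String::Trigram;

        # Arbitrary threshold: treat two pages as "the same" above 90% trigram similarity.
        my $THRESHOLD = 0.9;

        sub same_page {
            my ( $content_a, $content_b ) = @_;
            # compare() returns a similarity score between 0 and 1.
            my $sim = String::Trigram::compare( $content_a, $content_b );
            return $sim >= $THRESHOLD;
        }

        # Usage: skip a freshly fetched page if it is near-identical to one already kept.
        # next if same_page( $new_content, $old_content );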

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
    $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
    ($c = $d->accept())->get_request(); $c->send_response( new #in the
    HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
