hacker has asked for the wisdom of the Perl Monks concerning the following question:

I'm building a complex web spider that will take content pointed to and build a binary representation of that data for display on a Palm handheld device.

During spidering, I'm trying to go through the links found, and skip any that happen to be "duplicate" links. Things like www.slashdot.org, which metas to slashdot.org (a 302), or www.cnn.com, which is the same content as cnn.com.

What's the easiest way to programmatically determine whether a link is a duplicate of another already stored in the hash of links extracted from the page, without incurring a hit to the site itself to CRC the content or issue a HEAD request (both of which have their own design flaws)? I don't want to retrieve the same content twice if one page links to www.foo.com and another page in the same session links to foo.com.

Is this possible? Some magic foo with the URI module? I'm already using URI to validate that the URL is properly formatted (and not file://etc/foo or ftp://foo and so on), but I'd like to eliminate any dupes at link-extraction time, even with a HEAD request, before I spider them with GET (though I'd like to avoid the double hit of HEAD then GET on the same links).

Note, www.foo.com and foo.com may not be the same content, so I can't just regex off the 'www.' from the front of "similar" URLs.
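For what it's worth, the URI module can at least take care of the purely syntactic duplicates (case differences, default ports, trailing dots) via its canonical method — a minimal sketch, which deliberately does NOT equate www.foo.com with foo.com, since as noted those may serve different content:

```perl
use strict;
use warnings;
use URI;

# Syntactic canonicalisation only: lowercases the scheme and host and
# strips the default port. It will NOT equate www.foo.com with foo.com.
sub canon {
    my ($url) = @_;
    return URI->new($url)->canonical->as_string;
}

my %seen;
for my $link ('HTTP://WWW.Foo.COM:80/index.html',
              'http://www.foo.com/index.html') {
    my $key = canon($link);
    next if $seen{$key}++;          # second link collapses onto the first
    print "new: $key\n";
}
```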

Has anyone done anything like this before?


•Re: Eliminating "duplicate" domains from a hash/array
by merlyn (Sage) on Mar 30, 2003 at 22:06 UTC
    You can't do it precisely programmatically. You have to determine that two pages are "close enough" when you hit them. I know Google knows to do that, but I've run into other web walkers that don't.

    I found that out by putting a link on my webserver root page to -. And I symlinked "-" to "." in my root doc directory. So any page on my website was accessible by any number of /-/-/-/-/- prefix chars before the real URL. Google immediately figured it out, but I had other webcrawlers visiting (and indexing!) my entire web site some 15 or 20 times deep before giving up.

    If you are spidering your own site, you can add code in your spider to canonicalize your URLs before fetching. I did that in a few of my columns.
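A sketch of what that canonicalisation might look like for the symlink trap described above — the site name and the "/-" aliasing rule are just this example's assumptions; the point is that you can only write this rewrite because you know your own site's aliasing:

```perl
use strict;
use warnings;
use URI;

# Sketch: site-specific canonicalisation before fetching, assuming you
# know your own aliasing rules -- here, that any number of leading "/-"
# segments map back to the document root (the symlink trap above).
sub canonical_path {
    my ($url) = @_;
    my $u    = URI->new($url)->canonical;
    my $path = $u->path;
    $path =~ s{^(?:/-)+}{};          # strip the /-/-/- prefixes
    $path = '/' unless length $path;
    $u->path($path);
    return $u->as_string;
}

print canonical_path('http://www.example.com/-/-/perltraining/'), "\n";
# http://www.example.com/perltraining/
```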

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      For the case of the '-' symlinked to '.', that is clearly a problem that can be resolved precisely and programmatically. The fact that Google can resolve it shows that it is resolvable; the fact that others cannot deliver the same thing only means their programs are not smart enough.

      We have to clearly identify what is logically doable and what is not. Something that nobody handles, or that somebody handles badly, is not necessarily logically unresolvable.

      The real difficulty of comparing URLs has nothing to do with this kind of small trick, which is obviously resolvable both logically and programmatically.

      The real problem is that the solution to this kind of issue depends largely on the internal structure of each particular site, which is not regulated by any standard and can differ greatly from site to site.

      We have to realize/remember that no search engine is just a set of programs; it is a set of programs plus MANUALLY MAINTAINED INFO. Without that manually maintained info, there would be no Google or any other search engine.
      I'm really curious, why were you doing this?


Re: Eliminating "duplicate" domains from a hash/array
by Anonymous Monk on Mar 30, 2003 at 22:34 UTC

    As any given URL could be a CGI script or an HTML page that uses server-side includes, there is no way to guarantee that even fetching the same URL twice within any given timeframe will return identical content.

    Any mechanism for determining whether the results of different URLs are the same will have to rely on fetching them and comparing the results. This might allow some optimisation in storage, by having the two URLs point to the same data, but determining it in advance is just not possible.

    Even storing the data offline is fraught with problems, in that there is no guarantee that the content of an entirely static page will not be updated one day/hour/minute/second/microsecond after you captured and stored it.

Re: Eliminating "duplicate" domains from a hash/array
by aquarium (Curate) on Mar 31, 2003 at 05:08 UTC
    Use one of the DNS modules to look up the A records. This will give you IP addresses which you can check against each other. Alas, this is far from foolproof: load sharing www servers via DNS will give you different IPs each time you look up, and a single server can also serve many sites, all of which will have the same IP (Apache name-based virtual hosts). One step better is HEAD information, but that is still not foolproof. Ultimately, only full content comparison is 100% proof of content similarity or otherwise.

    Chris
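    A minimal sketch of the A-record comparison with Net::DNS — the hostnames in the commented usage are placeholders, and as noted above a match is only a hint, not proof:

```perl
use strict;
use warnings;
use Net::DNS;

# Sketch: compare the A records of two hostnames. Not foolproof --
# round-robin DNS may return different IPs per query, and name-based
# virtual hosts share one IP across many unrelated sites.
sub a_records {
    my ($host) = @_;
    my $reply = Net::DNS::Resolver->new->query($host, 'A') or return;
    return map { $_->address } grep { $_->type eq 'A' } $reply->answer;
}

# Pure comparison helper: same non-empty set of addresses, order-blind.
sub same_ips {
    my ($x, $y) = @_;
    return @$x && join(',', sort @$x) eq join(',', sort @$y);
}

# usage (network required):
# print "possibly the same host\n"
#     if same_ips([a_records('www.example.com')], [a_records('example.com')]);
```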
      The problem with full-content-comparison, and something I'm trying to avoid, is that for sites like freshmeat.net, their main page is 142,178 bytes in length (a few minutes ago, it changes frequently). Having to fetch that same content potentially multiple times, then compare, would be extremely slow on slower client connections, especially if I'm going to discard it anyway as a duplicate.

      Also, if my spider takes an hour to fetch a series of links from a site, and the first link is freshmeat.net and the last link in the fetch (an hour later) is www.freshmeat.net which now has a few extra things added to the top of the page since the fetch began (as they always do), the content will be different, but there is no need to fetch it again, since I already have "mostly" current content in the first link I fetched during this session.

      I realize HEAD information is also not the best approach, because:

      • Not all servers support HEAD
      • Every time you HEAD a site, you'll get a different Client-Date, which will change your comparison
      • Multiple servers can serve the same content (ala google.com, which currently shows up with two hosts, while something like crawler1.googlebot.com reports 30 separate IP addresses).
      • HEAD incurs a "double-hit" to the site, if the content is valid, I'd like to avoid the HEAD then GET on the same site, or for each link found.
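      If HEAD were used despite these drawbacks, the volatile fields would at least have to be masked before comparing. A sketch over HTTP::Headers objects — the particular fields masked here (Client-* headers added locally by LWP, Date, Set-Cookie) are this example's assumption:

```perl
use strict;
use warnings;
use HTTP::Headers;

# Sketch: reduce a header set to a comparable fingerprint by dropping
# fields that change on every request even when the content does not.
sub header_fingerprint {
    my ($h) = @_;
    my @keep = grep { !/^(?:Client-|Date$|Set-Cookie$)/i }
               sort $h->header_field_names;
    return join "\n", map { "$_: " . $h->header($_) } @keep;
}

# Two responses for identical content, fetched a day apart:
my $h1 = HTTP::Headers->new(
    'Content-Length' => 142178,
    'Date'           => 'Sun, 30 Mar 2003 22:34:00 GMT',
    'Client-Date'    => 'Sun, 30 Mar 2003 22:34:01 GMT',
);
my $h2 = HTTP::Headers->new(
    'Content-Length' => 142178,
    'Date'           => 'Mon, 31 Mar 2003 22:34:00 GMT',
);
print "headers match\n" if header_fingerprint($h1) eq header_fingerprint($h2);
```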

      It's definitely a tricky subject, but I'm sure there are some ways to avoid doing this.

      One approach I thought of while I was sleeping last night was to maintain a small Berkeley DBM (or flat file) of hosts and the potential "duplicate" URIs they are known to come from, keep it current on the client side, and check it each time I start the spider up to crawl new content.
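      That persistent alias map might look like this with DB_File — the filename and the freshmeat.net mapping are placeholders, and the expensive verification that populates the map happens elsewhere, once per pair:

```perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_CREAT O_RDWR);

# Sketch: a persistent host-alias map, assuming DB_File is available.
# Keys are hostnames; values are the canonical host already fetched.
tie my %alias, 'DB_File', 'host_aliases.db', O_CREAT | O_RDWR, 0644, $DB_HASH
    or die "cannot tie host_aliases.db: $!";

sub canonical_host {
    my ($host) = @_;
    return exists $alias{$host} ? $alias{$host} : $host;
}

# Recorded once, after verifying (the expensive way) that they match:
$alias{'www.freshmeat.net'} = 'freshmeat.net';

print canonical_host('www.freshmeat.net'), "\n";   # freshmeat.net
print canonical_host('perlmonks.org'),     "\n";   # perlmonks.org (no alias)
```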

Re: Eliminating "duplicate" domains from a hash/array
by thpfft (Chaplain) on Mar 31, 2003 at 14:48 UTC

    Perhaps the answer isn't so complicated. It would be crazy to do a full-content comparison across your whole database, but it will probably be sufficient to single out a few special cases and likely aliases and just test for those. in the case you mention - www.foo.com eq foo.com - it would be very easy to look out for just that pair and checksum the two pages to make sure they differ.
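    a sketch of that one-pair check with Digest::MD5 — the foo.com URLs are placeholders, and the fetch is left commented since it needs the network:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use LWP::Simple qw(get);

# Sketch: for the one likely alias pair, fetch each side once and
# compare checksums; on a match, record the alias so neither is
# fetched twice again.
sub same_content {
    my ($url_a, $url_b) = @_;
    my $a = get($url_a);
    my $b = get($url_b);
    return defined $a && defined $b && md5_hex($a) eq md5_hex($b);
}

# usage (network required):
# my %alias;
# $alias{'www.foo.com'} = 'foo.com'
#     if same_content('http://foo.com/', 'http://www.foo.com/');
```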

    incidentally, since you're doing something clever and complicated to the content of each item later on anyway, perhaps you'd be better off just grabbing everything and deferring redundancy checks until you start munching it up?

    in other words, my .02p says this situation requires laziness, not hubris.