hacker has asked for the wisdom of the Perl Monks concerning the following question:
During spidering, I'm trying to go through the links found and skip any that happen to be "duplicate" links: things like www.slashdot.org, which redirects to slashdot.org (a 302), or www.cnn.com, which serves the same content as cnn.com.
What's the easiest way to programmatically determine whether a link is a duplicate of one already stored in the hash of links extracted from the page, without incurring a hit to the site itself to CRC the content or issue a HEAD request (both of which have their own design flaws)? I don't want to retrieve the same content twice if one page links to www.foo.com and another page in the same session links to foo.com.
Is this possible? Some magic foo with the URI module? I'm already using URI to validate that each URL is properly formatted (and not file://etc/foo or ftp://foo and so on), but I'd like to eliminate any dupes at link-extraction time, even if that means a HEAD request, before I spider them with GET (though I'd rather avoid the double hit of a HEAD followed by a GET on the same links).
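One way to catch the trivial duplicates without touching the network is to key the seen-hash on URI's canonical form, which lowercases the scheme and host, drops default ports, and normalizes escapes. A minimal sketch, assuming each extracted link is resolved against the base URL of the page it was found on; queue_link and %seen are illustrative names only:

    use strict;
    use warnings;
    use URI;

    my %seen;    # canonical URL string => 1 once queued

    sub queue_link {
        my ($link, $base) = @_;

        # Resolve relative links against the page they came from,
        # then let URI normalize case, default ports, and escapes.
        my $uri = URI->new_abs($link, $base)->canonical;

        # Only follow http links; adjust the filter to taste.
        return unless $uri->scheme && $uri->scheme eq 'http';

        # A fragment never changes what the server sends back.
        $uri->fragment(undef);

        return if $seen{"$uri"}++;    # duplicate under canonical form
        return $uri;                  # caller pushes this onto the queue
    }

Note that canonicalization alone still treats www.foo.com and foo.com as distinct hosts, which is the safe default given the next point.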
Note that www.foo.com and foo.com may not serve the same content, so I can't just regex off the 'www.' from the front of "similar" URLs.
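Since only the server knows whether two hostnames are aliases, the cheapest network-based compromise is a single HEAD per distinct host (not per link), recording which host its redirects finally land on. A hedged sketch, assuming LWP::UserAgent with its default redirect handling; resolved_host and %host_alias are made-up names:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new( timeout => 10 );
    my %host_alias;    # raw hostname => hostname after redirects

    # Issue at most one HEAD per distinct host and remember where its
    # redirects (301/302) end up, so www.slashdot.org and slashdot.org
    # collapse to the same key when the server redirects one to the other.
    sub resolved_host {
        my ($uri) = @_;
        my $host = $uri->host;

        unless ( exists $host_alias{$host} ) {
            my $res = $ua->head($uri);
            $host_alias{$host} = $res->is_success
                ? $res->request->uri->host    # final URI after redirects
                : $host;                      # on error, keep the raw host
        }
        return $host_alias{$host};
    }

Links whose resolved host and canonical path match one already seen could then be skipped without ever issuing a GET for them.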
Has anyone done anything like this before?
Replies are listed 'Best First'.
•Re: Eliminating "duplicate" domains from a hash/array
  by merlyn (Sage) on Mar 30, 2003 at 22:06 UTC
    by pg (Canon) on Mar 31, 2003 at 02:59 UTC
    by bsb (Priest) on Mar 31, 2003 at 11:21 UTC
    by merlyn (Sage) on Mar 31, 2003 at 15:41 UTC
Re: Eliminating "duplicate" domains from a hash/array
  by Anonymous Monk on Mar 30, 2003 at 22:34 UTC
Re: Eliminating "duplicate" domains from a hash/array
  by aquarium (Curate) on Mar 31, 2003 at 05:08 UTC
    by hacker (Priest) on Mar 31, 2003 at 12:02 UTC
Re: Eliminating "duplicate" domains from a hash/array
  by thpfft (Chaplain) on Mar 31, 2003 at 14:48 UTC