I'm building a complex web spider that fetches the content links point to and builds a binary representation of that data for display on a Palm handheld device.
During spidering, I'm trying to go through the links found and skip any that happen to be "duplicate" links: things like www.slashdot.org, which redirects to slashdot.org (a 302), or www.cnn.com, which serves the same content as cnn.com.
What's the easiest way to programmatically determine whether a link is a duplicate of another one already stored in the hash of links extracted from the page, without incurring a hit to the site itself to CRC the content or send a HEAD request (both of which have their own design flaws)? I don't want to retrieve the same content twice if one page links to www.foo.com and another page in the same session links to foo.com.
Is this possible? Some magic foo with the URI module? I'm already using URI to validate that the URL is indeed properly formatted (and not file://etc/foo or ftp://foo and so on), but I'd like to eliminate any dupes at link-extraction time, before I spider them with GET. I'd accept a HEAD request for that if there's no other way, though I'd much rather avoid the double hit of HEAD then GET on the same links.
Note that www.foo.com and foo.com may not serve the same content, so I can't just regex the 'www.' off the front of "similar" URLs.
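
To make the problem concrete, here is a rough sketch of the part I can already do without touching the network: purely syntactic normalization before the "seen" lookup. It's written in Python rather than Perl just for illustration, and the names `canonical_key` and `extract_new` are made up for this post. It lowercases the scheme and host, drops the default port and the fragment, and treats an empty path as "/", but it deliberately leaves 'www.' alone, so it does nothing about the www.foo.com vs. foo.com case I'm actually asking about.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: only http and https links are worth spidering.
DEFAULT_PORTS = {"http": 80, "https": 443}

def canonical_key(url):
    """Return a normalized string to use as the 'seen' key, or None to skip the link."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    try:
        port = parts.port          # raises ValueError on a malformed port
    except ValueError:
        return None
    host = parts.hostname          # already lowercased, brackets stripped for IPv6
    if scheme not in DEFAULT_PORTS or not host:
        return None                # file:, ftp:, mailto:, or no host at all
    if ":" in host:
        host = f"[{host}]"         # restore brackets around an IPv6 literal
    if port is not None and port != DEFAULT_PORTS[scheme]:
        host = f"{host}:{port}"    # keep only non-default ports
    path = parts.path or "/"       # "http://foo.com" and "http://foo.com/" are the same
    # The fragment is dropped: it never reaches the server.
    return urlunsplit((scheme, host, path, parts.query, ""))

def extract_new(links, seen):
    """Return the links whose normalized key is not in `seen`, updating `seen` in place."""
    fresh = []
    for link in links:
        key = canonical_key(link)
        if key is None or key in seen:
            continue
        seen.add(key)
        fresh.append(link)
    return fresh
```

With this, "http://WWW.FOO.COM/#x" collapses onto "http://www.foo.com", but "http://foo.com/" still counts as a separate link, which is exactly the gap I'd like to close without a HEAD or GET.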
Has anyone done anything like this before?