PerlMonks  
I'm building a complex web spider that fetches the content a set of links points to and builds a binary representation of that data for display on a Palm handheld device.

During spidering, I'm trying to go through the links found and skip any that happen to be "duplicate" links: things like www.slashdot.org, which redirects to slashdot.org (a 302), or www.cnn.com, which serves the same content as cnn.com.

What's the easiest way to programmatically determine whether a link is a duplicate of one already stored in the hash of links extracted from the page, without hitting the site itself to CRC the content or send a HEAD request (both of which have their own design flaws)? I don't want to retrieve the same content twice if one page links to www.foo.com and another page in the same session links to foo.com.

Is this possible? Some magic foo with the URI module? I'm already using URI to validate that the URL is properly formatted (and not file://etc/foo or ftp://foo and so on), but I'd like to eliminate any dupes at link-extraction time, even if that takes a HEAD request, before I spider them with GET (though I'd prefer to avoid the double hit of HEAD followed by GET on the same links).

Note that www.foo.com and foo.com may not be the same content, so I can't just regex the 'www.' off the front of "similar" URLs.
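
For reference, here is a minimal sketch of the kind of extraction-time filtering described above, assuming the URI module from CPAN is installed. The helper name queue_link, the %seen hash, and the example URLs are made up for illustration. It only rejects non-HTTP schemes and collapses links that differ in hostname case, default port, or fragment; it deliberately treats www.foo.com and foo.com as different, since they may not serve the same content, so it does not solve the actual question.

```perl
use strict;
use warnings;
use URI;

my %seen;    # canonical URL => count, for every link already queued

# Hypothetical helper: returns the canonical form of a link if it is a
# new http/https URL, or undef if it should be skipped.
sub queue_link {
    my ($link, $base) = @_;

    # Resolve relative links against the page they were found on.
    my $uri    = URI->new_abs($link, $base);
    my $scheme = $uri->scheme;

    # Skip file://, ftp://, mailto: and anything else that is not HTTP(S).
    return undef unless defined $scheme && $scheme =~ /^https?$/;

    # A fragment never changes which document is fetched.
    $uri->fragment(undef);

    # canonical() lowercases the host and drops the default port.
    my $key = $uri->canonical->as_string;

    return undef if $seen{$key}++;    # already queued
    return $key;
}

my $base = 'http://www.foo.com/index.html';
for my $link (
    'http://WWW.Foo.com:80/a.html',     # queued
    'http://www.foo.com/a.html#top',    # skipped: same document as above
    'ftp://foo.com/file',               # skipped: not http/https
    'http://foo.com/a.html',            # queued: host differs, not merged
) {
    my $key = queue_link($link, $base);
    print defined $key ? "queue $key\n" : "skip  $link\n";
}
```

This costs no network traffic, but it only catches syntactic duplicates. Deciding that two different hostnames serve the same content is the part I still don't know how to do without contacting the site.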

Has anyone done anything like this before?


In reply to Eliminating "duplicate" domains from a hash/array by hacker
