PerlMonks  

The problem with full-content comparison, and something I'm trying to avoid, is the cost: for a site like freshmeat.net, the main page was 142,178 bytes when I checked a few minutes ago (it changes frequently). Fetching that much content, potentially multiple times, just to compare it would be extremely slow on slower client connections, especially if I'm going to discard it as a duplicate anyway.

Also, suppose my spider takes an hour to fetch a series of links from a site, where the first link is freshmeat.net and the last link (fetched an hour later) is www.freshmeat.net. By then a few extra items will have been added to the top of the page, as they always are, so the content will differ. There is still no need to fetch it again, since the first fetch of this session already gave me "mostly" current content.
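
A rough sketch of that per-session check, in Python purely for illustration. The names `session_key` and `SessionSeen` are hypothetical, and stripping a leading "www." is only a heuristic assumption: it happens to cover the freshmeat.net case above, but it is not a general rule for detecting mirrored hosts.

```python
from urllib.parse import urlsplit


def session_key(url):
    """Build a comparison key that treats 'www.example.com' and
    'example.com' as the same host (heuristic assumption)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # An empty path and "/" refer to the same resource.
    path = parts.path or "/"
    return host, path, parts.query


class SessionSeen:
    """Remembers what has been fetched during one spider run."""

    def __init__(self):
        self._seen = set()

    def should_fetch(self, url):
        """Return True the first time a key is seen, False afterwards."""
        key = session_key(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

With this in place, a link to www.freshmeat.net found at the end of the run is skipped if freshmeat.net was fetched at the start, without downloading or comparing any content.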

I realize HEAD information is also not the best approach, because:

  • Not all servers support HEAD
  • Every time you HEAD a site, you'll get a different Client-Date, which throws off the comparison
  • Multiple servers can serve the same content (à la google.com, which currently shows up with two hosts, while something like crawler1.googlebot.com reports 30 separate IP addresses)
  • HEAD incurs a "double-hit" on the site: if the content is valid, I'd like to avoid a HEAD followed by a GET on the same site, or for each link found
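
If HEAD results are compared anyway, the Client-Date problem in the list above can be sidestepped by leaving the fields that change on every request out of the comparison. The following is a minimal Python sketch; `header_fingerprint` and the `VOLATILE` set are hypothetical, and which headers count as volatile is an assumption that would need tuning per crawl. A matching fingerprint is only a hint that two hosts serve the same page, not proof.

```python
import hashlib

# Headers assumed to vary between otherwise identical responses.
# The Client-* entries are fields added on the client side by LWP.
VOLATILE = {
    "date", "client-date", "client-peer", "client-response-num",
    "age", "expires", "set-cookie",
}


def header_fingerprint(headers):
    """Hash the stable subset of a header mapping.

    `headers` is a plain dict of header name -> value, however
    the HEAD response was obtained.
    """
    stable = sorted(
        (name.lower(), value.strip())
        for name, value in headers.items()
        if name.lower() not in VOLATILE
    )
    digest = hashlib.sha1()
    for name, value in stable:
        digest.update(f"{name}: {value}\n".encode("utf-8"))
    return digest.hexdigest()
```

This does nothing about the other three points: servers that reject HEAD still need a fallback, and the extra request per link is still made.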

It's definitely a tricky subject, but I'm sure there are some ways to avoid doing this.

One idea that came to me while I was sleeping last night is to maintain a small Berkeley dbm (or flat file) on the client side that maps hosts to the potential "duplicate" URIs known to come from them, keep it current, and check it each time I start the spider up to crawl new content.
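
One way such a client-side store might look, sketched with Python's standard `dbm` module standing in for a Berkeley dbm file. The function names and the alias-to-canonical-host layout are assumptions made for illustration, not a worked-out design; how aliases get discovered in the first place is left open.

```python
import dbm


def remember_alias(db_path, alias_host, canonical_host_name):
    """Record that alias_host is known to serve the same content
    as canonical_host_name."""
    with dbm.open(db_path, "c") as db:
        db[alias_host.lower()] = canonical_host_name.lower()


def canonical_host(db_path, host):
    """Return the stored canonical host for `host`, or `host`
    itself (lowercased) when no alias is on record."""
    host = host.lower()
    with dbm.open(db_path, "c") as db:
        value = db.get(host)  # dbm stores values as bytes
    return value.decode("utf-8") if value is not None else host
```

At spider start-up, each queued URI's host would be passed through `canonical_host` before the queue is deduplicated, so hosts already known to be duplicates cost no network traffic at all.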


In reply to Re: Eliminating "duplicate" domains from a hash/array by hacker
in thread Eliminating "duplicate" domains from a hash/array by hacker
