PerlMonks  

Re: Using text files to remove duplicates in a web crawler

by Stevie-O (Friar)
on Jul 07, 2004 at 07:40 UTC (#372303)


in reply to Using text files to remove duplicates in a web crawler

If you're going to be processing many thousands of URLs (25K seems like it satisfies this condition), you may want to try something called a Bloom Filter, for which a sample module already exists on CPAN (Bloom::Filter).

A Bloom Filter is a simplified hashtable that can test only presence/absence (unlike a full-blown hashtable, which can associate arbitrary data, such as a scalar, with each entry). In exchange, it can encode this information in FAR less space than a normal hashtable.
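To make the idea concrete, here is a minimal sketch of the data structure in Python. It is not how Bloom::Filter is implemented; the class name, the SHA-256-based double hashing, the default sizes, and the `crawl_order` helper are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array plus k bit positions per key."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Assumption: derive all k positions from one SHA-256 digest
        # using double hashing (h1 + i * h2), a common shortcut.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        # Set every bit the key maps to.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # Present only if every mapped bit is set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def crawl_order(urls, num_bits=1_000_000, num_hashes=7):
    """Hypothetical helper: drop URLs the filter reports as already seen."""
    seen = BloomFilter(num_bits, num_hashes)
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

Note that the filter stores only bits, never the URLs themselves, which is where the space saving comes from. It is also why you cannot list or delete what you have added.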

Be warned! Bloom Filters are not perfect: they have a small but nonzero rate of false positives, meaning the filter will occasionally report a URL as already seen when it was never added (so a crawler would skip it). A standard Bloom Filter never gives false negatives: a key you have added is always reported as present. By increasing the amount of storage per entry (decreasing the effective compression), however, the false positive rate can be reduced to quite reasonable proportions. As this article indicates, you can encode up to 1 *million* URLs and have a 1-in-1-million false positive rate while storing only 6 bytes per key. That's LESS than the 'www.' and '.com' you're almost guaranteed to find in every URL (8 bytes).
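If you want to see how storage trades off against error rate, the usual textbook approximation for an optimally tuned Bloom filter can be worked out in a few lines of Python. The function names here are my own, and the formula assumes ideal, independent hash functions, so treat the numbers as a lower bound rather than what any particular module will achieve.

```python
import math


def bits_per_key(false_positive_rate):
    """Bits needed per stored key when the hash count is chosen optimally."""
    return -math.log(false_positive_rate) / (math.log(2) ** 2)


def optimal_hashes(bits_per_key_value):
    """Number of hash functions that minimises false positives for that budget."""
    return bits_per_key_value * math.log(2)


# Print the storage cost for a few target false positive rates.
for p in (1e-2, 1e-4, 1e-6):
    b = bits_per_key(p)
    print(f"p={p:g}: {b:.1f} bits/key ({b / 8:.1f} bytes), k ~ {optimal_hashes(b):.0f}")
```

For a 1-in-1-million rate this works out to roughly 29 bits, or a bit under 4 bytes per key, so the 6 bytes per key quoted above leaves comfortable headroom for less-than-ideal hashing.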

--Stevie-O
$"=$,,$_=q>|\p4<6 8p<M/_|<('=> .q>.<4-KI<l|2$<6%s!<qn#F<>;$, .=pack'N*',"@{[unpack'C*',$_] }"for split/</;$_=$,,y[A-Z a-z] {}cd;print lc

