Re: Using text files to remove duplicates in a web crawler

by Stevie-O (Friar)
on Jul 07, 2004 at 07:40 UTC

in reply to Using text files to remove duplicates in a web crawler

If you're going to be processing many thousands of URLs (25K seems like it satisfies this condition), you may want to try something called a Bloom Filter, for which a sample module already exists on CPAN (Bloom::Filter).

A Bloom Filter is a simplified hashtable that can test only presence/absence (unlike a full-blown hashtable, which can associate arbitrary data, such as a scalar, with each entry). In exchange, it can encode this information in FAR less space than a normal hashtable.
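To make the presence/absence idea concrete, here is a minimal hand-rolled sketch of a Bloom filter in plain Perl. It is not the code of the CPAN module; the names bloom_add and bloom_check, the 65536-bit size, and the trick of slicing one MD5 digest into four 32-bit hashes are all assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Illustrative parameters (assumptions, not tuned): 65536 bits, 4 hashes.
my $NUM_BITS = 65536;
my $bits     = "\0" x ($NUM_BITS / 8);

# Turn a key into 4 bit positions by reading the 16-byte MD5 digest
# as four big-endian 32-bit integers.
sub positions {
    my ($key) = @_;
    return map { $_ % $NUM_BITS } unpack 'N4', md5($key);
}

# Record a key: set every bit it maps to.
sub bloom_add {
    my ($key) = @_;
    vec($bits, $_, 1) = 1 for positions($key);
    return;
}

# Test a key: it may have been seen only if all of its bits are set.
sub bloom_check {
    my ($key) = @_;
    for my $pos (positions($key)) {
        return 0 unless vec($bits, $pos, 1);
    }
    return 1;
}

# Deduplicate a crawl queue (the URLs are made up for the example).
my @queue = qw(
    http://example.com/a
    http://example.com/b
    http://example.com/a
);
my @fresh;
for my $url (@queue) {
    next if bloom_check($url);   # probably seen already, skip it
    bloom_add($url);
    push @fresh, $url;
}
print "$_\n" for @fresh;
```

Note that the filter stores only bits, never the URLs themselves, which is where the space saving comes from. It also means you cannot list or delete entries.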

Be warned! Bloom Filters are not perfect; they have a small but nonzero rate of false positives: a key that was never added can occasionally be reported as present. (They never produce false negatives; a key that was added is always found.) For a crawler, that means an unseen URL will once in a while be skipped as a duplicate. By increasing the amount of storage per entry (decreasing the effective compression), however, the error rate can be reduced to quite reasonable proportions. As this article indicates, you can encode up to 1 *million* URLs and have a 1-in-1-million false positive rate while storing only 6 bytes per key. That's LESS than the 'www.' and '.com' you're almost guaranteed to find in every URL (8 bytes).
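The space/error trade-off can be estimated with the standard formulas for an optimally tuned Bloom filter: bits per key = -ln(p) / (ln 2)^2, and number of hash functions = bits per key * ln 2, where p is the target false positive rate. The sketch below just evaluates those formulas; the function name bloom_size is made up for the example. For p of one in a million it comes out to roughly 29 bits per key, so a budget of 6 bytes per key leaves comfortable headroom.

```perl
use strict;
use warnings;

# Estimate storage for an optimally tuned Bloom filter.
# $capacity   - number of keys you expect to add
# $error_rate - target false positive probability (0 < p < 1)
sub bloom_size {
    my ($capacity, $error_rate) = @_;
    my $bits_per_key = -log($error_rate) / (log(2) ** 2);
    my $num_hashes   = $bits_per_key * log(2);
    my $total_bytes  = $capacity * $bits_per_key / 8;
    return ($bits_per_key, $num_hashes, $total_bytes);
}

my ($bpk, $k, $bytes) = bloom_size(1_000_000, 1e-6);
printf "%.1f bits (%.1f bytes) per key, %.0f hash functions, %.1f MB total\n",
    $bpk, $bpk / 8, $k, $bytes / 1e6;
```

Loosening the error rate shrinks the filter quickly: each factor of 10 you give up in p saves a little under 5 bits per key.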

$"=$,,$_=q>|\p4<6 8p<M/_|<('=> .q>.<4-KI<l|2$<6%s!<qn#F<>;$, .=pack'N*',"@{[unpack'C*',$_] }"for split/</;$_=$,,y[A-Z a-z] {}cd;print lc
