
Re: Using text files to remove duplicates in a web crawler

by Stevie-O (Friar)
on Jul 07, 2004 at 07:40 UTC (#372303)

in reply to Using text files to remove duplicates in a web crawler

If you're going to be processing many thousands of URLs (25K seems like it satisfies this condition), you may want to try something called a Bloom Filter, for which a sample module already exists on CPAN (Bloom::Filter).

A Bloom Filter is a simplified hashtable that can test only presence/absence (unlike a full-blown hashtable, which can associate arbitrary data, such as a scalar, with each entry). In exchange, it can encode this information in FAR less space than a normal hashtable.
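To make the presence/absence idea concrete, here is a minimal sketch in Python. It is not the Bloom::Filter module's API or implementation; the class name BloomFilter, the helper unseen_urls, the choice of SHA-256, and the double-hashing trick for deriving bit positions are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Set-membership filter: answers 'possibly seen' or 'definitely not seen'."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, key):
        # Derive num_hashes bit positions from one SHA-256 digest using
        # double hashing: position_i = (h1 + i * h2) mod num_bits.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, so never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        # Set every bit position derived from the key.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # Present only if every derived bit is set; one clear bit proves absence.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def unseen_urls(urls, seen):
    """Yield each URL the filter has not seen yet, recording it as we go."""
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url
```

A crawler could wrap its queue as list(unseen_urls(queue, BloomFilter(num_bits=1_000_000, num_hashes=7))). Note that the filter stores only bits, never the URLs themselves, which is where the space saving comes from.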

Be warned! Bloom Filters are not perfect: they have a very small, but nonzero, rate of false positives, meaning the filter can occasionally claim to have seen a URL it has never seen (in a crawler, that URL would be skipped). They do not produce false negatives; a URL that was added is always reported as present. By increasing the amount of storage per entry (decreasing the effective compression), however, the false positive rate can be reduced to quite reasonable proportions. As this article indicates, you can encode up to 1 *million* URLs and have a 1-in-1-million false positive rate while storing only 6 bytes per key. That's LESS than the 'www.' and '.com' you're almost guaranteed to find in every URL (8 bytes).
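The storage-versus-error trade-off can be estimated with the standard Bloom filter approximations: bits per key = -ln(p) / (ln 2)^2, and number of hash functions = bits per key * ln 2. The sketch below is Python, and the function name bloom_parameters is made up for illustration; the formulas are textbook approximations, not figures taken from the article mentioned above.

```python
import math


def bloom_parameters(num_keys, false_positive_rate):
    """Return (total_bits, num_hashes) from the standard approximations."""
    # Optimal bits per key for the target false positive rate.
    bits_per_key = -math.log(false_positive_rate) / (math.log(2) ** 2)
    total_bits = math.ceil(num_keys * bits_per_key)
    # Optimal number of hash functions for that many bits per key.
    num_hashes = max(1, round(bits_per_key * math.log(2)))
    return total_bits, num_hashes


total_bits, num_hashes = bloom_parameters(1_000_000, 1e-6)
bytes_per_key = total_bits / 8 / 1_000_000
print(f"{total_bits} bits, {num_hashes} hashes, {bytes_per_key:.2f} bytes per key")
```

Under these approximations, 1 million keys at a 1-in-1-million false positive rate need roughly 29 bits (about 3.6 bytes) per key with 20 hash functions, so a budget of 6 bytes per key leaves comfortable headroom.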

$"=$,,$_=q>|\p4<6 8p<M/_|<('=> .q>.<4-KI<l|2$<6%s!<qn#F<>;$, .=pack'N*',"@{[unpack'C*',$_] }"for split/</;$_=$,,y[A-Z a-z] {}cd;print lc
