Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re: Hashing urls with Adler32

by BrowserUk (Patriarch)
on May 31, 2007 at 18:17 UTC ( [id://618552]=note: print w/replies, xml ) Need Help??


in reply to Hashing urls with Adler32

Looking up Adler32 brought the following to light:

Weakness

Jonathan Stone discovered in 2001 that Adler-32 has a weakness for very short messages. He wrote "Briefly, the problem is that, for very short packets, Adler32 is guaranteed to give poor coverage of the available bits. Don't take my word for it, ask Mark Adler. :-)" The problem is that sum A does not wrap for short messages. The maximum value of A for a 128-byte message is 32640, which is below the value 65521 used by the modulo operation. An extended explanation can be found in RFC 3309, which mandates the use of CRC32 instead of Adler-32 for SCTP, the Stream Control Transmission Protocol.

As most urls are considerably shorter than 128 bytes, that suggests that Adler32 is not a good choice for your application and that you would better consider the CRC32 algorithm.

Whichever algorithm you use there is going to be some chance of false positives. However, you can reduce these chances by a considerable margin by using two different hashing algorithms. If you were to calculate and store both the CRC32 and the Alder32 for each url, the odds of a simultaneous collision for both hashes for any given pair of urls is vastly reduced.

Of course that means storing twice as much information which is a part of your original problem. However, there is a way of storing both sets of hash data such that it requires minimal memory (10kb or so) whilst giving almost the same lookup performance (15 microsecs/lookup compared to 5 microsecs) as Perl's hashes.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://618552]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others learning in the Monastery: (3)
As of 2024-04-25 06:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found