by isync (Hermit)
 on May 31, 2007 at 14:27 UTC Need Help??

isync has asked for the wisdom of the Perl Monks concerning the following question:

Hi there!

Are there any caveats in hashing http://www.somesite.com/ urls with with the adler-32 checksum algorithm?

Can I expect a unique hash for each unique url - in other words: will adler32 generate checksums that correspond to just this specific url or will it (in few cases) generate the same checksum for two different urls?

Thanks!

Replies are listed 'Best First'.
by moritz (Cardinal) on May 31, 2007 at 14:39 UTC
The more URLs you hash, the higher the probability for a collision is.

This is true for all hashing algorithm.

I personally would take the (relative small) risk for less than 10^4 hashes, with more hashes I'd rely on something else or implement a scheme that makes hash collisions non-fatal if I had to use hashes.

by Tomte (Priest) on May 31, 2007 at 14:40 UTC

It produces an output of fixed length for an input of arbitrary length. So there is an infinit set of possible inputs mapped to a finit set of checksums - so the algorithm can't produce a unique checksum for every url you feed it. I suggest you read the article on wikipedia and then have a look at SHA1 as possibly the better solution - that nonetheless will not be bijective either (It will produce collisions!) - it depends on your problem at hand if this is a hindrance.

regards,
tomte

An intellectual is someone whose mind watches itself.
-- Albert Camus

Currently I am using MD5 as digest, but with lots of urls the data structure is growing big.

BTW: I am implementing a url-seen structure here and need the hash to check against, while minimizing false positives/negatives.
by BrowserUk (Pope) on May 31, 2007 at 18:17 UTC

Looking up Adler32 brought the following to light:

Weakness

Jonathan Stone discovered in 2001 that Adler-32 has a weakness for very short messages. He wrote "Briefly, the problem is that, for very short packets, Adler32 is guaranteed to give poor coverage of the available bits. Don't take my word for it, ask Mark Adler. :-)" The problem is that sum A does not wrap for short messages. The maximum value of A for a 128-byte message is 32640, which is below the value 65521 used by the modulo operation. An extended explanation can be found in RFC 3309, which mandates the use of CRC32 instead of Adler-32 for SCTP, the Stream Control Transmission Protocol.

As most urls are considerably shorter than 128 bytes, that suggests that Adler32 is not a good choice for your application and that you would better consider the CRC32 algorithm.

Whichever algorithm you use there is going to be some chance of false positives. However, you can reduce these chances by a considerable margin by using two different hashing algorithms. If you were to calculate and store both the CRC32 and the Alder32 for each url, the odds of a simultaneous collision for both hashes for any given pair of urls is vastly reduced.

Of course that means storing twice as much information which is a part of your original problem. However, there is a way of storing both sets of hash data such that it requires minimal memory (10kb or so) whilst giving almost the same lookup performance (15 microsecs/lookup compared to 5 microsecs) as Perl's hashes.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
by BrowserUk (Pope) on May 31, 2007 at 16:51 UTC
by BrowserUk (Pope) on Jun 01, 2007 at 08:28 UTC

A final word. Don't use Alder32 (alone) for this purpose!

Running Adler, CRC32 and both on several sets of 1 million randomly generated url-like strings ranging from 16 to 128 characters in length, Adler produced duplicates in ~1% of cases; CRC32 produced ~0.2%; and in several runs the combination of both found just 2 duplicates (circa. 0.002% but not enough samples to be judged representative).

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://618494]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (3)
As of 2021-12-03 17:28 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
R or B?

Results (29 votes). Check out past polls.

Notices?