Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?

How safe is truncating an MD5 digest string?

by lemmett (Sexton)
on Sep 10, 2001 at 22:20 UTC ( #111524=perlquestion: print w/replies, xml ) Need Help??
lemmett has asked for the wisdom of the Perl Monks concerning the following question:

Let me warn you ahead of time this is more of an algorithm question than a pure Perl question.

I recently got handed the job of maintaining a web application that registers visitors to a site for sweepstakes and free samples. It uses Apache::Session to store the visitor's information. This is the routine from that creates a session id to use as a key.

sub generate_id { return substr(MD5->hexhash(time(). {}. rand(). $$. blah'), 0, 16); }

To be a good key for this use on the web, I'm primarily worried about two things: is it unique, and how likely is it that given one key a person can guess another valid one.

The string normally returned from hexhash is 32 chars long. This routine is tossing 16 of those characters right out the window (64 of 128 bits).

If I convert to a base 64 representation for the 128 bit quantity (22 chars) I only lose 36 bits by truncation. If I do what seems the logical thing and override the truncation being done by the module, our current database column definitions are too short and several of the reporting tools have to be examined for bugs (best case) or partially rewritten (worst case). In order to get management sign-off for that, I need to present a pretty strong case for change.

My questions are:

  1. How much more likely is it that two data strings like those being passed into hexhash() will collide if you only use the first 64 bits rather than all 128? I can't find anything about predictability of MD5 substrings. Maybe I'm worried about nothing?
  2. Some of the applications we build require the users to type in a URL (containing a session id) that was sent to them on a postcard. Is there a reliable way in Perl to generate a shorter identifier that meets both the uniqueness and the "very difficult to guess" tests?

If I'm doing the math correctly, 28 bits gives me almost 270 million buckets (a very rough 1 to 1 mapping to the 285 million in the US). I could code that in hex as 7 characters or base64 as 5 chars (although then the user has to get the case of the letters correct when they type them in).

Maybe a better second question would be:
"Does anyone know where I can find a digest function that hashes uniformly into a 28-32 bit address space?"

BTW, if this offends you because it's not directly a Perl question, I'd love to take it elsewhere but I don't have access to newsgroups/IRC from work and don't know of a more appropriate place to ask. If you do, please suggest it.

Replies are listed 'Best First'.
Re (tilly) 1: How safe is truncating an MD5 digest string?
by tilly (Archbishop) on Sep 11, 2001 at 00:17 UTC
    As long as some of the private information is not guessable, the output should be impossible for other people to predict. With $N possible output states, if $m and $n are much smaller than $N, then the naive probability that $m guesses manages to match one of $n valid keys is approximately $m*$n/$N. So with 2**64 possible states, you would expect to see collisions between 4 billion states and 4 billion guesses. Drop either of those by an order of magnitude, and you probably have no collisions. Increase by an order of magnitude, and you probably have lots of them.

    This estimate is off because it disregards the possibility of multiple collisions, which cannot be discounted if overall a single collision becomes reasonably likely. A substantially less naive approximation uses the Poisson distribution, and says that the probability of a collision is 1 - exp(-$m*$n/$N). This is off becaue it discounts the extent to which $m guesses exhausts the overall search space, which effect in this case affects the result a few decimal places down, but does not matter.

    That estimate is therefore the one you should quote in estimating the effort it would take to create a probability of compromising your system.

    Personally I would use the 64-bit representation and do a tr to convert the two non-URL safe characters to URL-safe ones. Sure it might be overkill. But it is extra safety for free, why not take it?

Re: How safe is truncating an MD5 digest string?
by Anonymous Monk on Sep 10, 2001 at 23:25 UTC
    There is nothing wrong with shortening an md5 hash. Obviously it increases the possibility of a collision and you'll have to determine how many strings you are going to hash combined with the total number of possible outcomes for your chosen bitlength to determine the odds of a collision, but if you are using the md5 hash as a session identifier (as opposed to using it as a password hash -- that is, if someone gives you the md5 hash and you use that to look up some data rather than someone giving you some data and you running md5 on that data) you can loop until you have generated a unique id by adding a random character onto the end of the data being hashed.
Re: How safe is truncating an MD5 digest string?
by jepri (Parson) on Sep 10, 2001 at 23:39 UTC
    In addition, there should be no problems shortening a block like that in regards to predictability. Hashses are not supposed to have any patterns, and a hash that had the first half of its bits the same on subsequent callings would be a fairly poor hash

    I didn't believe in evil until I dated it.

Re: How safe is truncating an MD5 digest string?
by John M. Dlugosz (Monsignor) on Sep 10, 2001 at 23:45 UTC
    I beleive that the nature of a good cryptographic hash function is that the output appears totally random and uniform. The probability of a match is directly related to the number of bits you keep. So, if you keep 64 bits, you have a 2**-64 probability of a match. That's a whole lot more probable than 2**-128, but still pretty rare on your scale.

    a digest function that hashes uniformly into a 28-32 bit address space? Try CRC-32. Or pick n bits from MD5 or SHA1.

      CRC-32 is a bad idea. It's completely linear, which makes it easy to attack. Given a CRC of some data, it's not hard to compute the CRC of some other data that's mostly the same. Cryptographic hash functions like MD5 have nonlinear steps that make this difficult.

      The reason hash functions produce such long outputs is to resist birthday attacks. That's where someone finds two hash inputs that result in the same output. It sounds like your system won't be vulnerable to a birthday attack, though, since the users don't pick the input to the hash function - you pick it for them. I have to echo everyone else and say, "it's probably ok to shorten MD5."

      BTW, the name "birthday attack" comes from the observation that, if you walk into a room containing 20 people, it's unlikely that one of them will have the same birthday as you. However, it's fairly likely that two of them will have the same birthday as each other.

        Are you saying that given a size and a target CRC checksum other than zero, it's easy to compose a message of length size that produces the target checksum?

        Making a small change to the data, including changing one bit, should produce a totally different checksum, since that's what it was designed to do in the first place.


      Perhaps I am mistaken, but I believe some care should be taken in calculating the collision math when truncating the hash.


      If you use only 64 bits, the probability of a collision is not 1 in 2**64. Rather, it is dependent upon both the number of bits retained and the number of generated keys. For 1 million iterations, the probability of a collision is:

      P(n) = 1 - e ** -(((10**6)**2)/(2*2**64)) = 2.7 E -8

      which is much higher than 1/2**64 = 5.4 E -20

      This may not seem significant, but what happens if you select only 36 bits from the hash (9 characters)? You are nearly guaranteed a collision in 1 million iterations:

      P(n) = 1 - e ** -(((10**6)**2)/(2*2**36)) = 0.999

Re: How safe is truncating an MD5 digest string?
by grinder (Bishop) on Sep 11, 2001 at 01:13 UTC

    Over and above the other suggestions given so far, I would add that if you were worried about reducing the digest's randomness, then take the 16-byte binary value of the digest, split it into two 8-byte values, xor them together, and then produce a base64 digest from that.

    You might also consider just lopping off four bytes and and xoring them into the remaining bytes (or not). Either way, that would give you 1/2**96 chance of collision, something like 79 228 162 514 264 337 593 543 950 336, which would appear to be a reasonable margin of safety...

    g r i n d e r
Re: How safe is truncating an MD5 digest string?
by demerphq (Chancellor) on Sep 11, 2001 at 00:17 UTC
    As far as I know changing one bit of the input should cause a "random" %50 of the bits in the digest to flip as well. It should be fine.
    Incidentally I like the idea of using the stringified form of {} as part of your hash. Although maybe you might want to use substr({},7,7). Hmm. well I just checked this and on two runs I got the same value. Have you double checked that this works the way you think it does? I suspect that it might be reusing the same value each time.
    You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)
Re: How safe is truncating an MD5 digest string?
by projekt21 (Friar) on Sep 11, 2001 at 15:46 UTC
How to specify a UID with a typable URL?
by TGI (Parson) on Sep 12, 2001 at 20:56 UTC

    Getting someone to type in 16 randon characters isn't going to be easy, especially if you want (need) accuracy. Recent versions of PGP get around this by having a dictionary of words used to represent groups of bits, so when a user tells someone their fingerprint, the reciever can then jot down the words instead of a long easy to mix up string of numbers. If the technique isn't patented, it may be worth a try.

    You'll need to ballance the size of the dictionary against how much you want people to type. If you use four words, then each word needs to uniquely specify 16 bits, which means a wordlist of over 65,000 words. Requiring 8 words gives you a dictionary of only 256 words that represent 8 bits each. If you use a number like 6 words (11 bits each) just pad the code with zeros to get your words.

    0000 0000 : happy
    0000 0001 : joy
    0000 0010 : smell
    0000 0011 : trust
    1010 1000 : horse

    So a user typing in an url might type:

    Make sure your dictionary is filtered for unpleasant words, easily mispelled words, homophones and near homophones (hear and here, then and than).

    Good luck.

    TGI says moo

Re: How safe is truncating an MD5 digest string?
by no_slogan (Deacon) on Nov 16, 2001 at 01:06 UTC
    This is an old thread, but in case anyone is still interested, I ran across some possibly useful information.

    RFC 2104 makes some recommendations on shortening hash output for use with message authentication codes (MACs). They seem to think 80 bits is enough for that purpose. The SSH transport layer allows a 96-bit shortened MAC. This stuff is black magic. Of course, the original question wasn't really about a MAC anyway, but it seems related.

    Here's an article which tells why you shouldn't use mod_usertrack for authentication. They suggest using Apache::Cookie::Encrypted. I've never tried it, myself.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://111524]
Approved by root
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (2)
As of 2017-03-24 03:18 GMT
Find Nodes?
    Voting Booth?
    Should Pluto Get Its Planethood Back?

    Results (295 votes). Check out past polls.