http://www.perlmonks.org?node_id=386246

kiat has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

In reference to the node On showing the weakness in the MD5 digest function and getting bitten by scalar context by grinder.

I'm using Digest::MD5 to generate unique cookie values as well as for password authentication.

Should I or should I not continue to use Digest::MD5 for those operations? What's the alternative?

Thanks in anticipation :)

Replies are listed 'Best First'.
Re: MD5 - what's the alternative
by ikegami (Patriarch) on Aug 27, 2004 at 05:46 UTC
    You can use SHA-1 (Digest::SHA1) and the bigger versions of it SHA-256, SHA-384 and SHA-512 (collectively known as SHA-2) (Digest::SHA2). The interface should be nearly identical. In this case, more bits means more secure, more bytes to save, and longer computing times. Here seems a good place to start.
Re: MD5 - what's the alternative
by dws (Chancellor) on Aug 27, 2004 at 05:57 UTC

    I'm using Digest::MD5 to generate unique cookie values as well as for password authentication. Should I or should I not continue to use Digest::MD5 for those operations?

    The vulnerability is that if a 3rd party intercepts the MD5 hash, they can spend a few days of compute time to discover a plain-text input that will produce the same hash. You can mitigate or effectively eliminate the threat by limiting the length plaintext passwords (say, 12 to 16 characters). Then, even if an attacker finds a longer text string that results in the same MD5 hash, they're cut off by the limit.

    However, unless you're mixing user-supplied plaintext with some secret string before generating a hash, you're open to dictionary attacks.

    MD5 is way down on the list of things I'm worrying about right now, but there's always the chance that I'm being naive.

      You can also throw some private bits into the data stream before hashing.

      If user supplies PW, which gets hashed to PW', and a 3rd party gets PW', the weakness allows them to discover another password that also hashes to PW'.

      If you add additional bits to the supplied PW -- PWpri, and hash that to PWpri', which the attacker gets, and using the techniques described, comes up with some bits that also hash to PWpri', they still can't come up with a valid PW that when pri gets added to it also produce PWpri'.

      Of course, if your security is such that an attacker can discover PWpri', they can probably find pri out anyway, and you're probably owned at that point anyway, so discovering PW is the least of your concerns.

      I don't understand why you think limiting the length of plaintext passwords to say, 12 to 16 characters will mitigate or eliminate the threat. Surely all that would do is reduce the search space that the attacker has to try to find a matching MD5 hash, making it even easier and quicker to crack the system. That's unless you think that the attacker won't know that you are limiting password length, in which case, aren't you relying on "security through obscurity"? As we all should know, security through obscurity gives a false sense of security, rather than actual security.
        I think the idea is that if you want to find THE 16 character plaintext it takes 2^128 operations. The new vulnerability means you can find an equivalent (but longer) plaintext in 2^40. So if you limit the password to 16 characters then a longer plaintext with an identical hash is no use. That said, I could be completely wrong about the vulnerability always producing longer strings.
Re: MD5 - what's the alternative
by BrowserUk (Patriarch) on Aug 27, 2004 at 06:40 UTC

    Take this with a large pinch of salt as IANAM, but...I think that the 'problem' is being over-stated.

    The upshot of it is that the time taken to find a piece of text that will produce the same md5 checksum is cut from a notional "few million years" to "a few days".

    What seems to have been over looked is that there is no guarentee that the piece of text with the same checksum, is the same as the text that produced the checksum that is being attacked. In fact, it is most unlikely to be so.

    It is fairly obvious that any hashing algorithm that is used to map any number of arbitrary length pieces of data to some fixed size number, will produce collisions--lots of them. In fact, an infinite number of collisions!

    But in order to 'crack' a given checksum, you don't need to find a piece of text that produces the sort after checksum. You need to find the piece that was used to produce the sort after checksum.

    That means you need to produce every piece of text that can produce the given checksum, and then decide which of that (infinite number of possibles), is the piece that your trying to decode.

    The only real risk with using md5 is that it is possible that you might generate the same checksum from two concurrently active sessions (or other use).

    The answer to this is to generate your md5, look in your database to see if it is already active, and generate a new one (for example: add a random number that isn't used to the end of whatever you are encoding).

    Rinse and repeat until you get an md5 that is unique within your database. Chances are in practice, this collision will rarely if ever happen, but whenit does, your code within then deal with it.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

      It is fairly obvious that any hashing algorithm that is used to map any number of arbitrary length pieces of data to some fixed size number, will produce collisions--lots of them. In fact, an infinite number of collisions!

      Not trying to split hairs, but I would say a finite, large number but not infinite. There is a difference: if it was infinite you could keep finding collisions forever but if it is finite there would be a time that you would stop finding collisions.

      And I don't think that it is overstated. It has often been said that there are collisions in MD5. Its strength was that it took a long time and/or alot of computing power to find the thing that generates a particular hash which i will call plaintext. Now that time is lessened. It doesn't matter if one finds the original plain text or an alternative the generates the same hash the computer will not know the difference. eg MD5 is used for passwords on linux machines. No matter what if the plaintext at logon matches the hash in the shadow file login will be granted whether it was "the original plaintext" or a "collision plaintext"

      Update:See posts by BrowserUk and adrianh below. I was thinking the collisions were restricted to the set of possible hashes. Bizarre!

        a finite, large number but not infinite.

        md5 hashing can be applied to inputs of any length. "Any length" means an infinite number of possible inputs. Adapting the old schoolboy definition of infinite, however many inputs you have hashed, you can always add one more byte to any one of them to get one more. Infinite.

        Every input in this range of inputs will be mapped to one of the 2**128 possible md5s.

        Therefore the number of inputs that will map to any given md5 is infinity / 2**128

        Which is infinity.


        It has often been said that there are collisions in MD5.

        Of course there are. It could not be otherwise. Whenever the range of inputs is greater than the range of outputs, there will always be collisions. It simply is not possible for it to be otherwise. (Would that it were, it would make a brilliant compression algorithm:)

        Theoretically, if you generated the md5s for all of the (16-byte binary encoded) integers 0 .. 2**128 -1, you might not find any duplicates. That is to say, the input range equals the output range, so it is possible that the md5 algorithm would be a 'perfect hash' for the input data.

        And if you double the length of the input data to 32 bytes, then (theoretically) there will be exactly 2 inputs (plaintexts) for every output (md5).

        I don't think that this "perfect hash" status has ever been confirmed (or even suggested). It would simply take too long to verify.

        The example of duplicate plaintexts posted were 128 (8-bit) bytes each. 2**1024 inputs / 2**128 outputs.

        There will be at least 8 duplicate, 128-byte inputs for every md5. That makes it much easier to find a duplicate.

        As far as your Linux passwords are concerned, not only would the cracker need to generate a plaintext that gave the same MD5, it would also need to be of a length that the password verification process would accept as a password. Whilst accepting longer passwords generally gives greater security, that is only true up to a certain limit. With md5, that limit is 16 (8-bit) bytes.

        If the plaintexts where restricted to 32 x 8-bit bytes, then there will be 2**256 inputs and 2**128 outputs. Therefore there *must be* at least 2 x 32-byte inputs that will produce every md5.

        If the plaintext is restricted to 16 x 8-bit bytes, then it is possible (but not proven), that the are no duplicates.

        If the inputs are restricted to 12 (8-bit bytes), then it is very unlikely that it will be duplicates. If you further restrict that to 7-bit ASCII, even less likelyhood. Exclude control characters, less again.

        This is one of those cases where more isn't always better.

        Actually, in my experience, more is very rarely better in computing:)


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
        Not trying to split hairs, but I would say a finite, large number but not infinite.

        Erm. Nope. Map an infinite number of things into a finite number of boxes and you'll get an infinite number of collisions. Dividing infinity by N just gives you infinity, not "a finite, large number".

        I agree with you that it doesn't matter whether the attacker generates the "original plaintext" or some "collision plaintext". This is because for passwords, the original plaintext is not stored anywhere, just the hashed MD5 value of the password is stored. That means that the computer can only compare the MD5 value of the entered password with the MD5 value which has been stored. Any text which produces the same MD5 value will be accepted as the correct password. If the original plaintext was stored somewhere, then the attacker would only need to steal the file with the plaintext passwords in it; which is why the plaintext is not stored.
        "In fact, an infinite number of collisions!"
        ...
        "Not trying to split hairs, but I would say a finite, large number but not infinite."

        Sigh.

        md5 is a reasonably good hash. As it's input # grows, even as it approaches infinity, there are no numbers in it's range (output space) that cease to be 'hit.' So, theoretically, you can feed it an infinite number of consecutive inputs, and some subset of them they will give you an infinite number of collisions on any given point on the output space.

        But we're not talking about math in theory, we're talking about math in the real world. There are limits, based on speed of computation, memory size, disk size, etc. Based on these, there is a finite (though very large) number of possible md5 sums calculable in any given timeframe - even if that timeframe is "from the advent of the abacus to the heat death of the universe, when there's no entropy generatable and no work can be done."

        Less facetiously, I'd say that the difficulty of computing md5 sums from, say, >1 Terabyte inputs means that there will be a very low number of collisions from inputs that high. Why bother, when you can get a collision from under-quadruple-digit bytes?

        So, the answer is really 'both.' In theory, there's an infinite number of collisions for any md5 output. In practice, there certainly isn't, and the number of collisions that will be generated in our lifetimes is finite to the point of being understandable, and maybe even visualized, by our little human brains.

        edit: more importantly, "Therefore the number of inputs that will map to any given md5 is infinity / 2**128" is incorrect. You're assuming even distribution from the domain to the range. This is not proven (otherwise given any consecutive set of (2**128)-1 elements, they'd cover the range of md5 minus one, and adding one more would cover the range entirely. Not yet proven to be true, and in fact quite unlikely). So division doesn't follow, thus while your conclusion is correct your path to get there isn't.

      You miss the point.

      The situation in which the "panic" applies is that I have received a message, and I have a trusted MD5 checksum of the message originally sent. (In practice, the checksum is protected using public key cryptography.) The message I received hashes to the same MD5 checksum as that of the original message.

      How certain can I be that the message has not been altered in transit?

      If an attacker can find a collision in reasonable time, he can pad a modified (or completely different) version of the message such that it hashes to the original checksum, and I can no longer trust the message I received any more than I could without the checksum.

      In other words, a cryptographic signature is worthless if the hashing function is weak.

      And it seems that MD5 has turned out to be weak.

      That doesn't make it entirely useless. There are many scenarios outside cryptographic signatures where it is still useful.

      Makeshifts last the longest.

        No. I haven't missed the point.

        I'm not sure where you got the quoted word "panic" from, but it wasn't any of my posts.

        How certain can I be that the message has not been altered in transit?
        1. If the message you recieved is protected by PKC, then if the PKC is secure, so is the message.
        2. How did you receive the md5?

          If it came inside the PK encrypted message, and the encyption is any good, how could the bad guys know it?

        Let's just assume for a moment that you received your "trusted md5" via secure means. How would the bad guys know what is was in order to create a message that hashed to that same MD5?

        If the MD5 was not transmitted to you by secure means, then it's an aweful lot easier to alter both the message and the md5.

        ... he can pad a modified (or completely different) version of the message such that it hashes to the original checksum, ...

        This is completely wrong! The attack consists of altering bits of bytes of the original message to produce a duplicate message.

        The results will be the same length as the original, with a few bits altered.

        1. The attacker does not get to choose which bits of which bytes get altered.
        2. He does not get to choose what they get altered too.
        3. The process is purely matehmatical.

        Hence, if the message is plain text, it will show obvious signs of tampering. Wrong letters in words, accented characters that don't fit. It will probably look as though it has suffered from corruption in transit with a few bits having been dropped or switched. Chances are that the intent of the original message would be almost intact.

        What the attacker cannot do, is change it to something specific.

        The "weakness" in the md5 digital (not cryptographic) signature , certainly does not allow the attacker to "pad a modified (or completely different) version of the message such that it hashes to the original checksum".

        If the message is binary, either compressed, encrypted or both, the effect of the bits changed by the mathematical manipulations, will likely render an executable unrunnable; a compressed file undecompressible; and an encrypted file undecryptable. These format being more more sensitive to random bit corruption than plain text.

        Nowhere, in any of the publically available material that I have been able to access over the last couple of days does is suggest (or even hint) that it is possible to replace one message with another of entirely different meaning and then coerse it to produce the original md5. And I believe I've read everything available to read.

        Your suggestion that this is possible, shows a distinct lack of understanding of the processes involved.

        Please read the (rather extended) thread starting at 386470 and if your still convinced that I have missed the point then /msg me and we can continue this offline.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: MD5 - what's the alternative
by dragonchild (Archbishop) on Aug 27, 2004 at 11:55 UTC
    Personally, I use MD5 to generate one-use authentication tokens. On the next request, I verify the token presented is one I'm expecting, delete it from the serverside database, and generate a new one-use token to be sent back with the response. I also check for collisions (but haven't had one in two weeks of banging on it), just for completeness.

    Personally, I think this is about as secure as it could be, given the nature of the Net, even using MD5. The token is only viable for a certain period of time if it's unused, after which it's scrubbed from the tables. If it is used, the token lives for a much shorter period of time.

    As for password authentication ... I'd probably use SHA-512 for that. The expense in the larger hash is offset by the fact that it only happens once/session. I generate tokens every page request, so it has to be cheaper, CPU-wise.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

Re: MD5 - what's the alternative
by gmpassos (Priest) on Aug 28, 2004 at 17:49 UTC
    This depends in the way that you are using a MD5 "fingerprint".

    The most important thing is to know if you have a static or dynamic fingerprint. This means, if the output of MD5 will be generated (with different outputs) every day, or if you have a static fingerprint (like in passwd).

    If you have a dynamic fingerprint will be much more harder to crack it, but if it's static we have all the time that we need to crack it.

    Now about CRACK a MD5. Well, what you can know is that is impossible to get back the original text. Like we say, is a fingerprint, not all the body.

    What is possible to do with MD5, is to find by brute force a string that can produce again the fingerprint, but this doesn't mean that the password used by the user is that string found with brute force. In other words, we always have more than 1 string (actually much more) that produce the same fingerprint. This exists for any "digest" algorithm, the question is that MD5 is faster than SHA, and with MD5 we can use some tricks to reduce the number of attacks.

    So, use a dynamic fingerprint and always put with the original string some extra data, specially dynamic data, to make the brute force harder,

    Graciliano M. P.
    "Creativity is the expression of the liberty".