in reply to MD5  what's the alternative
Take this with a large pinch of salt as IANAM, but...I think that the 'problem' is being overstated.
The upshot of it is that the time taken to find a piece of text that will produce the same md5 checksum is cut from a notional "few million years" to "a few days".
What seems to have been overlooked is that there is no guarantee that the piece of text with the same checksum is the same as the text that produced the checksum that is being attacked. In fact, it is most unlikely to be so.
It is fairly obvious that any hashing algorithm that is used to map any number of arbitrary length pieces of data to some fixed size number, will produce collisions, lots of them. In fact, an infinite number of collisions!
But in order to 'crack' a given checksum, you don't need to find a piece of text that produces the sought-after checksum. You need to find the piece that was used to produce the sought-after checksum.
That means you need to produce every piece of text that can produce the given checksum, and then decide which of those (an infinite number of possibles) is the piece that you're trying to decode.
The only real risk with using md5 is that it is possible that you might generate the same checksum from two concurrently active sessions (or other use).
The answer to this is to generate your md5, look in your database to see if it is already active, and generate a new one (for example: add a random number that isn't used to the end of whatever you are encoding).
Rinse and repeat until you get an md5 that is unique within your database. Chances are that, in practice, this collision will rarely if ever happen, but when it does, your code will then deal with it.
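The generate, check and retry loop described above could be sketched roughly like this. It is only an illustration: `new_session_id` is a hypothetical helper, the `active` set stands in for the database lookup, and the random salt is one assumed way of "adding a random number" to the input.

```python
import hashlib
import secrets


def new_session_id(seed: str, active: set) -> str:
    """Return an MD5 hex digest that is not already in `active`.

    `active` is a stand-in for the table of live session ids.
    """
    candidate = hashlib.md5(seed.encode()).hexdigest()
    while candidate in active:
        # Collision with a live session: append random salt and try again.
        salt = secrets.token_hex(8)
        candidate = hashlib.md5((seed + salt).encode()).hexdigest()
    active.add(candidate)
    return candidate
```

Calling it twice with the same seed gives two different ids, because the second call finds the first digest already active and retries with a salt.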
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." David Dunham
"Think for yourself!"  Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side."  tachyon
Re^2: MD5  what's the alternative
by blm (Hermit) on Aug 27, 2004 at 07:24 UTC

It is fairly obvious that any hashing algorithm that is used to map any number of arbitrary length pieces of data to some fixed size number, will produce collisions, lots of them. In fact, an infinite number of collisions!
Not trying to split hairs, but I would say a finite, large number but not infinite. There is a difference: if it was infinite you could keep finding collisions forever but if it is finite there would be a time that you would stop finding collisions.
And I don't think that it is overstated. It has often been said that there are collisions in MD5. Its strength was that it took a long time and/or a lot of computing power to find the thing that generates a particular hash, which I will call the plaintext. Now that time is lessened. It doesn't matter whether one finds the original plaintext or an alternative that generates the same hash; the computer will not know the difference. E.g., MD5 is used for passwords on Linux machines. If the hash of the plaintext entered at logon matches the hash in the shadow file, login will be granted, whether it was "the original plaintext" or a "collision plaintext".
Update: See posts by BrowserUk and adrianh below. I was thinking the collisions were restricted to the set of possible hashes. Bizarre!

a finite, large number but not infinite.
md5 hashing can be applied to inputs of any length. "Any length" means an infinite number of possible inputs. Adapting the old schoolboy definition of infinite, however many inputs you have hashed, you can always add one more byte to any one of them to get one more. Infinite.
Every input in this range of inputs will be mapped to one of the 2**128 possible md5s.
Therefore the number of inputs that will map to any given md5 is infinity / 2**128
Which is infinity.
It has often been said that there are collisions in MD5.
Of course there are. It could not be otherwise. Whenever the range of inputs is greater than the range of outputs, there will always be collisions. It simply is not possible for it to be otherwise. (Would that it were, it would make a brilliant compression algorithm:)
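The pigeonhole argument is easy to see on a shrunken output space. The sketch below is only an illustration: `tiny_hash` is a made-up helper that keeps just the top 16 bits of an md5, so there are 65536 possible outputs and a collision must turn up within 65537 distinct inputs.

```python
import hashlib
from itertools import count


def tiny_hash(data: bytes, bits: int = 16) -> int:
    """Top `bits` bits of the md5 digest: a deliberately tiny output space."""
    digest = hashlib.md5(data).digest()
    return int.from_bytes(digest, "big") >> (128 - bits)


def first_collision(bits: int = 16):
    """Hash the strings "0", "1", "2", ... until two share a tiny_hash value."""
    seen = {}
    for i in count():
        msg = str(i).encode()
        h = tiny_hash(msg, bits)
        if h in seen:
            return seen[h], msg, h
        seen[h] = msg
```

In practice the first collision tends to appear long before the pigeonhole limit is reached (the birthday effect), but the limit is what makes it unavoidable.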
Theoretically, if you generated the md5s for all of the (16-byte, binary-encoded) integers 0 .. 2**128 - 1, you might not find any duplicates. That is to say, the input range equals the output range, so it is possible that the md5 algorithm would be a 'perfect hash' for the input data.
And if you double the length of the input data to 32 bytes, then (theoretically) there will be, on average, 2**128 inputs (plaintexts) for every output (md5).
I don't think that this "perfect hash" status has ever been confirmed (or even suggested). It would simply take too long to verify.
The example duplicate plaintexts that were posted were 128 (8-bit) bytes each. That is 2**1024 inputs / 2**128 outputs.
On average, there will be 2**896 different 128-byte inputs for every md5. That makes it much easier to find a duplicate.
As far as your Linux passwords are concerned, not only would the cracker need to generate a plaintext that gave the same MD5, it would also need to be of a length that the password verification process would accept as a password. Whilst accepting longer passwords generally gives greater security, that is only true up to a certain limit. With md5, that limit is 16 (8-bit) bytes.
If the plaintexts were restricted to 32 x 8-bit bytes, then there will be 2**256 inputs and 2**128 outputs. Therefore collisions *must* exist, and each md5 will be produced by 2**128 different 32-byte inputs on average.
If the plaintext is restricted to 16 x 8-bit bytes, then it is possible (but not proven) that there are no duplicates.
If the inputs are restricted to 12 (8-bit) bytes, then it is very unlikely that any given md5 will have a duplicate. If you further restrict that to 7-bit ASCII, the likelihood is even less. Exclude control characters, less again.
This is one of those cases where more isn't always better.
Actually, in my experience, more is very rarely better in computing:)
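For anyone who wants to check the counting, the average number of fixed-length inputs per md5 is just the number of inputs divided by the number of outputs. A minimal sketch, assuming the mapping is evenly spread "on average" (`average_inputs_per_digest` is a name made up for this example):

```python
from fractions import Fraction

OUTPUTS = 2 ** 128  # number of distinct md5 values


def average_inputs_per_digest(n_bytes: int) -> Fraction:
    """Inputs of exactly n_bytes (8-bit) bytes, divided by possible md5s."""
    return Fraction(2 ** (8 * n_bytes), OUTPUTS)


for n in (12, 16, 32, 128):
    print(n, average_inputs_per_digest(n))
```

A value below 1, as for 12-byte inputs, means that most md5s have no input of that length at all, which is why a second short input matching a given md5 is so unlikely.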

Not trying to split hairs, but I would say a finite, large number but not infinite.
Erm. Nope. Map an infinite number of things into a finite number of boxes and you'll get an infinite number of collisions. Dividing infinity by N just gives you infinity, not "a finite, large number".

I agree with you that it doesn't matter whether the attacker generates the "original plaintext" or some "collision plaintext". This is because for passwords, the original plaintext is not stored anywhere, just the hashed MD5 value of the password is stored. That means that the computer can only compare the MD5 value of the entered password with the MD5 value which has been stored. Any text which produces the same MD5 value will be accepted as the correct password. If the original plaintext was stored somewhere, then the attacker would only need to steal the file with the plaintext passwords in it; which is why the plaintext is not stored.
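As a rough sketch of why that is so, here is a stripped-down check that only ever sees hashes. It is not how any real login system is implemented (real ones add a salt and repeat the hash many times); `make_record` and `check_login` are made-up names for this example.

```python
import hashlib
import hmac


def make_record(password: str) -> str:
    """What gets stored: only the md5 of the password, never the plaintext."""
    return hashlib.md5(password.encode()).hexdigest()


def check_login(attempt: str, stored_hash: str) -> bool:
    """Accept any attempt whose md5 equals the stored md5."""
    attempt_hash = hashlib.md5(attempt.encode()).hexdigest()
    return hmac.compare_digest(attempt_hash, stored_hash)
```

Since `check_login` has nothing but `stored_hash` to compare against, it cannot tell the original password from any other text with the same md5.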




"In fact, an infinite number of collisions!"
...
"Not trying to split hairs, but I would say a finite, large number but not infinite."
Sigh.
md5 is a reasonably good hash. As its number of inputs grows, even as it approaches infinity, there are no numbers in its range (output space) that cease to be 'hit.' So, theoretically, you can feed it an infinite number of consecutive inputs, and some subset of them will give you an infinite number of collisions on any given point in the output space.
But we're not talking about math in theory, we're talking about math in the real world. There are limits, based on speed of computation, memory size, disk size, etc. Based on these, there is a finite (though very large) number of possible md5 sums calculable in any given timeframe, even if that timeframe is "from the advent of the abacus to the heat death of the universe, when there's no entropy generatable and no work can be done."
Less facetiously, I'd say that the difficulty of computing md5 sums from, say, >1 Terabyte inputs means that there will be a very low number of collisions from inputs that large. Why bother, when you can get a collision from inputs of fewer than a thousand bytes?
So, the answer is really 'both.' In theory, there's an infinite number of collisions for any md5 output. In practice, there certainly isn't, and the number of collisions that will be generated in our lifetimes is finite to the point of being understandable, and maybe even visualized, by our little human brains.
edit: more importantly, "Therefore the number of inputs that will map to any given md5 is infinity / 2**128" is incorrect. You're assuming even distribution from the domain to the range. This is not proven (otherwise, given any consecutive set of (2**128) - 1 elements, they'd cover the range of md5 minus one, and adding one more would cover the range entirely; that is not yet proven to be true, and is in fact quite unlikely). So division doesn't follow; thus, while your conclusion is correct, your path to get there isn't.

edit: more importantly, "Therefore the number of inputs that will map to any given md5 is infinity / 2**128" is incorrect. You're assuming even distribution from the domain to the range. This is not proven (otherwise, given any consecutive set of (2**128) - 1 elements, they'd cover the range of md5 minus one, and adding one more would cover the range entirely; that is not yet proven to be true, and is in fact quite unlikely). So division doesn't follow; thus, while your conclusion is correct, your path to get there isn't.
I just noticed your edit.
I agree that there is (or should be) an implied "on average" in that statement.
However, even if the mapping of inputs to outputs is a less than even distribution, and the value "2**128" is less (it cannot be more), you still end up with a sum of infinity / some number, with the result that (on average) the number of (possible) messages that (would) map to any given md5 is infinity.
Re^2: MD5  what's the alternative
by Aristotle (Chancellor) on Aug 29, 2004 at 21:28 UTC

You miss the point.
The situation in which the "panic" applies is that I have received a message, and I have a trusted MD5 checksum of the message originally sent. (In practice, the checksum is protected using public key cryptography.) The message I received hashes to the same MD5 checksum as that of the original message.
How certain can I be that the message has not been altered in transit?
If an attacker can find a collision in reasonable time, he can pad a modified (or completely different) version of the message such that it hashes to the original checksum, and I can no longer trust the message I received any more than I could without the checksum.
In other words, a cryptographic signature is worthless if the hashing function is weak.
And it seems that MD5 has turned out to be weak.
That doesn't make it entirely useless. There are many scenarios outside cryptographic signatures where it is still useful.
Makeshifts last the longest.

No. I haven't missed the point.
I'm not sure where you got the quoted word "panic" from, but it wasn't any of my posts.
How certain can I be that the message has not been altered in transit?
 If the message you received is protected by PKC, then if the PKC is secure, so is the message.
 How did you receive the md5?
If it came inside the PK encrypted message, and the encryption is any good, how could the bad guys know it?
Let's just assume for a moment that you received your "trusted md5" via secure means. How would the bad guys know what it was in order to create a message that hashed to that same MD5?
If the MD5 was not transmitted to you by secure means, then it's an awful lot easier to alter both the message and the md5.
... he can pad a modified (or completely different) version of the message such that it hashes to the original checksum, ...
This is completely wrong! The attack consists of altering bits of bytes of the original message to produce a duplicate message.
The results will be the same length as the original, with a few bits altered.
 The attacker does not get to choose which bits of which bytes get altered.
 He does not get to choose what they get altered to.
 The process is purely mathematical.
Hence, if the message is plain text, it will show obvious signs of tampering. Wrong letters in words, accented characters that don't fit. It will probably look as though it has suffered from corruption in transit with a few bits having been dropped or switched. Chances are that the intent of the original message would be almost intact.
What the attacker cannot do, is change it to something specific.
The "weakness" in the md5 digital (not cryptographic) signature certainly does not allow the attacker to "pad a modified (or completely different) version of the message such that it hashes to the original checksum".
If the message is binary, either compressed, encrypted or both, the bits changed by the mathematical manipulations will likely render an executable unrunnable, a compressed file undecompressible, and an encrypted file undecryptable, these formats being more sensitive to random bit corruption than plain text.
Nowhere in any of the publicly available material that I have been able to access over the last couple of days does it suggest (or even hint) that it is possible to replace one message with another of entirely different meaning and then coerce it to produce the original md5. And I believe I've read everything available to read.
Your suggestion that this is possible, shows a distinct lack of understanding of the processes involved.
Please read the (rather extended) thread starting at 386470 and if you're still convinced that I have missed the point then /msg me and we can continue this offline.

You missed the point again. Here's how cryptographic signatures work:
Alice has a pair of keys. The encryption key is secret, the decryption key is published.
Alice wants to send a message to Bob such that Bob can be sure the message has not been altered in transit.
To that end, she encrypts a hash of her message with her secret encryption key, attaches the encrypted hash to the message, and sends them both together to Bob, over the same channel. She does not need to encrypt her message. Bob can use Alice's published decryption key to decrypt the hash to verify the message against it, and can be sure that the message has not been tampered with.
If Eve intercepts the message, she cannot send Bob an altered message with a new hash, because it would have to be encrypted using Alice's secret.
Therefore, even though the message has been sent in the clear over an insecure channel, Bob can trust it as much as he trusts Alice's published key.
But if Eve can feasibly find a collision in the hash function, she doesn't have to know Alice's key; she can just pad the altered message such that it matches the hash previously calculated and encrypted by Alice. Bob can no longer trust the message any more than he could without the addition of the hash.
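The scheme can be made concrete with a toy. Everything in this sketch is an assumption made for illustration: the key is a textbook-RSA pair far too small for real use, and `weak_hash` keeps only the first byte of an md5 so that a matching hash can be found by brute force in well under a second. It shows what happens once matching the hash becomes feasible; it does not show that full MD5 can be attacked this way.

```python
import hashlib

# Toy textbook-RSA key (p=61, q=53). D is Alice's secret, (N, E) is published.
N, E, D = 3233, 17, 2753


def weak_hash(message: bytes) -> int:
    """First byte of the md5: a deliberately weakened 8-bit hash."""
    return hashlib.md5(message).digest()[0]


def sign(message: bytes) -> int:
    """Alice: transform the hash with her secret key."""
    return pow(weak_hash(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Bob: recover the hash with the public key and compare."""
    return pow(signature, E, N) == weak_hash(message)


def forge(original: bytes, altered: bytes):
    """Eve: pad the altered message until its hash matches the original's."""
    target = weak_hash(original)
    for pad in range(100_000):
        candidate = altered + b" " + str(pad).encode()
        if weak_hash(candidate) == target:
            return candidate
    return None
```

Eve never touches `D`. She reuses the signature Alice made for the original message, and `verify` accepts the padded forgery because the two hashes are equal.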




