in reply to Re: MD5  what's the alternative in thread MD5  what's the alternative
It is fairly obvious that any hashing algorithm that is used to map any number of arbitrary length pieces of data to some fixed size number, will produce collisionslots of them. In fact, an infinite number of collisions!
Not trying to split hairs, but I would say a finite, large number but not infinite. There is a difference: if it was infinite you could keep finding collisions forever but if it is finite there would be a time that you would stop finding collisions.
And I don't think that it is overstated. It has often been said that there are collisions in MD5. Its strength was that it took a long time and/or alot of computing power to find the thing that generates a particular hash which i will call plaintext. Now that time is lessened. It doesn't matter if one finds the original plain text or an alternative the generates the same hash the computer will not know the difference. eg MD5 is used for passwords on linux machines. No matter what if the plaintext at logon matches the hash in the shadow file login will be granted whether it was "the original plaintext" or a "collision plaintext"
Update:See posts by BrowserUk and adrianh below. I was thinking the collisions were restricted to the set of possible hashes. Bizarre!
Re^3: MD5  what's the alternative
by BrowserUk (Patriarch) on Aug 27, 2004 at 08:41 UTC

a finite, large number but not infinite.
md5 hashing can be applied to inputs of any length. "Any length" means an infinite number of possible inputs. Adapting the old schoolboy definition of infinite, however many inputs you have hashed, you can always add one more byte to any one of them to get one more. Infinite.
Every input in this range of inputs will be mapped to one of the 2**128 possible md5s.
Therefore the number of inputs that will map to any given md5 is infinity / 2**128
Which is infinity.
It has often been said that there are collisions in MD5.
Of course there are. It could not be otherwise. Whenever the range of inputs is greater than the range of outputs, there will always be collisions. It simply is not possible for it to be otherwise. (Would that it were, it would make a brilliant compression algorithm:)
Theoretically, if you generated the md5s for all of the (16byte binary encoded) integers 0 .. 2**128 1, you might not find any duplicates. That is to say, the input range equals the output range, so it is possible that the md5 algorithm would be a 'perfect hash' for the input data.
And if you double the length of the input data to 32 bytes, then (theoretically) there will be exactly 2 inputs (plaintexts) for every output (md5).
I don't think that this "perfect hash" status has ever been confirmed (or even suggested). It would simply take too long to verify.
The example of duplicate plaintexts posted were 128 (8bit) bytes each. 2**1024 inputs / 2**128 outputs.
There will be at least 8 duplicate, 128byte inputs for every md5. That makes it much easier to find a duplicate.
As far as your Linux passwords are concerned, not only would the cracker need to generate a plaintext that gave the same MD5, it would also need to be of a length that the password verification process would accept as a password. Whilst accepting longer passwords generally gives greater security, that is only true up to a certain limit. With md5, that limit is 16 (8bit) bytes.
If the plaintexts where restricted to 32 x 8bit bytes, then there will be 2**256 inputs and 2**128 outputs. Therefore there *must be* at least 2 x 32byte inputs that will produce every md5.
If the plaintext is restricted to 16 x 8bit bytes, then it is possible (but not proven), that the are no duplicates.
If the inputs are restricted to 12 (8bit bytes), then it is very unlikely that it will be duplicates. If you further restrict that to 7bit ASCII, even less likelyhood. Exclude control characters, less again.
This is one of those cases where more isn't always better.
Actually, in my experience, more is very rarely better in computing:)
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." David Dunham
"Think for yourself!"  Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side."  tachyon
 [reply] [d/l] [select] 
Re^3: MD5  what's the alternative
by adrianh (Chancellor) on Aug 27, 2004 at 08:33 UTC

Not trying to split hairs, but I would say a finite, large number but not infinite.
Erm. Nope. Map an infinite number of things into a finite number of boxes and you'll get an infinite number of collisions. Dividing infinity by N just gives you infinity, not "a finite, large number".
 [reply] 
Re^3: MD5  what's the alternative
by beable (Friar) on Aug 27, 2004 at 08:17 UTC

I agree with you that it doesn't matter whether the attacker generates the "original plaintext" or some "collision plaintext". This is because for passwords, the original plaintext is not stored anywhere, just the hashed MD5 value of the password is stored. That means that the computer can only compare the MD5 value of the entered password with the MD5 value which has been stored. Any text which produces the same MD5 value will be accepted as the correct password. If the original plaintext was stored somewhere, then the attacker would only need to steal the file with the plaintext passwords in it; which is why the plaintext is not stored.
 [reply] 

 [reply] 

That's something else I was wondering about. Where is the attacker supposedly snooping that they can capture an MD5'd password? Why doesn't the attacker just snoop to capture the plaintext password instead? This "vulnerability" is very light on details.
 [reply] 

Re^3: MD5  what's the alternative
by danderson (Beadle) on Aug 27, 2004 at 21:15 UTC

"In fact, an infinite number of collisions!"
...
"Not trying to split hairs, but I would say a finite, large number but not infinite."
Sigh.
md5 is a reasonably good hash. As it's input # grows, even as it approaches infinity, there are no numbers in it's range (output space) that cease to be 'hit.' So, theoretically, you can feed it an infinite number of consecutive inputs, and some subset of them they will give you an infinite number of collisions on any given point on the output space.
But we're not talking about math in theory, we're talking about math in the real world. There are limits, based on speed of computation, memory size, disk size, etc. Based on these, there is a finite (though very large) number of possible md5 sums calculable in any given timeframe  even if that timeframe is "from the advent of the abacus to the heat death of the universe, when there's no entropy generatable and no work can be done."
Less facetiously, I'd say that the difficulty of computing md5 sums from, say, >1 Terabyte inputs means that there will be a very low number of collisions from inputs that high. Why bother, when you can get a collision from underquadrupledigit bytes?
So, the answer is really 'both.' In theory, there's an infinite number of collisions for any md5 output. In practice, there certainly isn't, and the number of collisions that will be generated in our lifetimes is finite to the point of being understandable, and maybe even visualized, by our little human brains.
edit: more importantly, "Therefore the number of inputs that will map to any given md5 is infinity / 2**128" is incorrect. You're assuming even distribution from the domain to the range. This is not proven (otherwise given any consecutive set of (2**128)1 elements, they'd cover the range of md5 minus one, and adding one more would cover the range entirely. Not yet proven to be true, and in fact quite unlikely). So division doesn't follow, thus while your conclusion is correct your path to get there isn't.  [reply] 

edit: more importantly, "Therefore the number of inputs that will map to any given md5 is infinity / 2**128" is incorrect. You're assuming even distribution from the domain to the range. This is not proven (otherwise given any consecutive set of (2**128)1 elements, they'd cover the range of md5 minus one, and adding one more would cover the range entirely. Not yet proven to be true, and in fact quite unlikely). So division doesn't follow, thus while your conclusion is correct your path to get there isn't.
I just noticed your edit.
I agree, that there is an implied (or should be) "on average" inserted into that statement.
However, even if the mapping of inputs to outputs is a less than even distribution, and the value "2**128" is less (it cannot be more), you still end up with a sum of infinity / somenumber, with the result that (on average) the number of (possible) messages that (would) map to any given md5, is infinity.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." David Dunham
"Think for yourself!"  Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side."  tachyon
 [reply] [d/l] 

