http://www.perlmonks.org?node_id=386336


in reply to On showing the weakness in the MD5 digest function and getting bitten by scalar context

MD5 verification of file integrity is often used in conjunction with file compression such as gzip or bzip2.

Finding an arbitrary set of bits that results in the same MD5 hash (which has been shown to be possible) is one thing, but finding a specific set of bits that results in the same MD5 hash and can also be successfully uncompressed is much different. Compounding that with being able to alter the code in a meaningful way that results in a malicious trojan horse makes the task sufficiently difficult that I'm not terribly worried about it at this point.
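As a concrete sketch of that distribute-and-verify workflow (the filenames here are purely illustrative, using the coreutils md5sum tool):

```shell
# Sketch: publish an MD5 digest alongside a compressed file, then verify it.
printf 'example payload\n' > payload.txt
gzip -c payload.txt > payload.txt.gz         # compress
md5sum payload.txt.gz > payload.txt.gz.md5   # digest published next to the archive
md5sum -c payload.txt.gz.md5                 # downloader re-checks: "payload.txt.gz: OK"
```

Any corruption or substitution of payload.txt.gz would make the final check fail -- unless, per the discussion above, the attacker's replacement collides with the original digest.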

Re^2: On showing the weakness in the MD5 digest function and getting bitten by scalar context
by halley (Prior) on Aug 27, 2004 at 18:37 UTC
    No doubt. There are a LOT of namby-pamby Chicken Littles running around crying about MD5's weaknesses.

    It's a HASH, for crying out loud. It's not meant to be provably perfect at identifying unique data streams.

    • It is clear that the MD5 digest cannot be unique for every data stream, since it is potentially far shorter than the original data;
    • It is clear that an endless variety of data streams will have the SAME digest, because you can construct pretty much any data stream you want;
    • It is clear that some yahoo might even be able to make a couple of "falsified" data streams which produce the same MD5 hash;
    • It is clear that the MD5 hash is as mathematically valid as it ever has been, since the algorithm hasn't changed at all.
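    The first point -- a fixed-width digest cannot be unique over larger inputs -- is easy to see on the command line (a quick sketch using the coreutils md5sum tool):

```shell
# An MD5 digest is always 128 bits (32 hex characters), whatever the input size.
printf '' | md5sum                  # empty input: d41d8cd98f00b204e9800998ecf8427e
head -c 1000000 /dev/zero | md5sum  # a megabyte of zeros: still just 32 hex chars
```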

    Say you were expecting message M, with hash H. You instead get message N which also happens to hash to H.

    • If you were expecting M to be about 20 MB, what are the chances of N being exactly, or even approximately, the same size?
    • If you were expecting M to be a tarball or other application data, what are the chances of N being decompressible or otherwise parsable? That is, that the falsified data also happens to conform to the protocol?
    • If you were expecting M to be executable, what are the chances of N being executable? That is, with no introduction of obviously broken execution flaws?

    You're worried about MD5 digests for showing falsification of data, right? Where some attacker alters the message? I contend that it will be pretty darned hard to find a useful attack on a message while maintaining MD5 integrity.

    To Allied Commanders: Raiders Expected on Supply Lines in Sector 5. Keep on guard. --HQ
    To Allied Commanders: No Raids Reported on Supply Lines for Sector 5, 6, or 8. Let Freedom Reign. --HQ

    Until someone shows that you can (1) take any arbitrary data set M, (2) falsify it to data set N, by (3) modifying a limited portion of M in an application-useful way and (4) adding less than a gigabyte of additional data, and (5) still come out with M=>H and N=>H hash equivalence, I'll trust MD5, thanks.

    --
    [ e d @ h a l l e y . c c ]

      It's a HASH, for crying out loud. It's not meant to be provably perfect at identifying unique data streams.
      It IS meant to be computationally infeasible to find two messages with the same hash. If someone has found a way to do that, MD5 has failed at its original purpose.
      Until someone shows that you can (1) take any arbitrary data set M, (2) falsify it to data set N, by (3) modifying a limited portion of M in an application-useful way and (4) adding less than a gigabyte of additional data, and (5) still come out with M=>H and N=>H hash equivalence, I'll trust MD5, thanks.
      This is because you are ignorant. "I can't imagine an attack, therefore there is no attack."
        It IS meant to be computationally infeasible to find two messages with the same hash.

        Total bollocks!

        It is mathematically IMPOSSIBLE to reduce every 32-byte message to a 16-byte hash and not have duplicates. If you believe otherwise, I have an infinite compression algorithm that you might like to fund the development of?
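        The counting argument even works at toy scale: truncate MD5 to a single hex character, giving only 16 possible "digests", and any 17 distinct inputs are guaranteed to collide (a sketch using md5sum; the inputs are arbitrary):

```shell
# Pigeonhole at toy scale: 17 inputs into a 16-value "digest"
# (the first hex char of MD5), so at least two must share a digest.
for i in $(seq 1 17); do
    printf '%s' "$i" | md5sum | cut -c1
done | sort | uniq -d    # prints every duplicated "digest" value
```

        Real MD5 has 2^128 values rather than 16, but the same logic applies the moment the message space exceeds 2^128 -- e.g. for all 32-byte messages.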

        The only ignorance is on the part of those who don't understand simple logic--like you, apparently.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re^2: On showing the weakness in the MD5 digest function and getting bitten by scalar context
by jdalbec (Deacon) on Aug 29, 2004 at 02:49 UTC
    You can add arbitrary bytes to the end of a gzipped file and it will still uncompress to the same thing. This is how the first illegal prime was discovered.
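    That property is easy to verify (a sketch; gunzip warns about the trailing garbage on stderr but still recovers the original data):

```shell
printf 'original data\n' | gzip > f.gz
md5sum f.gz                                 # digest of the clean archive
printf 'ARBITRARY TRAILING BYTES' >> f.gz   # append junk after the gzip stream
gunzip -c f.gz 2>/dev/null                  # still prints "original data"
md5sum f.gz                                 # ...but the digest has changed
```

    So an attacker targeting a gzipped file gets the appended region "for free": the decompressed output is unaffected, and only the digest of the container has to be steered.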