PerlMonks
On showing the weakness in the MD5 digest function and getting bitten by scalar context

by grinder (Bishop)
on Aug 26, 2004 at 22:18 UTC

For those of you who have been living in a cave for the past few weeks (or without net access, voluntarily, or not, as in my case), you may not be aware that researchers have found a method that seriously weakens the usefulness of MD5, and, to a certain extent, SHA-1 digests.

MD5 produces a fixed 128-bit digest from any given string; SHA-1 does the same, except that its digest is 160 bits long. The strength of the digests comes from the fact that, given the digest of a particular string (say, the text of this node), finding another string that maps to the same digest (i.e. a collision in the MD5-space) should theoretically take 2^128 steps. The research shows how to find a collision reliably in 2^40 steps. As someone points out:

The difference between those two numbers is several days, versus several million years.
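For concreteness, the two digest sizes are easy to check with the core Digest::MD5 and Digest::SHA modules; a quick sketch (the input string here is arbitrary):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5 md5_hex);
use Digest::SHA qw(sha1 sha1_hex);

my $text = 'the text of this node';

# MD5 always yields 16 bytes (128 bits) and SHA-1 always 20 bytes
# (160 bits), regardless of the input length.
printf "MD5:   %s (%d bits)\n", md5_hex($text), 8 * length md5($text);
printf "SHA-1: %s (%d bits)\n", sha1_hex($text), 8 * length sha1($text);
```

Digest::SHA may need installing from CPAN on older perls; Digest::SHA1 works the same way if that is what you have.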

Anyway, the point of all this is that while trying to find out more about the issue, I came across a weblog-cum-forum where someone posted a Perl program demonstrating the flaw. Someone then produced an improved version that was more cross-platform (not relying on an external program).

The really funny part was when another participant posted the following reply:

Not being able to see the difference in the bytestreams, I decided to run the bytes through SHA1. When I saw that the SHA1s were the same hash... AND realized that the bytestreams were different... Well I almost poo'ed my pants.

Then I looked more closely at the perl code and realized that it has a fatal flaw in it. The program as written above finds the md5sum of '128', not the md5sum of the hex stream.

Yes folks, the person who wrote the original snippet created an array of 128 values and assigned it to a scalar, thereby storing the length of the array rather than its contents.
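The slip is easy to reproduce: assigning an array to a scalar puts the array in scalar context, which yields its element count. A minimal sketch (the byte value is just a placeholder):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my @bytes = (chr 0xd1) x 128;   # an array of 128 byte values
my $oops  = @bytes;             # scalar context: $oops is the COUNT, 128
my $data  = join '', @bytes;    # what was intended: the bytes themselves

print md5_hex($oops), "\n";     # the md5sum of the string '128'
print md5_hex($data), "\n";     # the md5sum of the actual byte stream
print "hashed '128', not the stream!\n"
    if md5_hex($oops) eq md5_hex('128');
```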

For more details, take a look at Freedom to Tinker - Report from Crypto 2004.

Oh, and it's time to stop using MD5. Seriously. If you are using it for security (rather than just checking that your file was copied correctly), use a longer hash. Use SHA-512 if you can.

- another intruder with the mooring of the heat of the Perl

Re: On showing the weakness in the MD5 digest function and getting bitten by scalar context
by ctilmes (Priest) on Aug 26, 2004 at 23:29 UTC
nitpick
by sleepingsquirrel (Hermit) on Aug 26, 2004 at 23:57 UTC
    Yes folks, the person who wrote the original snippet created an array of 128 values, and assigned it to scalar.
    Just a minor nitpick, it was the second poster who made the scalar/array context slip. (I'm the guy who posted the first code snippet;). True to my .sig, I didn't post my code until I knew I had the correct answer. In any case, the md5sum of the correct byte streams should be...
    79054025255fb1a26e4bc422aef54eb4
    
    As also noted on this page. Just for the heck of it, here's my original code. (and of course you should realize that it uses backticks to run a program called "echo" and "md5sum", hence the unix qualifier)
    #!/usr/bin/perl -w
    use strict;
    my $v1=<<END_V1;
    d1 31 dd 02 c5 e6 ee c4 69 3d 9a 06 98 af f9 5c
    2f ca b5 87 12 46 7e ab 40 04 58 3e b8 fb 7f 89
    55 ad 34 06 09 f4 b3 02 83 e4 88 83 25 71 41 5a
    08 51 25 e8 f7 cd c9 9f d9 1d bd f2 80 37 3c 5b
    d8 82 3e 31 56 34 8f 5b ae 6d ac d4 36 c9 19 c6
    dd 53 e2 b4 87 da 03 fd 02 39 63 06 d2 48 cd a0
    e9 9f 33 42 0f 57 7e e8 ce 54 b6 70 80 a8 0d 1e
    c6 98 21 bc b6 a8 83 93 96 f9 65 2b 6f f7 2a 70
    END_V1
    my $v2=<<END_V2;
    d1 31 dd 02 c5 e6 ee c4 69 3d 9a 06 98 af f9 5c
    2f ca b5 07 12 46 7e ab 40 04 58 3e b8 fb 7f 89
    55 ad 34 06 09 f4 b3 02 83 e4 88 83 25 f1 41 5a
    08 51 25 e8 f7 cd c9 9f d9 1d bd 72 80 37 3c 5b
    d8 82 3e 31 56 34 8f 5b ae 6d ac d4 36 c9 19 c6
    dd 53 e2 34 87 da 03 fd 02 39 63 06 d2 48 cd a0
    e9 9f 33 42 0f 57 7e e8 ce 54 b6 70 80 28 0d 1e
    c6 98 21 bc b6 a8 83 93 96 f9 65 ab 6f f7 2a 70
    END_V2
    my $p=join("",map {chr(hex($_))} split /\s+/, $v1);
    my $q=join("",map {chr(hex($_))} split /\s+/, $v2);
    print `echo -n \'$p\'|md5sum`;
    print `echo -n \'$q\'|md5sum`;


    -- All code is 100% tested and functional unless otherwise noted.
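For the record, the same check can be done in-process with the core Digest::MD5 module, avoiding the unix-only echo/md5sum pipeline and its shell-quoting hazards; a sketch using the same two byte streams:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# The same two 128-byte streams as in the snippet above.
my $v1 = join ' ', qw(
    d1 31 dd 02 c5 e6 ee c4 69 3d 9a 06 98 af f9 5c
    2f ca b5 87 12 46 7e ab 40 04 58 3e b8 fb 7f 89
    55 ad 34 06 09 f4 b3 02 83 e4 88 83 25 71 41 5a
    08 51 25 e8 f7 cd c9 9f d9 1d bd f2 80 37 3c 5b
    d8 82 3e 31 56 34 8f 5b ae 6d ac d4 36 c9 19 c6
    dd 53 e2 b4 87 da 03 fd 02 39 63 06 d2 48 cd a0
    e9 9f 33 42 0f 57 7e e8 ce 54 b6 70 80 a8 0d 1e
    c6 98 21 bc b6 a8 83 93 96 f9 65 2b 6f f7 2a 70
);
my $v2 = join ' ', qw(
    d1 31 dd 02 c5 e6 ee c4 69 3d 9a 06 98 af f9 5c
    2f ca b5 07 12 46 7e ab 40 04 58 3e b8 fb 7f 89
    55 ad 34 06 09 f4 b3 02 83 e4 88 83 25 f1 41 5a
    08 51 25 e8 f7 cd c9 9f d9 1d bd 72 80 37 3c 5b
    d8 82 3e 31 56 34 8f 5b ae 6d ac d4 36 c9 19 c6
    dd 53 e2 34 87 da 03 fd 02 39 63 06 d2 48 cd a0
    e9 9f 33 42 0f 57 7e e8 ce 54 b6 70 80 28 0d 1e
    c6 98 21 bc b6 a8 83 93 96 f9 65 ab 6f f7 2a 70
);

my $p = join '', map { chr hex } split ' ', $v1;
my $q = join '', map { chr hex } split ' ', $v2;

print md5_hex($p), "\n";                      # 79054025255fb1a26e4bc422aef54eb4
print md5_hex($q), "\n";                      # the same digest...
print $p eq $q ? "identical\n" : "differ\n";  # ...from different byte streams
```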
Not having followed the recognition of MD5's weakness(es), it looks as if your two strings differ by the most significant bit of the 20th, 46th, etc. bytes. That looks like someone mathematically broke MD5. Now, wouldn't SHA-n have a similar problem, but with a much larger sample set? Or is it because the algorithm took liberties it shouldn't have taken?

      ------
      We are the carpenters and bricklayers of the Information Age.

      Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

      I shouldn't have to say this, but any code, unless otherwise stated, is untested

SHA-0 was shown to be weak by a similar technique, as was a reduced form of SHA-1 (40 rounds instead of 80, IIRC). Since such discoveries tend to prompt further discoveries along the same lines, there is cause to be distrustful of SHA-1. Don't Panic, but be distrustful.

        Basically, this is a good time to come up with a totally new hash algorithm, since most of the existing ones are based on MD4.

        "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

Re: On showing the weakness in the MD5 digest function and getting bitten by scalar context
by perlfan (Curate) on Aug 27, 2004 at 00:14 UTC
Re: On showing the weakness in the MD5 digest function and getting bitten by scalar context
by tachyon (Chancellor) on Aug 27, 2004 at 08:14 UTC

    To find another string that maps to the same digest (i.e. a collision in the MD5-space) should theoretically take 2^128 steps.

Actually, that is both right and wrong. Hash collisions conform to the Birthday Paradox, which says that if you have 23 people in a room there is a 50% probability that two of them share a birthday (a collision). This 'paradox' applies to hashes as well: for example, you have a 50% probability of a collision in a 32-bit hash after a mere 77,163 tries (roughly 2^16, i.e. nothing like 2^32).

    Update

    Corrected technical inexactitude. The birthday paradox is the correct answer if you simply want a collision. To get a specific collision with a given digest it is not.

    cheers

    tachyon

Not a paradox, merely simple probability (though it does assume that people are born with equal likelihood throughout the year). It earns the name "paradox" only because the number of people needed seems far smaller than you'd expect. The trick is whether you're looking for any two birthdays in the room to match, or for a specific match to your own birthday. For this node, I will consider the former.

Something will happen, or it won't: the probability of one or the other is 1.0. So the probability that some pair shares a birthday is 1 minus the probability that no pair does.

      This would make a nice practice program, but you will probably need to use logarithms to perform the calculations if the numbers start to overflow.

      Hope this helped,
      -v
      "Perl. There is no substitute."
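Taking up the practice-program suggestion above: a sketch that computes the probability of at least one shared value among n samples drawn from a space of size N, summing logarithms so the running product doesn't underflow (the 77,163 figure for a 32-bit space comes from the node above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# P(no collision among $n samples) = prod_{k=0}^{n-1} ($space - k) / $space.
# Summing logs instead of multiplying keeps large spaces from underflowing.
sub p_collision {
    my ($n, $space) = @_;
    my $log_none = 0;
    $log_none += log( ($space - $_) / $space ) for 0 .. $n - 1;
    return 1 - exp($log_none);
}

printf "23 people, 365 days:     %.4f\n", p_collision(23, 365);       # ~0.5073
printf "77163 tries, 2^32 space: %.4f\n", p_collision(77_163, 2**32); # ~0.5000
```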
Technically, you could find a collision on the first try using a Monte Carlo method, i.e. random guessing. The chance is extremely small, but not zero. This is the case for *all* hashing, no matter how large the hash space is.
      Edit: tachyon updated to fix. Leaving my comment for posterity, but it no longer applies.

      Sorry Tachyon, but you're wrong. In roughly 2^16 tries, you've got 50% odds of finding a collision - not 50% odds of finding a collision with a given result. So you have 50% odds of two numbers in the 2^16 tries colliding, but you do not have 50% odds of a number in the 2^16 tries colliding with the result of one specific output.

      You've mistaken "collision with a given point" with "collision between any two points." ++ for bringing up the birthday paradox, anyways.

        Of course you are right about finding a specific collision. I have had non specific collisions on my mind recently.

        cheers

        tachyon

Re: On showing the weakness in the MD5 digest function and getting bitten by scalar context
by zeitgheist (Novice) on Aug 27, 2004 at 08:55 UTC
    Take a look at this also (circa 1996)
Re: On showing the weakness in the MD5 digest function and getting bitten by scalar context
by ctilmes (Priest) on Aug 27, 2004 at 12:12 UTC
    Using MD5 to verify file integrity is often used in conjunction with file compression such as gzip or bzip2.

Finding an arbitrary set of bits that result in the same MD5 hash (which has been shown to be possible) is one thing; finding a specific set of bits that produce the same MD5 hash and can also be successfully uncompressed is much harder. Compound that with being able to alter the code in a meaningful way that results in a malicious trojan horse, and the task is sufficiently difficult that I'm not terribly worried about it at this point.

      No doubt. There are a LOT of namby-pamby Chicken Littles running around crying about MD5's weaknesses.

      It's a HASH, for crying out loud. It's not meant to be provably perfect at identifying unique data streams.

      • It is clear that the MD5 digest cannot be unique for every data stream, since it is potentially far shorter than the original data;
      • It is clear that an endless variety of data streams will have the SAME digest, because you can construct pretty much any data stream you want;
      • It is clear that some yahoo might even be able to make a couple of "falsified" data streams which produce the same MD5 hash;
      • It is clear that the MD5 hash is as mathematically valid as it ever has been, since the algorithm hasn't changed at all.

      Say you were expecting message M, with hash H. You instead get message N which also happens to hash to H.

      • If you were expecting M to be about 20 MB, what are the chances of N being exactly, or even approximately, the same size?
      • If you were expecting M to be a tarball or other application data, what are the chances of N being decompressible or otherwise parsable? That is, that the falsified data also happens to conform to the protocol?
      • If you were expecting M to be executable, what are the chances of N being executable? That is, with no obviously broken execution flaws introduced?

      You're worried about MD5 digests for showing falsification of data, right? Where some attacker alters the message? I contend that it will be pretty darned hard to find a useful attack on a message while maintaining MD5 integrity.

      To Allied Commanders: Raiders Expected on Supply Lines in Sector 5. Keep on guard. --HQ
      To Allied Commanders: No Raids Reported on Supply Lines for Sector 5, 6, or 8. Let Freedom Reign. --HQ

      Until someone shows that you can (1) take any arbitrary data set M, (2) falsify it to data set N, by (3) modifying a limited portion of M in an application-useful way and (4) adding less than a gigabyte of additional data, and (5) still come out with M=>H and N=>H hash equivalence, I'll trust MD5, thanks.

      --
      [ e d @ h a l l e y . c c ]

        It's a HASH, for crying out loud. It's not meant to be provably perfect at identifying unique data streams.
        It IS meant to be computationally infeasible to find two messages with the same hash. If someone has found a way to do that, MD5 has failed at its original purpose.
        Until someone shows that you can (1) take any arbitrary data set M, (2) falsify it to data set N, by (3) modifying a limited portion of M in an application-useful way and (4) adding less than a gigabyte of additional data, and (5) still come out with M=>H and N=>H hash equivalence, I'll trust MD5, thanks.
        This is because you are ignorant. "I can't imagine an attack, therefore there is no attack."
      You can add arbitrary bytes to the end of a gzipped file and it will still uncompress to the same thing. This is how the first illegal prime was discovered.
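That trailing-bytes property is easy to demonstrate with the core IO::Compress modules; a sketch (assuming, as the docs describe, that the one-shot gunzip interface stops at the end of the first gzip stream and ignores whatever follows):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my $original = "some payload\n";
gzip \$original => \my $gz or die "gzip failed: $GzipError";

# Append arbitrary bytes after the complete gzip stream.
my $tampered = $gz . "arbitrary trailing bytes";

# Decompression still yields the original payload; the extra
# bytes are treated as trailing data and ignored.
gunzip \$tampered => \my $out or die "gunzip failed: $GunzipError";
print $out eq $original ? "same output\n" : "different output\n";
```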
Re: On showing the weakness in the MD5 digest function and getting bitten by scalar context
by hardburn (Abbot) on Aug 27, 2004 at 13:11 UTC

    While it might be amusing to watch people run around in panic over this, I wish they'd stop. They've been told for years that they should avoid MD5; I'm just surprised this discovery wasn't made sooner.

    Also, I'm not sure on this point, but I don't think SHA-512 adds any security over SHA-1. It increases the size of the bitstream, which is useful for some applications, but finding collisions would take the same amount of time.

    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

      While it might be amusing to watch people run around in panic over this, I wish they'd stop. They've been told for years that they should avoid MD5; I'm just surprised this discovery wasn't made sooner.
      Hear, hear.
      Also, I'm not sure on this point, but I don't think SHA-512 adds any security over SHA-1. It increases the size of the bitstream, which is useful for some applications, but finding collisions would take the same amount of time.
      Not sure I follow you here. If the best possible collision-finding attack is brute force, shouldn't a longer output translate directly to more work? Are you suggesting that there is a better-than-brute-force attack against SHA-512? I'd have to say that it seems likely that one will be discovered someday. This MD5 discovery shows how much we still have to learn about constructing hash functions.
        Just for the record... yes, I work at NIST, but not in the Security division. However, I use hashes at the core of my work - http://www.nsrl.nist.gov/collision.html

        The official statement is below :
        http://csrc.nist.gov/
        http://csrc.nist.gov/hash_standards_comments.pdf

        NIST Brief Comments on Recent Cryptanalytic Attacks on Secure Hashing Functions and the Continued Security Provided by SHA-1

        Cryptographic hash functions that compute a fixed size message digest from arbitrary size messages are widely used for many purposes in cryptography, including digital signatures. At the recent Crypto2004 conference, researchers announced that they had discovered a way to "break" a number of hash algorithms, including MD4, MD5, HAVAL-128, RIPEMD and the long superseded Federal Standard SHA-0 algorithm. The current Federal Information Processing Standard SHA-1 algorithm, which has been in effect since it replaced SHA-0 in 1994, was also analyzed, and a weakened variant was broken, but the full SHA-1 function was not broken and no collisions were found in SHA-1. The results presented so far on SHA-1 do not call its security into question. However, due to advances in technology, NIST plans to phase out of SHA-1 in favor of the larger and stronger hash functions (SHA-224, SHA-256, SHA-384 and SHA-512) by 2010. SHA-1 and the larger hash functions are specified in FIPS 180-2. For planning purposes by Federal agencies and others, note also that the use of other cryptographic algorithms of similar strength to SHA-1 will also be phased out in 2010.

        shouldn't a longer output translate directly to more work? Are you suggesting that there is a better-than-brute-force attack against SHA-512?

I wasn't sure, but as I recalled, SHA-512 is useful when you need 512 bits of information, but for something other than security reasons. It'd be no more secure than if you had taken the original data, hashed it, flipped a bit in the hash, hashed that, and then concatenated the two hashes into a value twice the size of the original hash. It's bigger, but you could still cryptanalyze the hash with as much work as it would take at the original hash size.

However, a lot of other things I've read seem to contradict what I thought I knew; SHA-512 might really be that much more secure, at least as far as brute-forcing goes.

        "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.
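If brute force is the only attack, the expected work does scale with output size: by the birthday bound, a collision search on an n-bit hash takes on the order of 2^(n/2) attempts. A quick sketch of the digest sizes with Digest::SHA (core as of perl 5.10):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA;

# Birthday bound: a brute-force collision search on an n-bit hash needs
# roughly 2^(n/2) attempts, so a longer digest means more work unless
# an analytic shortcut (like the new MD5 attack) is found.
for my $alg (1, 256, 512) {
    my $bits = 8 * length Digest::SHA->new($alg)->add('message')->digest;
    printf "SHA-%-3d -> %3d bits, ~2^%d brute-force collision work\n",
        $alg, $bits, $bits / 2;
}
```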

Re: On showing the weakness in the MD5 digest function and getting bitten by scalar context
by zentara (Archbishop) on Aug 27, 2004 at 14:15 UTC
I'm no expert on md5 hashes, but from a practical viewpoint, does this really diminish the usefulness of the md5 hash? Say you can somehow spoof the md5 hash of some binary: will that 'spoofed' binary be able to do anything useful, like introduce trojans? I doubt it. Very arbitrary changes must be made to the binary to get it to match the md5sum, and the chance of that change being "useful" is astronomically small.

It still boils down to the fact that the main weakness of md5sums is the security of the database where you store them.


    I'm not really a human, but I play one on earth. flash japh

If you're thinking about this from a mathematical point of view, then this is a very big deal. Cryptographic hashes have the properties of being 1) hard to reverse, and 2) hard to find a collision for. Since #2 is now violated, any algorithm that assumed #2 is now broken (as well as any algorithm built on those algorithms, and so on).

      Does it change practical uses of hashes? Maybe. It depends on your application.

      "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

      will that 'spoofed' binary be able to do anything useful, like introduce trojans?
All bets are off now. Before, it was considered practically impossible to calculate a collision in MD5-hash-space. Now it has been shown that this is not the case, and that you can do so in (practically speaking) finite time.

      Nobody knows if the colliding datastreams will have any useful content, but nobody can tell you the opposite either and in matters of security, you assume the worst (but hope for the best).

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Just thought I would point you to the following: http://www.win.tue.nl/hashclash/SoftIntCodeSign/ . Granted, they are trivial programs, but it does show the possibility of collisions in binaries. -Dustin

Node Type: perlmeditation [id://386193]
Approved by gmax
Front-paged by gmax