PerlMonks  

Using MD5 and the theory behind it

by r.joseph (Hermit)
on Jan 10, 2001 at 05:34 UTC ( [id://50831]=perlquestion )

r.joseph has asked for the wisdom of the Perl Monks concerning the following question:

I have wanted to use Digest::MD5 for a lot of things lately, mainly authenticating a source (making sure it is from me or one of my programs), and I have read through the perldocs on the module, but I want to know the specifics of how MD5 works and how I would use it in my situation. I am not really asking for code here, but more for an explanation of what MD5 is good for and how it is used. Can anyone point me to some good FAQs on the 'net about how MD5 is used?

Basically, I want to know what MD5 is good for and how it works from a theoretical standpoint, and how all the parts of the puzzle fit together. I understand that this is not in fact a Perl question, but I figured that there are many smart people inhabiting the monastery, so I should get some highly intelligent and helpful responses. Thanks everyone!! R.Joseph

Replies are listed 'Best First'.
Re: Using MD5 and the theory behind it
by arturo (Vicar) on Jan 10, 2001 at 06:46 UTC

    What MD5 does: takes some data, and turns it into a 128-bit 'digest' (also 'hash' or 'sum'). The 'hashing' is, for most practical purposes, one-way: given the digest, it is at least computationally prohibitive to reconstruct the original data (and different sets of data can have the same digest; so even if you found one that worked, you would have no reason to think it was what originally went in).

    Think of the digest as the data's fingerprint, and you'll have the basic idea.
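
    A minimal sketch of the fingerprint idea, using the functional interface of Digest::MD5. The input string is just an illustration:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5 md5_hex);

my $data   = "hello world";
my $raw    = md5($data);       # the digest as 16 raw bytes (128 bits)
my $digest = md5_hex($data);   # the same digest as 32 hex characters

printf "%d bytes, hex form: %s\n", length($raw), $digest;
```

    Whatever the size of the input, the digest always has the same length.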

    the gory details

    A lot of systems store md5 digests of passwords: you can't reconstruct the password if you know the md5'd 'hash' of it. So instead of having a file that stores all of your users' passwords in plain text, you store the MD5'd versions of them. Then, on login, you md5 what they type in and compare that value to what's stored in your password file. Note: this isn't a perfect system -- if someone steals your password file, they can 'brute force' your users' passwords by cycling through all possible strings, taking their md5 'sums' (digests) and comparing the values they get to what's in the file. You make this harder by putting non-alphanumeric characters and a mixture of cases in your passwords -- this increases the number of possibilities they have to loop through, making the 'brute forcing' more computationally expensive.
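
    The login check described above could be sketched like this. The %password_file hash and the check_login() name are made up for illustration; a real system would read the digests from wherever it keeps them:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical in-memory "password file": login name => MD5 digest of the
# password. Only the digest is stored, never the password itself.
my %password_file = (
    alice => md5_hex('Tr1cky!Pass'),
);

# Returns 1 if the typed password hashes to the stored digest, 0 otherwise.
sub check_login {
    my ($user, $typed) = @_;
    my $stored = $password_file{$user};
    return 0 unless defined $stored;            # unknown user
    return md5_hex($typed) eq $stored ? 1 : 0;
}

print check_login('alice', 'Tr1cky!Pass') ? "ok\n" : "denied\n";
```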

    Another use is verifying that a bunch of data hasn't been modified without copying the data verbatim. Say, for example, you download suspicious tarball foo.tar.gz, and you want to know whether it's the *real* foo.tar.gz; you put it through the md5 algorithm and compare the result to the md5 "signature" you got from a *trusted* source, i.e. one you know was generated by the person who distributed the genuine foo.tar.gz. If the digests match, you can be certain for all reasonable purposes that you've got the real deal.

    Similarly: if you've been cracked, and you want to know whether crucial files on your system were modified, you would have (before you were cracked) made md5 digests out of your crucial system binaries (e.g. everything in /usr/bin/ on a *nix system), and stored those digests in a secure location. You could then run a check: if the result obtained by running MD5 on the binary doesn't agree with the stored value, then the file's been modified. (I believe the tripwire security utility uses such a method).
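
    Both checks come down to digesting a file and comparing the result to a value you already trust. A rough sketch, where file_md5_hex() and file_matches() are names invented for this example:

```perl
use strict;
use warnings;
use Digest::MD5;

# Returns the hex MD5 digest of a file's contents.
sub file_md5_hex {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    binmode($fh);                                   # digest the raw bytes
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $hex;
}

# True if the file's digest equals one obtained from a trusted source.
sub file_matches {
    my ($path, $trusted_hex) = @_;
    return file_md5_hex($path) eq lc $trusted_hex;
}
```

    For the tarball case you would call something like file_matches('foo.tar.gz', $digest_from_trusted_source); for the system-binaries case you would loop over the stored path/digest pairs.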

    On the web, you might use MD5 to verify a time-limited login. When a user logs in, you make an md5 'hash' (digest) out of their login name, password, a timestamp, and some secret key only you know, and set that as the value of a browser cookie. Then, on each request made by a user, you can make sure the user hasn't just copied over an old cookie (i.e. hasn't skipped your login procedure), and isn't some evil person trying to steal the legitimate user's identity by reading the user's cookie file and stealing the user's password. You do this by comparing the value of the cookie to the value you compute on the spot out of the user's password, user name, legit timestamp, and the secret value.

    (notice, this also keeps you from passing passwords over the connection in cleartext on every request: the contents of a cookie can be read by a clever enough cracker. But the md5 digest won't do them any good as far as stealing passwords (unless they're sniffing during the login process!))
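
    One possible shape for that cookie scheme, as a sketch only. It assumes the timestamp travels in the cookie in clear text next to the digest, that the server can look up the user's name and password when a request arrives, and that logins expire after an hour; $SECRET, $MAX_AGE and the sub names are all made up here:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my $SECRET  = 'server-side secret';   # hypothetical; known only to the server
my $MAX_AGE = 3600;                   # assumed login lifetime, in seconds

# Digest over login name, password, timestamp and the secret key.
sub login_digest {
    my ($user, $password, $timestamp) = @_;
    return md5_hex(join "\0", $user, $password, $timestamp, $SECRET);
}

# Cookie value handed out at login: "timestamp:digest".
sub make_cookie_value {
    my ($user, $password, $now) = @_;
    return "$now:" . login_digest($user, $password, $now);
}

# On each request: recompute the digest and reject stale or altered cookies.
sub cookie_is_valid {
    my ($value, $user, $password, $now) = @_;
    my ($timestamp, $digest) = split /:/, $value, 2;
    return 0 unless defined $digest && $timestamp =~ /^\d+$/;
    return 0 if $now - $timestamp > $MAX_AGE;            # login has expired
    return $digest eq login_digest($user, $password, $timestamp) ? 1 : 0;
}
```

    Because the timestamp is part of the digest, editing it in the cookie to extend a login makes the comparison fail.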

    Philosophy can be made out of anything. Or less -- Jerry A. Fodor

      "On the web, you might use MD5 to verify a time-limited login. When a user logs in, you make an md5 'hash' (digest) out of their login name, password, a timestamp, and some secret key only you know, and set that as a value of a browser cookie...."

      Why do this kind of terrible thing?
      On my site, I just generate a random session ID and set that as a cookie. On the server side I make an association of this session ID with the user's login name and a 'last access' timestamp. When the user returns, I just check the validity of the session ID by following the association.

      This has several advantages:

      1. I do not have to compute (slow) MD5 on every request.
      2. The cookie value (session ID) is random, which means totally secure. It is not based on the user's password or username.
      3. This will accommodate any authentication scheme (e.g. X.509 certs), not just plain passwords.

        One way I've used hashes is to set the verified user's cookie to be something like:

        $cookie = $user_id . $delimiter . hash( $user_id, "host secret password" );

        So on subsequent user accesses, all I need to do is split on the $delimiter and run $user_id, "host secret password" through the hashing algorithm and compare against the hash in the cookie to verify the user.
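
        The verification step might look roughly like this. md5_hex() stands in for the generic hash() above, and the ':' delimiter and the sub names are assumptions made for the sketch:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my $delimiter   = ':';                      # assumed; must not occur in a user id
my $host_secret = 'host secret password';

# Builds the cookie value: user id, delimiter, hash of id plus secret.
sub make_user_cookie {
    my ($user_id) = @_;
    return $user_id . $delimiter . md5_hex($user_id, $host_secret);
}

# Returns the user id if the cookie's hash checks out, undef otherwise.
sub verify_user_cookie {
    my ($cookie) = @_;
    my ($user_id, $hash) = split /\Q$delimiter\E/, $cookie, 2;
    return undef unless defined $hash;
    return $hash eq md5_hex($user_id, $host_secret) ? $user_id : undef;
}
```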

        I haven't looked at the code at everydevel, but it looks like perlmonks does something similar to this.

        This trades a (slow) database access for a (slow) hash computation, so I'm not sure if there's a real winner (or if there is, it'll be system-dependent.) Just another option to consider ...

        Your method is not 'totally secure' because you have to store the nonce in a database. If you generate a SID from an MD5 digest based on user authentication information, this hash does not have to be stored. It can be generated when the cookie is inspected.

        Also, if you run a large site with millions of users, your source of entropy can be depleted quickly, negating any security you would have gained.

      Thanks a ton...your reply was exactly what I was looking for - practical applications to help me understand. I even printed it out, for future reference. Nothing important in this message, just wanted to say thanks.
      R.Joseph
Re: Using MD5 and the theory behind it
by lhoward (Vicar) on Jan 10, 2001 at 06:19 UTC
    MD5 (and other one-way hash functions like CRC32) are designed to take in a string and convert it to a shorter string, kind of a fingerprint of the original string. Different one-way hash functions produce fingerprints of different lengths. But the following criteria should hold for all good one-way hash functions:
    • you can not learn anything about the input string by examining its fingerprint except for the fact that it has that fingerprint
    • a small change (even a single bit) in the input string should cause a dramatic change in the output of the hash function
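
    The second criterion is easy to see for yourself. In this small sketch the two inputs differ only by a trailing period:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my $one = md5_hex("The quick brown fox jumps over the lazy dog");
my $two = md5_hex("The quick brown fox jumps over the lazy dog.");

# Count the hex positions at which the two digests differ.
my $differing = grep { substr($one, $_, 1) ne substr($two, $_, 1) } 0 .. 31;

print "$one\n$two\n$differing of 32 hex digits differ\n";
```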

    I deal with a good bit of datacomm and file transfers. I use MD5 to identify when I have received suspect duplicate files. I keep a DB table with the MD5 values of all the files that have been transmitted to me. Whenever I get a new file, I compare its MD5 value to those stored in the table. If the value is not in the table, I process the file and store its MD5 value in the table. If the value is in the table I set the file aside for special handling and notify an operator.

    If you really want to learn about exactly how MD5 (and other hash algorithms) work, I recommend checking out Applied Cryptography by Bruce Schneier.
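
    A rough sketch of that duplicate check, with a plain hash standing in for the DB table. %seen and check_incoming() are names made up for the example:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %seen;   # stand-in for the DB table: digest => name of the first file seen

# Returns 'process' for content not seen before, 'duplicate' otherwise.
sub check_incoming {
    my ($name, $contents) = @_;
    my $digest = md5_hex($contents);
    return 'duplicate' if exists $seen{$digest};   # set aside, notify operator
    $seen{$digest} = $name;                        # remember it for next time
    return 'process';
}
```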

      You say that you 'compare its MD5' value to the values in a table. How do you get an MD5 value for a file? What exactly do you mean by this process (I believe that this process is very similar to the one that I am attempting). Thanks for the help!

        For reasonable-sized files (ones that fit comfortably in system memory): load the file's contents into a perl scalar, say $foo. Then $fingerprint = md5($foo);
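
        Spelled out, the load-then-digest approach might look like this sketch. slurp_md5_hex() is an invented name, and md5_hex() is used in place of md5() only so that the result is printable:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Reads the whole file into a scalar, then digests it in one call.
sub slurp_md5_hex {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    binmode($fh);
    my $foo = do { local $/; <$fh> };   # undef $/ reads the file in one go
    close($fh);
    return md5_hex($foo);
}
```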

        If you look through the documentation you have for it, you'll get some advice on other methods; e.g. (the object-oriented versions) :

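        use Digest::MD5;

        my $file = "/file/to/hash";
        open(my $fh, '<', $file) or die "Can't open $file: $!";
        binmode($fh);

        my $md5 = Digest::MD5->new();
        $md5->addfile($fh);              # addfile() wants a filehandle, not a filename
        $md5->add("seekrit passwerd");   # not the best choice for one, but ...
        my $digest = $md5->digest;
        close($fh);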

        I got this straight out of the docs, more or less. HTH

        Philosophy can be made out of anything. Or less -- Jerry A. Fodor

Example use of MD5: making a MAC
by saucepan (Scribe) on Jan 10, 2001 at 11:19 UTC
    Here's a concrete example of using MD5 to create a message authentication code.

    Say you are writing a CGI script that plays a game with the user. You want to keep score, and give a prize of some kind to the first player who wins 100 games.

    You could keep a list of players and their scores on the server, but this is complex and costly if there are a very large number of players (or a large number of them playing at once), and you don't want to waste server side storage on the vast majority of users who are expected to play two or three games and then give up.

    It would be nice if you could keep their current score in a cookie, but then what is to stop someone from editing their cookies.txt file and setting their score to 99? This is where the MAC comes in:

    use Digest::MD5;

    # Given a message and key, returns a message authentication code
    # with the following properties relevant to our example:
    #  - a 22-character string that may contain + / 0-9 a-z A-Z
    #  - any given message and key will always produce the same MAC
    #  - if you don't know the key, it's very hard to guess it
    #    even if you have a message, its MAC, and this source code
    #  - if you have a message, its MAC, and even the key, it's
    #    very hard to find a different message with the same MAC
    #  - even a tiny change to a message, including adding on to
    #    the end of it, will produce a very different MAC
    sub compute_mac {
        my ($message, $key) = @_;
        Digest::MD5::md5_base64($key, Digest::MD5::md5($key, $message));
    }

    # Load a secret key string from somewhere safe
    my $secret = 'skS>DrF1d:R-6<g8qmm7@Ml}?JQD1C';

    # Ensures that an integer score is decorated with its MAC
    sub authenticated_score {
        # Keep only the integer part, so that a value which already
        # carries a MAC comes back as "score/MAC" and not "score/MAC/MAC"
        my $score = int(shift);
        my $mac = compute_mac($score, $secret);
        "$score/$mac";
    }

    The authenticated_score() sub can be used to decorate a score with a code that's dependent upon both the score and your secret string. Just before you give out a score cookie to a player, run it through authenticated_score() to add the MAC:

    use CGI qw(:standard);   # imports header()
    use CGI::Cookie;

    my $score = 1;
    $score = authenticated_score($score);
    my $cookie = CGI::Cookie->new(-name => 'score', -value => $score);
    print header(-cookie => $cookie);

    Now, when someone presents a score cookie, you can check the MAC to see whether the score is one you handed out or an impostor:

    my %cookies = CGI::Cookie->fetch;
    $score = $cookies{score} ? $cookies{score}->value : 0;   # no cookie yet: start at 0

    # Eliminate any score that's been tampered with
    $score = 0 unless $score eq authenticated_score($score);

    Of course, a real program would probably want to do things in a different order:

    my %cookies = CGI::Cookie->fetch;
    my $score = $cookies{score} ? $cookies{score}->value : 0;
    $score = 0 unless $score eq authenticated_score($score);
    $score = int($score);   # drop the MAC, keep the number

    # (play game here, adding 1 to $score if this is a win)

    log_winner() if $score >= 100;

    $score = authenticated_score($score);
    my $cookie = CGI::Cookie->new(-name => 'score', -value => $score);
    print header(-cookie => $cookie);

    # (send the rest of your HTML to the player.)

    Hmm, this turned out to be kind of long for a comment. But I spent so long on it I'm going to post it anyway, right after I mention that in a real program you might want to use CGI::EncryptForm instead of doing all this work yourself. :)

Re: Using MD5 and the theory behind it
by lzcd (Pilgrim) on Jan 10, 2001 at 06:24 UTC
    I'd suggest wandering down to your local bookshop (yep, some things still aren't on the net ;) ) and picking up a copy of Bruce Schneier's book 'Secrets and Lies'.

    He also produces a monthly web-zine called Crypto-Gram that discusses such topics in 'real world' depth.

    RSA also has some decent explanations of the technology/maths supporting it in their Tech Library.

    Probably the best advice that seems to be floating around right at the moment is:
    - Know exactly what you're attempting to trust (as opposed to the very rare occasion of 'who')
    - Realise that no form of encryption is a perfect defence for all. If you assume that anything you create can be broken then you're more likely to build a better system.

    Enjoy.
Re: Using MD5 and the theory behind it
by gildir (Pilgrim) on Jan 10, 2001 at 14:21 UTC
Re: Using MD5 and the theory behind it
by knight (Friar) on Jan 10, 2001 at 17:41 UTC
    Another use: Cons and Cook, two alternatives to Make, use MD5 signatures to determine if the contents of a source file have changed since the last time a build was performed. This makes builds a lot more reliable than Make's use of file timestamps.

    If anyone does care to look at code to see how the MD5 internals work, there's a Digest::Perl::MD5 module that implements the algorithm in pure Perl.
Re: Using MD5 and the theory behind it
by mr.nick (Chaplain) on Jan 10, 2001 at 18:56 UTC
    I'll give you an example of usage from my own real-world. I maintain a database of MP3's that I use for broadcasting. Every 2 hours a perl script (loadmusic) runs through my MP3's looking for new, updated, moved and deleted files.

    To determine if an MP3 is the same, I use an MD5 checksum of the file. That way I can apply the following logic:

    • Same MD5, same directory + filename: the file hasn't changed
    • Same MD5, different directory + filename: the file is in the database, but has just moved to another location (no need to re-add it)
    • Different MD5, same directory + filename: the file has been updated (ID3 tags might have changed, or the last time it was scanned, it wasn't complete (download in progress from Napster)).
    • MD5 doesn't exist in database and filename + directory doesn't exist in database: new file!
    • A database record whose MD5 matches no file on disk and whose directory + filename no longer exists: the file has been deleted! Remove it from the DB
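
    The rules above could be sketched as follows. This is not the loadmusic script itself; the sub names and the hash-based stand-ins for the database are assumptions made for the example:

```perl
use strict;
use warnings;

# Classifies one file found on disk.
#   $db_by_md5:  hashref, MD5 => path recorded in the database
#   $db_by_path: hashref, path => MD5 recorded in the database
sub classify_file {
    my ($md5, $path, $db_by_md5, $db_by_path) = @_;
    if (exists $db_by_md5->{$md5}) {
        return $db_by_md5->{$md5} eq $path ? 'unchanged' : 'moved';
    }
    return exists $db_by_path->{$path} ? 'updated' : 'new';
}

# Database paths whose file is gone from disk and whose MD5 was not
# seen anywhere on disk either: these records can be removed.
sub deleted_paths {
    my ($disk_by_path, $db_by_path) = @_;   # both hashrefs: path => MD5
    my %on_disk = map { $_ => 1 } values %$disk_by_path;
    my @gone = grep {
        !exists $disk_by_path->{$_} && !$on_disk{ $db_by_path->{$_} }
    } keys %$db_by_path;
    return sort @gone;
}
```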

    So, I use it to "link" files on the HD to entries in the database. Since the MD5 sum is unique for every file, it works as the perfect identifier (ed.).

    In response to ichimunki: Absolutely correct! Of course what I meant to say was "virtually unique" :)

      Although I'm certain that this approach works, and will continue to work, MD5 sums are not unique for every file. If they were, this would be the ultimate compression algorithm (that is, if the MD5 were unique, you could use it to reverse engineer the file using only the hash, because each hash would have only one possible antecedent). The odds of two similar files having the same MD5 sum, however, are very low.
        Using one of these approximations, it looks like the probability of a birthday collision will finally hit 0.5 by about the time mr.nick has processed his 22 million million millionth MP3, so I'd agree that he has nothing to worry about for now. ;)
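
        That figure can be reproduced with the usual square-root approximation to the birthday problem. The sketch assumes MD5 digests behave like uniformly random 128-bit values, which is an idealisation:

```perl
use strict;
use warnings;

# With N equally likely digests, the chance of at least one collision
# among n files is roughly 1 - exp(-n**2 / (2 * N)). Setting that to
# 0.5 and solving for n gives n = sqrt(2 * ln(2) * N).
my $bits = 128;                                   # size of an MD5 digest
my $n    = sqrt(2 * log(2)) * 2 ** ($bits / 2);   # N = 2**128, so sqrt(N) = 2**64

printf "about %.2g files for a 50%% chance of a collision\n", $n;
```

        The result is on the order of 2.2e19, i.e. the "22 million million million" quoted above.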
