Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses

[OT]:Faster signature algorithm than md5?

by swampyankee (Parson)
on Sep 01, 2012 at 15:07 UTC ( #991162=perlquestion: print w/replies, xml ) Need Help??
swampyankee has asked for the wisdom of the Perl Monks concerning the following question:

I've written a little script which will check for duplicate files by walking down my file system. That part's no problem. What is a problem is how long it takes to get md5 signatures of large files, a couple of which are zipped tar files I'm using for backups.

Right now, I'm simply skipping files that are too big (2**24 bytes or larger), which is inelegant.

So, question 1 is how does md5's execution time scale with file size? (I would expect linearly, but I'm not sure)

Question 2: Is there a similarly reliable but quicker algorithm to get a file's signature?

I'm using the md5 program that came with my computer, which is a MacBook with a 2.1 Ghz Intel Core 2 processor, 1 BG RAM, and Mac OS X 10.7.4 (don't laugh; it was free ;))

Information about American English usage here and here. Floating point issues? Please read this before posting. — emc

  • Comment on [OT]:Faster signature algorithm than md5?

Replies are listed 'Best First'.
Re: [OT]:Faster signature algorithm than md5?
by davido (Archbishop) on Sep 01, 2012 at 17:17 UTC

    1. MD5 scales linearly (so does SHA-1, by the way).
    2. You will trade reliability for speed, I'm afraid. While adding up all the bytes is also a linear algorithm, it's a faster algorithm, but quite unreliable for your purposes.

    Rather than skipping large files, just limit the size of the portion of the file you calculate an MD5 on. Limit it to, for example, 2^16. You will have false-positive collisions. But they will be relatively infrequent. For each of collisions, go back and hash the entire file.

    Now in the unlikely event that you get a second round of collisions, you can do a byte by byte comparison. So the steps would be:

    1. For each file read the first 64KB or 128KB ($/ will be helpful) and generate a hash.
    2. For each collision read the entire file and generate a hash.
    3. For each whole-file-hash collision, do a byte by byte comparison.

    Git uses the SHA-1 algorithm to uniquely identify commits, as well as objects under its purview. I think that SHA-1 was chosen over MD5 because it has a smoother distribution (less likely for pathological datasets to have a higher than average possibility of causing a collision). Though still linear, it's a slower algorithm than MD5. But if you intend to follow the suggestion of only hashing the first 64, 128, or 256k of each file on first pass, speed shouldn't be such an issue, and reliability improves to the point that you may never have to do a byte-by-byte comparison as a third pass.

    If you switch to the slower SHA1 algorithm, and follow the steps above, I doubt you would ever exercise step three except for when you come across truly identical files, which would produce a collision at step three also. Theoretically it's possible to get a SHA1 false positive collision, but the distribution is pretty smooth, and 1/1000000000000000000000000000000000000000000000 is a really small probability.


Re: [OT]:Faster signature algorithm than md5? (4x faster)
by BrowserUk (Pope) on Sep 01, 2012 at 17:55 UTC

    As noted, the speed of MD5ing files, is dominated my the file IO. So the easiest way fo speeding up the digest is to do less IO

    Here's a simple method of speeding up the process. Instead of doing a full MD5 of the file each time, using this subroutine which combines the files length, with the 4k block at the start of every 1MB block in the file:

    sub quickMD5 { my $fh = shift; my $md5 = new Digest::MD5->new; $md5->add( -s $fh ); my $pos = 0; until( eof $fh ) { seek $fh, $pos, 0; read( $fh, my $block, 4096 ) or last; $md5->add( $block ); $pos += 1024**2; } return $md5; }

    This runs ~4 times faster than a full md5. It will however, statistically, give false positives -- say that two files are identical when they are actually different -- in approximately 1% of cases.

    So, when the quickMD5 says two files are the same, you must then run the real MD5 on both files. But since you only have to do that in 1% of cases, and you sped up the other 99% by 4 times, the overall effect is to speed up the process with no loss of accuracy.

    (Note: It should go without saying that this is no good for files less than 1MB at all, and you should probably use a proper MD5 for any file < 10MB. but MD5 doesn't take too long on small files anyway.)

    A simple test script:

    And a couple of test runs comparing the speed of both algorithms. The apparently eclectic ordering of the runs is to ensure that subsequent runs against the same file don't benefit from the "hot cache" effect:

    C:\test>md5t 500MB.csv Processing 500MB.csv : 536870913 bytes Full MD5 took 6.350180 seconds Full MD5: 3c81ccb7d2d7febc96c92b4d7dd4c797 C:\test>quickMD5 25GB.csv Processing 25GB.csv : 26843545600 bytes Partial MD5 took 88.150608 seconds Partial MD5: 408ee1dabed25f1fbe4b25511c8b8287 C:\test>quickMD5 500MB.csv Processing 500MB.csv : 536870913 bytes Partial MD5 took 2.081898 seconds Partial MD5: 894c2792caeaac64072d7189d5724ecc C:\test>md5t 25GB.csv Processing 25GB.csv : 26843545600 bytes Full MD5 took 302.419120 seconds Full MD5: 24ce5b913f2f49876f0f24031b9b5d9b

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

Re: [OT]:Faster signature algorithm than md5?
by zentara (Archbishop) on Sep 01, 2012 at 16:06 UTC
    For speed, you might do simple checks first, and only do an md5 check if the file size and simple crc check is identical. You can also test just the first few thousand bytes of a file, instead of running crc or md5 on the entire file.
    #!/usr/bin/perl use warnings; use strict; # same as "sum -s file" use System V sum algorithm, use 512 bytes bl +ocks my $buf; open (FH,"< $0"); read( FH, $buf, -s FH ); close FH; my $checksum = unpack("%32C*",$buf) %32767; print "$checksum\n"; exit;

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
Re: [OT]:Faster signature algorithm than md5?
by philiprbrenan (Monk) on Sep 01, 2012 at 15:33 UTC

    Given that using MD5 in this manner is a probabilistic method of determing whether two files are the same or not, why not just use file length as the determinator of possible equality and then compare the possibly equal files to make an absolute determination?

Re: [OT]:Faster signature algorithm than md5?
by RichardK (Parson) on Sep 01, 2012 at 15:54 UTC

    I would expect the file access time to dominate everything else, so choice of hash won't matter too much.

    You could try benchmarking different ones to see, maybe sha256 & md5?

    There was an interesting thread at high speed checksum for video finger printing?. You could possibly use a similar technique for your larger files, but you've still got to read it off the disk.

    OTOH you could just use rsync and go get a coffee :)

Re: [OT]:Faster signature algorithm than md5?
by Kenosis (Priest) on Sep 01, 2012 at 16:48 UTC
Re: [OT]:Faster signature algorithm than md5?
by swampyankee (Parson) on Sep 02, 2012 at 01:46 UTC

    Thanks everybody. I hadn't even thought to check file size before comparing file before comparing md5 signatures, nor had I thought to take a checksum of a good-sized chunk file as a second screen.

    A big "thank you!" to all of you.

    Information about American English usage here and here. Floating point issues? Please read this before posting. — emc

Re: [OT]:Faster signature algorithm than md5?
by Anonymous Monk on Sep 17, 2012 at 17:21 UTC

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://991162]
Approved by marto
Front-paged by Corion
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (3)
As of 2018-06-19 21:05 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (115 votes). Check out past polls.