http://www.perlmonks.org?node_id=991178


in reply to [OT]:Faster signature algorithm than md5?

As noted, the speed of MD5ing files, is dominated my the file IO. So the easiest way fo speeding up the digest is to do less IO

Here's a simple method of speeding up the process. Instead of doing a full MD5 of the file each time, using this subroutine which combines the files length, with the 4k block at the start of every 1MB block in the file:

sub quickMD5 { my $fh = shift; my $md5 = new Digest::MD5->new; $md5->add( -s $fh ); my $pos = 0; until( eof $fh ) { seek $fh, $pos, 0; read( $fh, my $block, 4096 ) or last; $md5->add( $block ); $pos += 1024**2; } return $md5; }

This runs ~4 times faster than a full md5. It will however, statistically, give false positives -- say that two files are identical when they are actually different -- in approximately 1% of cases.

So, when the quickMD5 says two files are the same, you must then run the real MD5 on both files. But since you only have to do that in 1% of cases, and you sped up the other 99% by 4 times, the overall effect is to speed up the process with no loss of accuracy.

(Note: It should go without saying that this is no good for files less than 1MB at all, and you should probably use a proper MD5 for any file < 10MB. but MD5 doesn't take too long on small files anyway.)

A simple test script:

And a couple of test runs comparing the speed of both algorithms. The apparently eclectic ordering of the runs is to ensure that subsequent runs against the same file don't benefit from the "hot cache" effect:

C:\test>md5t 500MB.csv Processing 500MB.csv : 536870913 bytes Full MD5 took 6.350180 seconds Full MD5: 3c81ccb7d2d7febc96c92b4d7dd4c797 C:\test>quickMD5 25GB.csv Processing 25GB.csv : 26843545600 bytes Partial MD5 took 88.150608 seconds Partial MD5: 408ee1dabed25f1fbe4b25511c8b8287 C:\test>quickMD5 500MB.csv Processing 500MB.csv : 536870913 bytes Partial MD5 took 2.081898 seconds Partial MD5: 894c2792caeaac64072d7189d5724ecc C:\test>md5t 25GB.csv Processing 25GB.csv : 26843545600 bytes Full MD5 took 302.419120 seconds Full MD5: 24ce5b913f2f49876f0f24031b9b5d9b

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

RIP Neil Armstrong