in reply to [OT]:Faster signature algorithm than md5?
As noted, the speed of MD5ing files, is dominated my the file IO. So the easiest way fo speeding up the digest is to do less IO
Here's a simple method of speeding up the process. Instead of doing a full MD5 of the file each time, using this subroutine which combines the files length, with the 4k block at the start of every 1MB block in the file:
sub quickMD5 { my $fh = shift; my $md5 = new Digest::MD5->new; $md5->add( -s $fh ); my $pos = 0; until( eof $fh ) { seek $fh, $pos, 0; read( $fh, my $block, 4096 ) or last; $md5->add( $block ); $pos += 1024**2; } return $md5; }
This runs ~4 times faster than a full md5. It will however, statistically, give false positives -- say that two files are identical when they are actually different -- in approximately 1% of cases.
So, when the quickMD5 says two files are the same, you must then run the real MD5 on both files. But since you only have to do that in 1% of cases, and you sped up the other 99% by 4 times, the overall effect is to speed up the process with no loss of accuracy.
(Note: It should go without saying that this is no good for files less than 1MB at all, and you should probably use a proper MD5 for any file < 10MB. but MD5 doesn't take too long on small files anyway.)
A simple test script:
And a couple of test runs comparing the speed of both algorithms. The apparently eclectic ordering of the runs is to ensure that subsequent runs against the same file don't benefit from the "hot cache" effect:
C:\test>md5t 500MB.csv Processing 500MB.csv : 536870913 bytes Full MD5 took 6.350180 seconds Full MD5: 3c81ccb7d2d7febc96c92b4d7dd4c797 C:\test>quickMD5 25GB.csv Processing 25GB.csv : 26843545600 bytes Partial MD5 took 88.150608 seconds Partial MD5: 408ee1dabed25f1fbe4b25511c8b8287 C:\test>quickMD5 500MB.csv Processing 500MB.csv : 536870913 bytes Partial MD5 took 2.081898 seconds Partial MD5: 894c2792caeaac64072d7189d5724ecc C:\test>md5t 25GB.csv Processing 25GB.csv : 26843545600 bytes Full MD5 took 302.419120 seconds Full MD5: 24ce5b913f2f49876f0f24031b9b5d9b
|
---|