As noted, the speed of MD5ing files is dominated by the file IO. So the easiest way of speeding up the digest is to do less IO.
Here's a simple method of speeding up the process. Instead of doing a full MD5 of the file each time, use this subroutine, which combines the file's length with the 4k block at the start of every 1MB block in the file:
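    sub quickMD5 {
        my $fh  = shift;
        my $md5 = Digest::MD5->new;
        $md5->add( -s $fh );                       # mix in the file's length
        my $pos = 0;
        until( eof $fh ) {
            seek $fh, $pos, 0;                     # start of the next 1MB block
            read( $fh, my $block, 4096 ) or last;  # sample its first 4k
            $md5->add( $block );
            $pos += 1024**2;
        }
        return $md5;                               # Digest::MD5 object; call ->hexdigest on it
    }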
This runs ~4 times faster than a full MD5. It will, however, statistically give false positives -- say that two files are identical when they are actually different -- in approximately 1% of cases.
So, when the quickMD5 says two files are the same, you must then run the real MD5 on both files. But since you only have to do that in 1% of cases, and you sped up the other 99% by 4 times, the overall effect is to speed up the process with no loss of accuracy.
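The two-stage check described above might be sketched like this. It is only a sketch under my own assumptions: the names filesMatch and fullMD5 are made up for illustration, and I open the files with the :raw layer so that the bytes are hashed untranslated.

```perl
use strict;
use warnings;
use Digest::MD5;

# Sampled digest: file length plus the first 4k of every 1MB block.
sub quickMD5 {
    my $fh  = shift;
    my $md5 = Digest::MD5->new;
    $md5->add( -s $fh );
    my $pos = 0;
    until( eof $fh ) {
        seek $fh, $pos, 0;
        read( $fh, my $block, 4096 ) or last;
        $md5->add( $block );
        $pos += 1024**2;
    }
    return $md5;
}

# Full digest of every byte; only needed to confirm a sampled match.
# (fullMD5 is a hypothetical helper name, not from the original post.)
sub fullMD5 {
    my $fh = shift;
    seek $fh, 0, 0;                              # rewind after the sampling pass
    return Digest::MD5->new->addfile( $fh );
}

# Returns 1 if the two files have the same full MD5, else 0.
# The cheap tests run first, so most differing pairs never get a full read.
sub filesMatch {
    my( $pathA, $pathB ) = @_;
    open my $fhA, '<:raw', $pathA or die "$pathA: $!";
    open my $fhB, '<:raw', $pathB or die "$pathB: $!";

    return 0 if ( -s $fhA ) != ( -s $fhB );      # different lengths: done
    return 0 if quickMD5( $fhA )->hexdigest ne quickMD5( $fhB )->hexdigest;

    # The sampled digests agree; that could be a false positive, so confirm.
    return fullMD5( $fhA )->hexdigest eq fullMD5( $fhB )->hexdigest ? 1 : 0;
}
```

Called as filesMatch( 'a.csv', 'b.csv' ), it only pays for the full read when the sampled digests agree.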
(Note: It should go without saying that this is no good at all for files of less than 1MB, since only the length and the first 4k ever get sampled, and you should probably use a proper MD5 for any file < 10MB. But MD5 doesn't take too long on small files anyway.)
A simple test script:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];
    use Digest::MD5;

    sub quickMD5 {
        my $fh  = shift;
        my $md5 = Digest::MD5->new;
        $md5->add( -s $fh );                       # mix in the file's length
        my $pos = 0;
        until( eof $fh ) {
            seek $fh, $pos, 0;                     # start of the next 1MB block
            read( $fh, my $block, 4096 ) or last;  # sample its first 4k
            $md5->add( $block );
            $pos += 1024**2;
        }
        return $md5;
    }

    open FH, '<', $ARGV[0] or die $!;
    # Pass the name as an argument so a '%' in it can't corrupt the format
    printf "Processing %s : %u bytes\n", $ARGV[0], -s FH;
    my $start = time;
    my $qmd5 = quickMD5( *FH );
    printf "Partial MD5 took %.6f seconds\n", time() - $start;
    print "Partial MD5: ", $qmd5->hexdigest;
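The full-MD5 counterpart used for comparison (md5t) isn't shown here. A minimal sketch of what such a script might look like, assuming it simply feeds the whole file through Digest::MD5's addfile method; the sub name fullMD5, the binmode call and the @ARGV guard are my additions for illustration, not the original script:

```perl
use strict;
use warnings;
use Time::HiRes qw[ time ];
use Digest::MD5;

# Digest every byte of the file. addfile() reads the handle through to EOF
# and returns the Digest::MD5 object itself.
sub fullMD5 {
    my $fh = shift;
    binmode $fh;    # assumption: hash the raw bytes, with no newline translation
    return Digest::MD5->new->addfile( $fh );
}

# Only run the timing when a file name is given on the command line.
if( @ARGV ) {
    open my $fh, '<', $ARGV[0] or die "$ARGV[0]: $!";
    printf "Processing %s : %u bytes\n", $ARGV[0], -s $fh;
    my $start = time;
    my $md5   = fullMD5( $fh );
    printf "Full MD5 took %.6f seconds\n", time() - $start;
    print "Full MD5: ", $md5->hexdigest, "\n";
}
```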
And a couple of test runs comparing the speed of both algorithms. The apparently eclectic ordering of the runs is to ensure that subsequent runs against the same file don't benefit from the "hot cache" effect:

    C:\test>md5t 500MB.csv
    Processing 500MB.csv : 536870913 bytes
    Full MD5 took 6.350180 seconds
    Full MD5: 3c81ccb7d2d7febc96c92b4d7dd4c797

    C:\test>quickMD5 25GB.csv
    Processing 25GB.csv : 26843545600 bytes
    Partial MD5 took 88.150608 seconds
    Partial MD5: 408ee1dabed25f1fbe4b25511c8b8287

    C:\test>quickMD5 500MB.csv
    Processing 500MB.csv : 536870913 bytes
    Partial MD5 took 2.081898 seconds
    Partial MD5: 894c2792caeaac64072d7189d5724ecc

    C:\test>md5t 25GB.csv
    Processing 25GB.csv : 26843545600 bytes
    Full MD5 took 302.419120 seconds
    Full MD5: 24ce5b913f2f49876f0f24031b9b5d9b
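To put a number on the difference, the timings from these runs can simply be divided out. A small sketch; the figures are copied from the output above, and the %runs hash and speedup sub are just names I've picked for illustration:

```perl
use strict;
use warnings;

# Timings in seconds, copied from the four runs above.
my %runs = (
    '500MB.csv' => { full => 6.350180,   partial => 2.081898  },
    '25GB.csv'  => { full => 302.419120, partial => 88.150608 },
);

# Ratio of full-MD5 time to partial-MD5 time for one file.
sub speedup {
    my $r = shift;
    return $r->{full} / $r->{partial};
}

for my $file ( sort keys %runs ) {
    printf "%-10s full/partial = %.2f\n", $file, speedup( $runs{ $file } );
}
```

For these two files that comes out at 3.43 for 25GB.csv and 3.05 for 500MB.csv. The ratio will vary with the disk, the file size and how much of the file is already cached.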