 
PerlMonks  

[OT]:Faster signature algorithm than md5?

by swampyankee (Parson)
on Sep 01, 2012 at 15:07 UTC ( [id://991162] )

swampyankee has asked for the wisdom of the Perl Monks concerning the following question:

I've written a little script which checks for duplicate files by walking down my file system. That part's no problem. What is a problem is how long it takes to get md5 signatures of large files, a couple of which are zipped tar files I'm using for backups.

Right now, I'm simply skipping files that are too big (2**24 bytes or larger), which is inelegant.

So, question 1 is how does md5's execution time scale with file size? (I would expect linearly, but I'm not sure)

Question 2: Is there a similarly reliable but quicker algorithm to get a file's signature?

I'm using the md5 program that came with my computer, which is a MacBook with a 2.1 GHz Intel Core 2 processor, 1 GB RAM, and Mac OS X 10.7.4 (don't laugh; it was free ;))


Information about American English usage here and here. Floating point issues? Please read this before posting. — emc


Replies are listed 'Best First'.
Re: [OT]:Faster signature algorithm than md5?
by davido (Cardinal) on Sep 01, 2012 at 17:17 UTC

    1. MD5 scales linearly (so does SHA-1, by the way).
    2. You will trade reliability for speed, I'm afraid. Adding up all the bytes, for example, is also a linear algorithm and a faster one, but quite unreliable for your purposes.

    Rather than skipping large files, just limit the size of the portion of the file you calculate an MD5 on. Limit it to, for example, 2^16 bytes. You will have false-positive collisions, but they will be relatively infrequent. For each collision, go back and hash the entire file.

    Now in the unlikely event that you get a second round of collisions, you can do a byte by byte comparison. So the steps would be:

    1. For each file read the first 64KB or 128KB ($/ will be helpful) and generate a hash.
    2. For each collision read the entire file and generate a hash.
    3. For each whole-file-hash collision, do a byte by byte comparison.
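    A rough sketch of those three passes (not davido's code; the 64KB first-pass size and the sub names are illustrative):

    ```perl
    use strict;
    use warnings;
    use Digest::MD5;
    use File::Compare qw( compare );

    # Pass 1: hash only the first chunk of the file (64KB here).
    sub partial_md5 {
        my ( $path, $limit ) = @_;
        open my $fh, '<:raw', $path or die "$path: $!";
        read $fh, my $head, $limit;
        return Digest::MD5->new->add( $head )->hexdigest;
    }

    # Pass 2: hash the whole file.
    sub full_md5 {
        my ( $path ) = @_;
        open my $fh, '<:raw', $path or die "$path: $!";
        return Digest::MD5->new->addfile( $fh )->hexdigest;
    }

    # Pass 3: byte-by-byte confirmation via File::Compare
    # (compare() returns 0 when the files are identical).
    sub find_duplicates {
        my @files = @_;
        my ( %by_partial, @pairs );
        push @{ $by_partial{ partial_md5( $_, 64 * 1024 ) } }, $_ for @files;
        for my $group ( grep { @$_ > 1 } values %by_partial ) {
            my %by_full;
            push @{ $by_full{ full_md5( $_ ) } }, $_ for @$group;
            for my $dupes ( grep { @$_ > 1 } values %by_full ) {
                my ( $first, @rest ) = @$dupes;
                push @pairs, map { [ $first, $_ ] }
                             grep { compare( $first, $_ ) == 0 } @rest;
            }
        }
        return @pairs;
    }

    # Usage: my @pairs = find_duplicates( @list_of_paths );
    ```

    Only groups that collide on the cheap partial hash ever pay for a full-file read, so the expensive passes run on a small fraction of the files.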

    Git uses the SHA-1 algorithm to uniquely identify commits, as well as objects under its purview. I think that SHA-1 was chosen over MD5 because it has a smoother distribution (less likely for pathological datasets to have a higher than average possibility of causing a collision). Though still linear, it's a slower algorithm than MD5. But if you intend to follow the suggestion of only hashing the first 64, 128, or 256k of each file on first pass, speed shouldn't be such an issue, and reliability improves to the point that you may never have to do a byte-by-byte comparison as a third pass.

    If you switch to the slower SHA1 algorithm, and follow the steps above, I doubt you would ever exercise step three except for when you come across truly identical files, which would produce a collision at step three also. Theoretically it's possible to get a SHA1 false positive collision, but the distribution is pretty smooth, and 1/1000000000000000000000000000000000000000000000 is a really small probability.


    Dave

Re: [OT]:Faster signature algorithm than md5? (4x faster)
by BrowserUk (Patriarch) on Sep 01, 2012 at 17:55 UTC

    As noted, the speed of MD5ing files is dominated by the file IO. So the easiest way of speeding up the digest is to do less IO.

    Here's a simple method of speeding up the process. Instead of doing a full MD5 of the file each time, use this subroutine, which combines the file's length with the 4KB block at the start of every 1MB block in the file:

    sub quickMD5 {
        my $fh  = shift;
        my $md5 = Digest::MD5->new;
        $md5->add( -s $fh );
        my $pos = 0;
        until ( eof $fh ) {
            seek $fh, $pos, 0;
            read( $fh, my $block, 4096 ) or last;
            $md5->add( $block );
            $pos += 1024**2;
        }
        return $md5;
    }

    This runs ~4 times faster than a full md5. It will, however, statistically give false positives -- say that two files are identical when they are actually different -- in approximately 1% of cases.

    So, when the quickMD5 says two files are the same, you must then run the real MD5 on both files. But since you only have to do that in 1% of cases, and you sped up the other 99% by 4 times, the overall effect is to speed up the process with no loss of accuracy.

    (Note: It should go without saying that this is no good at all for files less than 1MB, and you should probably use a proper MD5 for any file < 10MB. But MD5 doesn't take too long on small files anyway.)
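    The quick-then-confirm idea might be sketched like this (a self-contained rework, not BrowserUk's code; the sub names are illustrative, and `quick_md5_hex` repeats the sampling scheme above as a path-taking wrapper):

    ```perl
    use strict;
    use warnings;
    use Digest::MD5;

    # Cheap first pass: file length plus a 4KB sample at every 1MB boundary.
    sub quick_md5_hex {
        my ( $path ) = @_;
        open my $fh, '<:raw', $path or die "$path: $!";
        my $md5 = Digest::MD5->new;
        $md5->add( -s $fh );
        my $pos = 0;
        until ( eof $fh ) {
            seek $fh, $pos, 0;
            read( $fh, my $block, 4096 ) or last;
            $md5->add( $block );
            $pos += 1024**2;
        }
        return $md5->hexdigest;
    }

    # Full digest, run only when the quick digests agree.
    sub full_md5_hex {
        my ( $path ) = @_;
        open my $fh, '<:raw', $path or die "$path: $!";
        return Digest::MD5->new->addfile( $fh )->hexdigest;
    }

    sub files_identical {
        my ( $a, $b ) = @_;
        return 0 unless quick_md5_hex( $a ) eq quick_md5_hex( $b );
        return full_md5_hex( $a ) eq full_md5_hex( $b ) ? 1 : 0;
    }
    ```

    Bytes that fall between the sampled 4KB blocks never reach the quick digest, which is exactly where the ~1% of false positives come from; the full pass catches them.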

    A simple test script:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];
    use Digest::MD5;

    sub quickMD5 {
        my $fh  = shift;
        my $md5 = Digest::MD5->new;
        $md5->add( -s $fh );
        my $pos = 0;
        until ( eof $fh ) {
            seek $fh, $pos, 0;
            read( $fh, my $block, 4096 ) or last;
            $md5->add( $block );
            $pos += 1024**2;
        }
        return $md5;
    }

    open FH, '<', $ARGV[0] or die $!;
    printf "Processing $ARGV[0] : %u bytes\n", -s FH;
    my $start = time;
    my $qmd5 = quickMD5( *FH );
    printf "Partial MD5 took %.6f seconds\n", time() - $start;
    print "Partial MD5: ", $qmd5->hexdigest;

    And a couple of test runs comparing the speed of both algorithms. The apparently eclectic ordering of the runs is to ensure that subsequent runs against the same file don't benefit from the "hot cache" effect:

    C:\test>md5t 500MB.csv
    Processing 500MB.csv : 536870913 bytes
    Full MD5 took 6.350180 seconds
    Full MD5: 3c81ccb7d2d7febc96c92b4d7dd4c797

    C:\test>quickMD5 25GB.csv
    Processing 25GB.csv : 26843545600 bytes
    Partial MD5 took 88.150608 seconds
    Partial MD5: 408ee1dabed25f1fbe4b25511c8b8287

    C:\test>quickMD5 500MB.csv
    Processing 500MB.csv : 536870913 bytes
    Partial MD5 took 2.081898 seconds
    Partial MD5: 894c2792caeaac64072d7189d5724ecc

    C:\test>md5t 25GB.csv
    Processing 25GB.csv : 26843545600 bytes
    Full MD5 took 302.419120 seconds
    Full MD5: 24ce5b913f2f49876f0f24031b9b5d9b

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

Re: [OT]:Faster signature algorithm than md5?
by zentara (Archbishop) on Sep 01, 2012 at 16:06 UTC
    For speed, you might do simple checks first, and only do an md5 check if the file size and a simple crc check are identical. You can also test just the first few thousand bytes of a file, instead of running crc or md5 on the entire file.
    #!/usr/bin/perl
    use warnings;
    use strict;
    # same as "sum -s file": System V sum algorithm, 512-byte blocks
    my $buf;
    open( FH, '<', $0 ) or die $!;
    read( FH, $buf, -s FH );
    close FH;
    my $checksum = unpack( '%32C*', $buf ) % 32767;
    print "$checksum\n";
    exit;
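    Checksumming only the first few thousand bytes, as suggested, might look like this (a sketch; the 8KB limit and the sub name are illustrative, the unpack checksum is the same one as above):

    ```perl
    use strict;
    use warnings;

    # System V-style checksum of at most the first $limit bytes of a file.
    sub head_checksum {
        my ( $path, $limit ) = @_;
        open my $fh, '<:raw', $path or die "$path: $!";
        read $fh, my $buf, $limit;
        close $fh;
        return unpack( '%32C*', $buf ) % 32767;
    }

    # Usage: checksum the first 8KB of this script itself.
    print head_checksum( $0, 8 * 1024 ), "\n";
    ```

    Because `read` stops at `$limit`, a multi-gigabyte file costs the same as an 8KB one on this first screen; only files that tie on size and checksum need a stronger hash.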

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
Re: [OT]:Faster signature algorithm than md5?
by philiprbrenan (Monk) on Sep 01, 2012 at 15:33 UTC

    Given that using MD5 in this manner is a probabilistic method of determining whether two files are the same or not, why not just use file length as the first test of possible equality, and then compare the possibly-equal files directly to make an absolute determination?
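    That approach might be sketched as follows (the sub names are illustrative; File::Compare does the byte-for-byte comparison):

    ```perl
    use strict;
    use warnings;
    use File::Compare qw( compare );

    # Group candidate files by size; only same-size files can be equal.
    sub size_groups {
        my %by_size;
        push @{ $by_size{ -s $_ } }, $_ for @_;
        return grep { @$_ > 1 } values %by_size;
    }

    # Within each group, confirm equality byte for byte
    # (compare() returns 0 when the files are identical).
    sub duplicate_pairs {
        my @pairs;
        for my $group ( size_groups( @_ ) ) {
            for my $i ( 0 .. $#$group - 1 ) {
                for my $j ( $i + 1 .. $#$group ) {
                    push @pairs, [ $group->[$i], $group->[$j] ]
                        if compare( $group->[$i], $group->[$j] ) == 0;
                }
            }
        }
        return @pairs;
    }
    ```

    The size test is free (it comes from the directory metadata, no file reads at all), though the pairwise comparison within a group is quadratic in the group size, which is where a hash as a middle screen starts to pay off.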

Re: [OT]:Faster signature algorithm than md5?
by RichardK (Parson) on Sep 01, 2012 at 15:54 UTC

    I would expect the file access time to dominate everything else, so choice of hash won't matter too much.

    You could try benchmarking different ones to see, maybe sha256 & md5?
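    Such a benchmark might look like this (a sketch using the core Benchmark module on an in-memory buffer, so it measures digest cost without the file IO; the buffer size and iteration count are arbitrary):

    ```perl
    use strict;
    use warnings;
    use Benchmark qw( cmpthese );
    use Digest::MD5;
    use Digest::SHA;

    my $data = 'x' x ( 1024 * 1024 );    # 1MB in-memory buffer

    # Compare throughput of the candidate digests on identical input.
    cmpthese( 100, {
        md5    => sub { Digest::MD5->new->add( $data )->hexdigest },
        sha1   => sub { Digest::SHA->new( 1 )->add( $data )->hexdigest },
        sha256 => sub { Digest::SHA->new( 256 )->add( $data )->hexdigest },
    } );
    ```

    cmpthese prints a rate table with pairwise percentage differences, which makes the linear per-byte cost of each digest easy to compare directly.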

    There was an interesting thread at high speed checksum for video finger printing?. You could possibly use a similar technique for your larger files, but you've still got to read it off the disk.

    OTOH you could just use rsync and go get a coffee :)

Re: [OT]:Faster signature algorithm than md5?
by Kenosis (Priest) on Sep 01, 2012 at 16:48 UTC
Re: [OT]:Faster signature algorithm than md5?
by swampyankee (Parson) on Sep 02, 2012 at 01:46 UTC

    Thanks everybody. I hadn't even thought to check file size before comparing md5 signatures, nor had I thought to take a checksum of a good-sized chunk of the file as a second screen.

    A big "thank you!" to all of you.


Re: [OT]:Faster signature algorithm than md5?
by Anonymous Monk on Sep 17, 2012 at 17:21 UTC

Node Type: perlquestion [id://991162]
Approved by marto
Front-paged by Corion