Your skill will accomplish
what the force of many cannot
Re: [OT]:Faster signature algorithm than md5?by davido (Archbishop)
|on Sep 01, 2012 at 17:17 UTC||Need Help??|
Rather than skipping large files, just limit the size of the portion of the file you calculate an MD5 on. Limit it to, for example, 2^16. You will have false-positive collisions. But they will be relatively infrequent. For each of collisions, go back and hash the entire file.
Now in the unlikely event that you get a second round of collisions, you can do a byte by byte comparison. So the steps would be:
Git uses the SHA-1 algorithm to uniquely identify commits, as well as objects under its purview. I think that SHA-1 was chosen over MD5 because it has a smoother distribution (less likely for pathological datasets to have a higher than average possibility of causing a collision). Though still linear, it's a slower algorithm than MD5. But if you intend to follow the suggestion of only hashing the first 64, 128, or 256k of each file on first pass, speed shouldn't be such an issue, and reliability improves to the point that you may never have to do a byte-by-byte comparison as a third pass.
If you switch to the slower SHA1 algorithm, and follow the steps above, I doubt you would ever exercise step three except for when you come across truly identical files, which would produce a collision at step three also. Theoretically it's possible to get a SHA1 false positive collision, but the distribution is pretty smooth, and 1/1000000000000000000000000000000000000000000000 is a really small probability.