Re: Remove Duplicate Files

by gaal (Parson)
on Oct 29, 2004 at 07:07 UTC

in reply to Remove Duplicate Files

MD5 collisions are rare, but they can happen. If you want to be really safe, your storage should not just keep track of seen hashes; it should make them the key of a list of files that have those hashes. Then when you detect a seen hash, you should byte-compare the new file with all the existing files on that list.

This, of course, is slower, adds complexity, and will rarely be useful; but personally, I want code that deletes files automatically to be correct!
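
A minimal sketch of the hash-then-verify idea above, in Python. The thread has no code of its own, so the language and the names `md5_of` and `find_verified_duplicates` are assumptions for illustration, not anything from the original post. Each digest keys a list of files already kept, and a new file only counts as a duplicate once a byte comparison confirms it.

```python
import filecmp
import hashlib
from collections import defaultdict


def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_verified_duplicates(paths):
    """Return (duplicate, original) pairs confirmed by a byte comparison."""
    seen = defaultdict(list)  # digest -> files kept so far with that digest
    duplicates = []
    for path in paths:
        digest = md5_of(path)
        for kept in seen[digest]:
            # shallow=False makes filecmp compare contents, not just stat data
            if filecmp.cmp(path, kept, shallow=False):
                duplicates.append((path, kept))
                break
        else:
            # No byte-identical file under this digest: keep it as an original
            seen[digest].append(path)
    return duplicates
```

The function only reports pairs; deciding which copy to delete is deliberately left to the caller.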

Re^2: Remove Duplicate Files
by DrHyde (Prior) on Oct 29, 2004 at 08:27 UTC
    Agreed. You can make it a lot more efficient by stat()ing all the files and only comparing the contents of those that are the same size. Another small improvement comes from noting that files with the same device number and inode number are guaranteed to be the same file, so there is no need to compare their contents, although this may not be portable to non-Unixy platforms.

    You should also be careful about how you compare symlinks and device files.
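
The size pre-filter described above might look roughly like this in Python. The function name `group_by_size` is hypothetical, and skipping everything that is not a regular file is one assumed way of "being careful" with symlinks and device files, not the only one.

```python
import os
import stat
from collections import defaultdict


def group_by_size(paths):
    """Return lists of regular files that share a size; only these need comparing."""
    by_size = defaultdict(list)
    for path in paths:
        info = os.lstat(path)  # lstat does not follow symlinks
        if not stat.S_ISREG(info.st_mode):
            continue  # skip symlinks, device files, directories, FIFOs
        by_size[info.st_size].append(path)
    # A file with a unique size cannot have a duplicate
    return [group for group in by_size.values() if len(group) > 1]
```

Only the groups returned here would go on to hashing or byte comparison, so files with a unique size are never read at all.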

      A further improvement can be made by reading in just the first 1024 bytes or so and calculating the MD5 from that. Only if those match do you do a full comparison.
      Then again, hardlinks are less of a concern for cleanup, because they don't waste disk space.
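
A rough Python sketch of the partial-hash idea above. The names `prefix_digest` and `candidates_by_prefix` are made up for illustration, and the 1024-byte default simply follows the "or so" figure in the post. A matching prefix only nominates candidates; it proves nothing about the rest of the file.

```python
import hashlib
from collections import defaultdict


def prefix_digest(path, prefix_len=1024):
    """Return the MD5 hex digest of the first prefix_len bytes of a file."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read(prefix_len)).hexdigest()


def candidates_by_prefix(paths, prefix_len=1024):
    """Group files whose first prefix_len bytes hash identically."""
    by_prefix = defaultdict(list)
    for path in paths:
        by_prefix[prefix_digest(path, prefix_len)].append(path)
    # Groups of one cannot contain duplicates; the rest still need a full comparison
    return [group for group in by_prefix.values() if len(group) > 1]
```

Two files that differ only after the first kilobyte land in the same group, which is why the full comparison is still required afterwards.
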
        Well, any program that compares files and removes duplicates without checking whether they are links will remove excess links. By looking at the inode and device numbers to detect links, you can gain one of two things: the option to *keep* links, which can be pretty useful for binaries that act differently depending on how they are invoked, or a speedier comparison, as you don't have to calculate the MD5 hash and then compare the entire file.
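
For the link detection discussed above, a small Python sketch (the name `group_hard_links` is an assumption, and the portability caveat raised earlier in the thread still applies): paths that share a device number and inode number are the same file under several names, so they can be kept or skipped without reading any contents.

```python
import os
from collections import defaultdict


def group_hard_links(paths):
    """Group paths by (st_dev, st_ino); each group is one file with several names."""
    by_identity = defaultdict(list)
    for path in paths:
        info = os.lstat(path)  # do not follow symlinks
        by_identity[(info.st_dev, info.st_ino)].append(path)
    return [group for group in by_identity.values() if len(group) > 1]
```

A deduplicator could consult these groups first and either leave the extra names alone or treat them as already matched, depending on which of the two gains it wants.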

