PerlMonks  

Re^3: Assistance with file compare

by bichonfrise74 (Vicar)
on Oct 28, 2009 at 20:04 UTC ( #803781=note )


in reply to Re^2: Assistance with file compare
in thread Assistance with file compare

So, if the names are not the same, how would you then know that file_1 is equal to file_2?

You could run md5sum on all the files and build something like a hash (e.g. file_1 => md5sum output), then compare the digests. If the digests are the same, are the files the same? And what are the chances that two files have the same md5sum but are in reality not the same?

Re^4: Assistance with file compare
by keszler (Priest) on Oct 28, 2009 at 20:14 UTC
    MD5 hashes are typically 32 hexadecimal digits, so theoretically the chances of a collision are one in 32^16, i.e. 1,208,925,819,614,629,174,706,176.

      You made two errors.

      32 hexadecimal digits can represent 16^32 numbers. MD5 hashes are 128 bits in size. 32 hex digits are used to represent the hash since 16^32 is equal to 2^128.

      The chances of a collision cannot be ascertained since you haven't shown that every hash is equally likely to be generated.
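The arithmetic in this correction is easy to check. A minimal sketch in Python (chosen here only for brevity; the constant names are illustrative):

```python
# Count the values an MD5 digest can take, two equivalent ways.
HEX_DIGITS = 32   # length of an MD5 digest written in hexadecimal
BITS = 128        # size of an MD5 digest in bits

from_hex = 16 ** HEX_DIGITS   # 16 choices per hex digit
from_bits = 2 ** BITS         # 2 choices per bit

# The figure quoted in the first reply, 32**16, is actually 2**80.
mistaken = 32 ** 16

print(from_hex == from_bits)   # True
print(f"{from_bits:.1e}")      # 3.4e+38
print(mistaken == 2 ** 80)     # True
print(f"{mistaken:,}")         # 1,208,925,819,614,629,174,706,176
```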

        Yikes - I must have had a math-dyslexic moment. You are of course correct: 3.4e+38 possible results from the MD5 hash. My "theoretically" was my too-concise attempt at your "[not proven to be] equally likely to be generated".

        I do have a related anecdote: I worked on a site where users uploaded video clips, supposedly original content that they personally recorded. Some of them attempted to cheat by changing the filename and uploading dupes in all but name to increase their stats.

        One counter-measure I implemented was storing MD5 hashes for each clip. Generating hashes for the existing clips - over 250,000 of them - took many days. Newly uploaded clips, of course, had theirs generated on the fly.

        Within days of implementing the hash check, a user complained that his clip was tagged as duplicate, but he swore he'd never uploaded it before. Turned out he was correct. It was not a hash collision; he was trying to upload a clip that someone else had previously put into the system. (IIRC, neither of them was the actual owner...)

        In the months that followed, as tens of thousands more files were uploaded, every occurrence of duplicate hashes turned out to be duplicate files.

        Granted, less than 500,000 versus 3.4e+38 is far from a definitive test, but I think it's safe to say that the chances of a hash collision are vanishingly remote.
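To put a rough number on "vanishingly remote", the usual birthday-bound approximation can be applied. This is only a sketch: it assumes every digest is equally likely, which (as noted earlier in the thread) has not been shown for MD5, and it says nothing about deliberately crafted collisions. The function name `collision_probability` is made up for this example.

```python
import math

def collision_probability(n_items: int, digest_bits: int = 128) -> float:
    """Approximate chance that at least two of n_items share a digest.

    Assumes digests are uniformly distributed over 2**digest_bits values.
    """
    space = 2 ** digest_bits
    pairs = n_items * (n_items - 1) // 2   # number of distinct file pairs
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-pairs / space)

p = collision_probability(500_000)
print(f"{p:.1e}")
```

Under that assumption, 500,000 files give a probability on the order of 1e-28 that any two of them collide by accident.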

Re^4: Assistance with file compare
by Karger78 (Beadle) on Oct 28, 2009 at 20:18 UTC
    Some of the files would be the same. The names may change; the file size etc. would stay the same. So the only variable changing would be the name.

      Well, there's your answer right there: just compare the file sizes. Then only compare (or md5) files whose sizes match. As long as your files aren't all the same length, that could be the fastest.

        just compare the file sizes. Then only compare (or md5) files whose sizes match.

        Not quite. If you need to be absolutely sure the files are identical, the following are efficient ways of achieving this:

        1. Identify files with the same file size.
        2. Of the files with the same file size, identify the files which are identical.

        or

        1. Identify the files with the same hash.
        2. Of the files with the same hash, identify the files which are identical.

        or

        1. Identify files with the same file size.
        2. Of the files with the same file size, identify the files with the same hash.
        3. Of the files with the same file size and hash, identify the files which are identical.

        If you're dealing with many files, the second method is probably the best.
        If you're dealing with just a few files, the first method is probably better.
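The third variant (size, then hash, then a full comparison) could be sketched roughly as follows. This is an illustration in Python rather than anyone's actual code from the thread; `md5_of` and `find_duplicates` are hypothetical helper names, and the file paths are assumed to be readable regular files.

```python
import filecmp
import hashlib
import os
from collections import defaultdict

def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(paths):
    """Return lists of paths whose contents are byte-for-byte identical."""
    # Step 1: group by file size (cheap, no file reads needed).
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        # Step 2: within a size group, group by MD5 digest.
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[md5_of(path)].append(path)
        # Step 3: confirm with a full comparison, so a hash collision
        # can never be reported as a duplicate.
        for remaining in by_hash.values():
            while len(remaining) > 1:
                first, rest = remaining[0], remaining[1:]
                same = [first] + [p for p in rest
                                  if filecmp.cmp(first, p, shallow=False)]
                if len(same) > 1:
                    groups.append(same)
                remaining = [p for p in rest if p not in same]
    return groups
```

Calling `find_duplicates` on a list of paths returns one list per set of identical files, regardless of their names. Dropping step 2 gives the first method from the list above; dropping step 1 gives the second.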
