http://www.perlmonks.org?node_id=120971


in reply to Re: Auto-reaping of duplicates
in thread Auto-reaping of duplicates

It's important to remember that MD5 produces a hash of the content. Just because two items produce the same checksum doesn't mean their content is identical (if it did, we would never need the infinite supply of monkeys, as there would only be 2^128 possible texts). It's almost a certainty, but not quite. It would certainly be highly embarrassing to block somebody's five-page thesis because it happened to have the same MD5 checksum as an existing "me too!" post.

So MD5 can be used as a first cut for uniqueness, but still has to be followed up with a more precise check if the checksums do turn out to be the same. Just in case...
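
A minimal sketch of that two-step check, in Python purely for illustration. The `DuplicateDetector` class and its in-memory dictionary are hypothetical; nothing here reflects how the site actually stores or compares nodes.

```python
import hashlib


class DuplicateDetector:
    """Tracks previously seen posts, keyed by their MD5 digest (hypothetical helper)."""

    def __init__(self):
        # digest -> list of full texts that produced that digest
        self._seen = {}

    def is_duplicate(self, text: str) -> bool:
        """Return True only if this exact text has been seen before."""
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        candidates = self._seen.setdefault(digest, [])
        # First cut: an unseen digest gives an empty list, so the post is new.
        # Second cut: on a digest match, compare the actual content.
        if text in candidates:
            return True
        candidates.append(text)
        return False


detector = DuplicateDetector()
print(detector.is_duplicate("me too!"))  # first sighting
print(detector.is_duplicate("me too!"))  # exact repeat
```

The full comparison only runs when the digests already match, so the extra safety costs almost nothing in the common case.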

Replies are listed 'Best First'.
Re: Re: Re: Auto-reaping of duplicates
by blakem (Monsignor) on Oct 24, 2001 at 05:07 UTC
      only be 2^128 possible texts

    only? Only!? Do you have any idea how big 2^128 is?
    2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456 > 3*10^38

    Which is bigger than the number of cups of water in all the oceans (6*10^21)
    Bigger than the distance from one end of the universe to the other in inches...(2*10^28)
    Bigger than the volume of the sun in cubic inches...(8*10^31)
    Bigger than the area of the galaxy in square miles...(3*10^35)
    Approaching the number of atoms in our atmosphere.....(2*10^44)

    (from bignum)

    Perhaps a secondary check is in order, but I'd hardly use 'only' when talking about 2^128 hash buckets.

    Update:
    OK, let's play with the numbers some more:

    Let's assume perlmonks has 300,000 nodes (3*10^5) and has 3*10^38 buckets in its hashing algorithm. The ratio of nodes/buckets is 3*10^5 : 3*10^38 or 1 : 10^33.

    Now, consider this lottery where you pick six different numbers from 1-49. Get all six right and you win the jackpot. As the page above notes, the chances of winning with one ticket are:

    1 : 13,983,816 ( (49*48*47*46*45*44)/(6*5*4*3*2*1) ) or about:
    1 : 10^7

    Let's buy one ticket a week for four weeks... odds of winning *all* four lotteries with our four tickets are: 1 : (10^7)^4 or 1 : 10^28.

    That *still* doesn't get you there... after winning your four lotteries, we'll take you to one of the new huge NFL stadiums being built, and you have to gamble all your winnings on picking a specific, randomly-chosen seat (1 : 10^5).

    So the chances of my next post colliding with a node already in the database (1 : 10^33) are about the same as you winning four lotteries on four tickets, then picking the single correct seat out of a gigantic stadium (1 : 10^28 * 10^5).
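
    For anyone who wants to check the arithmetic, here is a rough sketch in Python. The node count and stadium size are the round figures assumed in this post, not measured values, and the variable names are only illustrative.

```python
from math import comb, log10

buckets = 2 ** 128        # number of possible MD5 digests
nodes = 300_000           # node count assumed in the post
lottery = comb(49, 6)     # ways to pick 6 different numbers from 1-49
stadium_seats = 10 ** 5   # stadium size assumed in the post

# Odds against one new post landing on any digest already in use
post_odds = buckets / nodes
# Odds against four jackpots on four tickets, then one specific seat
story_odds = lottery ** 4 * stadium_seats

print(f"2^128             = {buckets:,}")
print(f"6-from-49 lottery = 1 : {lottery:,}")
print(f"new post collides = 1 : 10^{log10(post_odds):.1f}")
print(f"lotteries + seat  = 1 : 10^{log10(story_odds):.1f}")
```

    Both sets of odds come out within an order of magnitude of 1 : 10^33, which is all the comparison needs.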

    -Blake

Re: Re: Re: Auto-reaping of duplicates
by demerphq (Chancellor) on Oct 24, 2001 at 05:27 UTC
    Well, I have quite a bit of trouble believing that two posts with different authors and different names would generate the same MD5. I suppose it's possible, but I guess the post would have to be very, very long indeed. I seriously doubt that it's possible to get the same MD5 from different data when the data is small, especially as small as a post would be. But then I don't know the full workings of MD5...

    Of course, the extra check is cheap, so why not...

    :-)

    Yves
    --
    You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)

      It is possible to get conflicts once there are more than 2**128 possible posts, which takes just over 128 bits to enumerate. That fits inside of 43 bytes (17 bytes of arbitrary data are already enough), so it is certainly possible to get two posts that collide.

      In fact, as long as we stay well below 2**64 posts (order of magnitude), the odds are very good that there are no accidental duplicates at all. It is possible that there is one somewhere, but the odds are negligible.
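
      A rough way to put numbers on that threshold is the usual birthday approximation. The Python sketch below assumes MD5 digests of ordinary posts behave like uniformly random 128-bit values, which covers accidental collisions only; `collision_probability` is a name made up for this example.

```python
from math import expm1


def collision_probability(posts: int, bits: int = 128) -> float:
    """Approximate chance that any two of `posts` random digests collide.

    Birthday approximation: 1 - exp(-n*(n-1) / 2^(bits+1)).
    """
    pairs_over_space = posts * (posts - 1) / 2 ** (bits + 1)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny
    return -expm1(-pairs_over_space)


for n in (300_000, 2 ** 32, 2 ** 64):
    print(f"{n:>22,} posts: p = {collision_probability(n):.3g}")
```

      With a few hundred thousand posts the probability is far too small to matter, and it only becomes appreciable as the count gets close to 2**64.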