http://www.perlmonks.org?node_id=120240


in reply to Auto-reaping of duplicates

Well, if it's exact duplicates you're worried about, then why not set up a unique index containing the MD5 checksum of each post and prevent them from ever being allowed into the DB in the first place? The checksum would be pretty simple to calculate, very fast, and fairly low in memory overhead as well (OTOH I haven't looked into the Everything code). If it was configured to quietly ignore the dupes, I would guess it would be an easy fix.
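
Something along these lines might do it. Just a sketch, not actual Everything code; the node_digest table, its columns, and the MySQL-style INSERT IGNORE are all invented for illustration:

    use strict;
    use warnings;
    use DBI;
    use Digest::MD5 qw(md5_hex);

    # Hypothetical table:
    #   CREATE TABLE node_digest (
    #     node_id INT      NOT NULL,
    #     digest  CHAR(32) NOT NULL,
    #     UNIQUE KEY (digest)
    #   );

    sub insert_post_digest {
        my ($dbh, $node_id, $text) = @_;
        my $digest = md5_hex($text);    # 32-char hex checksum of the post body

        # INSERT IGNORE quietly skips the row when the digest already exists,
        # so an exact duplicate never makes it into the DB.
        my $rows = $dbh->do(
            'INSERT IGNORE INTO node_digest (node_id, digest) VALUES (?, ?)',
            undef, $node_id, $digest,
        );
        return $rows;    # 1 if the post was new, "0E0" (zero rows) if it was a dupe
    }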

Yves
--
You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)

Replies are listed 'Best First'.
Re: Re: Auto-reaping of duplicates
by tommyw (Hermit) on Oct 24, 2001 at 04:50 UTC

    It's important to remember that MD5 produces a hash of the content. Just because two items produce the same checksum doesn't mean their content is identical (if it did, then we would never need the infinite supply of monkeys, as there would only be 2^128 possible texts). It's almost a certainty, but not quite. It'd certainly be highly embarrassing to block somebody's 5-page thesis because it happened to have the same MD5 checksum as an already existing "me too!" post.

    So MD5 can be used as a first cut for uniqueness, but still has to be followed up with a more precise check if the checksums do turn out to be the same. Just in case...
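
    Something like this, perhaps (only a sketch; the node table and the doctext/digest column names are invented): look up any existing post with the same checksum, then only call it a duplicate if the full text really matches.

        use strict;
        use warnings;
        use DBI;
        use Digest::MD5 qw(md5_hex);

        # Sketch: assumes a "node" table with a "doctext" column and an
        # indexed "digest" column holding md5_hex of the text.
        sub is_duplicate {
            my ($dbh, $text) = @_;
            my $digest = md5_hex($text);

            # First cut: cheap lookup on the checksum.
            my $rows = $dbh->selectcol_arrayref(
                'SELECT doctext FROM node WHERE digest = ?', undef, $digest,
            );

            # Precise check: compare the actual text, just in case two
            # different posts happen to share an MD5 checksum.
            for my $old (@$rows) {
                return 1 if $old eq $text;
            }
            return 0;
        }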

        only be 2^128 possible texts

      only? Only!? Do you have any idea how big 2^128 is?
      2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456 > 3*10^38

      Which is bigger than the number of cups of water in all the oceans (6*10^21)
      Bigger than the distance from one end of the universe to the other in inches...(2*10^28)
      Bigger than the volume of the sun in cubic inches...(8*10^31)
      Bigger than the area of the galaxy in square miles...(3*10^35)
      Approaching the number of atoms in our atmosphere.....(2*10^44)

      (from bignum)

      Perhaps a secondary check is in order, but I'd hardly use 'only' when talking about 2^128 hash buckets.
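
      (The number itself is easy to reproduce exactly with Math::BigInt; plain floating point would only show you something like 3.4e+38.)

          use strict;
          use warnings;
          use Math::BigInt;

          # 2**128, computed exactly rather than as a float
          my $buckets = Math::BigInt->new(2)->bpow(128);
          print "$buckets\n";    # 340282366920938463463374607431768211456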

      Update:
      OK, let's play with the numbers some more:

      Let's assume perlmonks has 300,000 nodes (3*10^5) and has 3*10^38 buckets in its hashing algorithm. The ratio of nodes/buckets is 3*10^5 : 3*10^38, or 1 : 10^33.

      Now, consider a lottery where you pick six different numbers from 1 to 49. Get all six right and you win the jackpot. The chances of winning with one ticket are:

      1 : 13,983,816 ( (49*48*47*46*45*44)/(6*5*4*3*2*1) ) or about:
      1 : 10^7

      Let's buy one ticket a week for four weeks... the odds of winning *all* four lotteries with our four tickets are 1 : (10^7)^4, or 1 : 10^28.

      That *still* doesn't get you there... after winning your four lotteries, we'll take you to one of the new huge NFL stadiums being built, and you have to gamble all your winnings on picking a specific, randomly-chosen seat (1 : 10^5).

      So the chances of my next post colliding with a node already in the database (1 : 10^33) are about the same as you winning four lotteries on four tickets, then picking the single correct seat out of a gigantic stadium (1 : 10^28 * 10^5).
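
      (For anyone who wants to check the arithmetic, a quick sketch:)

          use strict;
          use warnings;

          # One-ticket odds in a pick-6-from-49 lottery: C(49,6)
          my $lottery = (49*48*47*46*45*44) / (6*5*4*3*2*1);    # 13,983,816, about 10**7

          # Four jackpots on four tickets, then one seat out of ~10**5
          my $stunt = $lottery**4 * 1e5;                        # about 10**33

          printf "one ticket:  1 in %d\n",   $lottery;
          printf "whole stunt: 1 in %.3g\n", $stunt;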

      -Blake

      Well, I have quite a bit of trouble believing that two posts, with different authors and different names, would generate the same MD5. I suppose it's possible, but I'd guess the post would have to be very, very long indeed. I seriously doubt that it's possible to get the same MD5 from different data when the data is small, especially as small as a post would be. But then I don't know the full workings of MD5...

      Of course, the extra check is cheap, so why not...

      :-)

      Yves
      --
      You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)

        It is possible to get collisions once there are more than 2**128 possible posts, which only requires 128 bits of information; at roughly three bits of information per character of English text, that fits inside about 43 bytes, so it is certainly possible to get two posts that collide.

        In fact, as long as we stay well below 2**64 posts (as an order of magnitude), the odds are very good that there are no accidental duplicates at all. It is possible that somewhere there is one, but the odds are negligible.
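
        A quick back-of-the-envelope version of that, using the standard birthday-paradox approximation (a plain float stands in for 2**128, which is plenty of precision here):

            use strict;
            use warnings;

            my $buckets = 2 ** 128;      # possible MD5 values, ~3.4e+38 as a float
            my $posts   = 300_000;       # rough number of nodes

            # Birthday-paradox approximation: P(any collision) ~ n*(n-1) / (2 * buckets)
            my $p = $posts * ($posts - 1) / (2 * $buckets);
            printf "odds of any accidental duplicate digest: about 1 in %.2g\n", 1 / $p;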