http://www.perlmonks.org?node_id=120240


in reply to Auto-reaping of duplicates

Well, if it's exact duplicates you're worried about, then why not set up a unique index containing the MD5 checksum of each post and prevent them from ever being allowed into the DB in the first place? The checksum would be pretty simple to calculate, very fast, and fairly low in memory overhead as well (OTOH I haven't looked into the Everything code). If it was configured to quietly ignore the dupes, I would guess it would be an easy fix.
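
Something along these lines might do it. Just a sketch, not actual Everything code; the node_digest table, its columns, and the MySQL-style INSERT IGNORE are all invented for illustration:

    use strict;
    use warnings;
    use DBI;
    use Digest::MD5 qw(md5_hex);

    # Hypothetical table:
    #   CREATE TABLE node_digest (
    #     node_id INT      NOT NULL,
    #     digest  CHAR(32) NOT NULL,
    #     UNIQUE KEY (digest)
    #   );

    sub insert_post_digest {
        my ($dbh, $node_id, $text) = @_;
        my $digest = md5_hex($text);    # 32-char hex checksum of the post body

        # INSERT IGNORE quietly skips the row when the digest already exists,
        # so an exact duplicate never makes it into the DB.
        my $rows = $dbh->do(
            'INSERT IGNORE INTO node_digest (node_id, digest) VALUES (?, ?)',
            undef, $node_id, $digest,
        );
        return $rows;    # 1 if the post was new, "0E0" (zero rows) if it was a dupe
    }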

Yves
--
You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)

Replies are listed 'Best First'.
Re: Re: Auto-reaping of duplicates
by tommyw (Hermit) on Oct 24, 2001 at 04:50 UTC

    It's important to remember that MD5 produces a hash of the content. Just because two items produce the same checksum doesn't mean their content is identical (if it did, then we would never need the infinite supply of monkeys, as there would only be 2^128 possible texts). It's almost a certainty, but not quite. It'd certainly be highly embarrassing to block somebody's 5-page thesis because it happened to have the same MD5 checksum as an already existing "me too!" post.

    So MD5 can be used as a first cut for uniqueness, but still has to be followed up with a more precise check if the checksums do turn out to be the same. Just in case...
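
    Something like this, perhaps (only a sketch; the node table and the doctext/digest column names are invented): look up any existing post with the same checksum, then only call it a duplicate if the full text really matches.

        use strict;
        use warnings;
        use DBI;
        use Digest::MD5 qw(md5_hex);

        # Sketch: assumes a "node" table with a "doctext" column and an
        # indexed "digest" column holding md5_hex of the text.
        sub is_duplicate {
            my ($dbh, $text) = @_;
            my $digest = md5_hex($text);

            # First cut: cheap lookup on the checksum.
            my $rows = $dbh->selectcol_arrayref(
                'SELECT doctext FROM node WHERE digest = ?', undef, $digest,
            );

            # Precise check: compare the actual text, just in case two
            # different posts happen to share an MD5 checksum.
            for my $old (@$rows) {
                return 1 if $old eq $text;
            }
            return 0;
        }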

        only be 2^128 possible texts

      only? Only!? Do you have any idea how big 2^128 is?
      2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456 > 3*10^38

      Which is bigger than the number of cups of water in all the oceans (6*10^21)
      Bigger than the distance from one end of the universe to the other in inches...(2*10^28)
      Bigger than the volume of the sun in cubic inches...(8*10^31)
      Bigger than the area of the galaxy in square miles...(3*10^35)
      Approaching the number of atoms in our atmosphere.....(2*10^44)

      (from bignum)

      Perhaps a secondary check is in order, but I'd hardly use 'only' when talking about 2^128 hash buckets.
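
      (The number itself is easy to reproduce exactly with Math::BigInt; plain floating point would only show you something like 3.4e+38.)

          use strict;
          use warnings;
          use Math::BigInt;

          # 2**128, computed exactly rather than as a float
          my $buckets = Math::BigInt->new(2)->bpow(128);
          print "$buckets\n";    # 340282366920938463463374607431768211456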

      Update:
      OK, let's play with the numbers some more:

      Let's assume perlmonks has 300,000 nodes (3*10^5) and has 3*10^38 buckets in its hashing algorithm. The ratio of nodes/buckets is 3*10^5 : 3*10^38, or 1 : 10^33.

      Now, consider a lottery where you pick six different numbers from 1 to 49. Get all six right and you win the jackpot. The chances of winning with one ticket are:

      1 : 13,983,816 ( (49*48*47*46*45*44)/(6*5*4*3*2*1) ) or about:
      1 : 10^7

      Let's buy one ticket a week for four weeks... the odds of winning *all* four lotteries with our four tickets are 1 : (10^7)^4, or 1 : 10^28.

      That *still* doesn't get you there... after winning your four lotteries, we'll take you to one of the new huge NFL stadiums being built, and you have to gamble all your winnings on picking a specific, randomly-chosen seat (1 : 10^5).

      So the chances of my next post colliding with a node already in the database (1 : 10^33) are about the same as you winning four lotteries on four tickets, then picking the single correct seat out of a gigantic stadium (1 : 10^28 * 10^5).
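
      (For anyone who wants to check the arithmetic, a quick sketch:)

          use strict;
          use warnings;

          # One-ticket odds in a pick-6-from-49 lottery: C(49,6)
          my $lottery = (49*48*47*46*45*44) / (6*5*4*3*2*1);    # 13,983,816, about 10**7

          # Four jackpots on four tickets, then one seat out of ~10**5
          my $stunt = $lottery**4 * 1e5;                        # about 10**33

          printf "one ticket:  1 in %d\n",   $lottery;
          printf "whole stunt: 1 in %.3g\n", $stunt;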

      -Blake

      Well, I have quite a bit of trouble believing that two posts, with different authors and different names, would generate the same MD5. I suppose it's possible, but I'd guess the post would have to be very, very long indeed. I seriously doubt that it's possible to get the same MD5 from different data when the data is small, especially as small as a post would be. But then I don't know the full workings of MD5...

      Of course, the extra check is cheap, so why not...

      :-)

      Yves
      --
      You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)

        It is possible to get collisions once there are more than 2**128 possible posts, which only requires 128 bits of information; at roughly three bits of information per character of English text, that fits inside about 43 bytes, so it is certainly possible to get two posts that collide.

        In fact, as long as we stay well below 2**64 posts (as an order of magnitude), the odds are very good that there are no accidental duplicates at all. It is possible that somewhere there is one, but the odds are negligible.
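
        A quick back-of-the-envelope version of that, using the standard birthday-paradox approximation (a plain float stands in for 2**128, which is plenty of precision here):

            use strict;
            use warnings;

            my $buckets = 2 ** 128;      # possible MD5 values, ~3.4e+38 as a float
            my $posts   = 300_000;       # rough number of nodes

            # Birthday-paradox approximation: P(any collision) ~ n*(n-1) / (2 * buckets)
            my $p = $posts * ($posts - 1) / (2 * $buckets);
            printf "odds of any accidental duplicate digest: about 1 in %.2g\n", 1 / $p;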