Re^3: [OT] The statistics of hashing. (4<10)
by tye (Sage) on Apr 01, 2012 at 19:00 UTC
You appear to have not fully understood several parts of what I wrote. (Which is not an insult.)
And after 1.5 billion inserts, that calculation suggests that the odds of finding a value that doesn't match would be minuscule, and the "possible dups" count should be growing at almost the same rate as the new inserts are being tested.
That actually is not a valid conclusion from anything I wrote.
The odds of the next insert being a collision after about 1.5e9 inserts is about (1.5e9/2**32)**4, or under 1.5%. So the odds of finding a value that doesn't match are about 98.5%, which is not "minuscule" (and about 1/100th as fast is not "almost the same rate"). But that is also ignoring that there won't actually be 1.5e9 bits set. You make no mention of how many bits are set, which is how you'd get an accurate calculation, so I'm curious if you perhaps didn't fully appreciate that point (as opposed to some other reason, like it perhaps not being easy to query the number of bits set in your current run, etc.).
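To make the point about set bits concrete, here is a minimal sketch in Python (not the code from the thread). It assumes each hash has its own 2**32-bit vector and that the hashes are independent and uniform; the function names are hypothetical. It compares the simple estimate above with one that uses the expected fraction of bits actually set, since some inserts land on bits that are already set.

```python
import math

BITS_PER_VECTOR = 2**32  # assumed: one 2**32-bit vector per hash


def naive_collision_odds(inserts, hashes):
    """Estimate that treats every insert as setting a fresh bit."""
    return (inserts / BITS_PER_VECTOR) ** hashes


def expected_fill(inserts):
    """Expected fraction of bits set in one vector after `inserts`
    uniform inserts: 1 - (1 - 1/m)**n, computed stably."""
    return -math.expm1(inserts * math.log1p(-1.0 / BITS_PER_VECTOR))


def corrected_collision_odds(inserts, hashes):
    """Estimate based on the expected number of bits actually set."""
    return expected_fill(inserts) ** hashes


if __name__ == "__main__":
    n = 1.5e9
    print("naive odds, 4 hashes:    ", naive_collision_odds(n, 4))
    print("expected fill per vector:", expected_fill(n))
    print("corrected odds, 4 hashes:", corrected_collision_odds(n, 4))
```

If the real number of set bits can be queried from a run, substituting that measured fraction for `expected_fill(n)` would give a better figure than either estimate.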
Probably much more important is that in some replies you hint that you are using 10 hashes not 4 hashes now.
I suspect that is the case in the numbers you show in your reply above, because you don't show collisions happening on the order of every 1e2 inserts. With 10 hashes, the above odds go to around 2.7e-5. And that is in the ballpark of what you report seeing.
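The difference between 4 and 10 hashes can be checked with a few lines of Python. This is only a sketch of the same simple formula, (inserts/2**32)**hashes, again assuming one 2**32-bit vector per hash; `collision_odds` is a hypothetical name, and the estimate still ignores that fewer than 1.5e9 bits are actually set.

```python
BITS_PER_VECTOR = 2**32  # assumed: one 2**32-bit vector per hash


def collision_odds(inserts, hashes):
    """Simple estimate of the odds that the next insert finds all of
    its bits already set: (inserts / 2**32) ** hashes."""
    return (inserts / BITS_PER_VECTOR) ** hashes


if __name__ == "__main__":
    n = 1.5e9
    for k in (4, 10):
        p = collision_odds(n, k)
        # 1/p is roughly how many inserts pass between reported collisions
        print(f"{k:2d} hashes: odds {p:.3g}, about one collision per {1 / p:.0f} inserts")
```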
If you need some help redoing the calculations for 10 hashes instead of 4, then please speak up. I was not exhaustive in my explanations nor expansions of formulae, so I would expect more clarification could be required. I can also understand you not taking the time to seriously study what I wrote yet given how the numbers seem so far off from your observations.