note
tobyink
<blockquote><p><i>What is the probability of getting false positives?</i></p></blockquote>
<p>Seems almost certain to me, but my reasoning may well be faulty. Stats were never my forte.</p>
<p>You're effectively running four different 32-bit hash functions on each string. <a href="http://en.wikipedia.org/wiki/Birthday_attack">According to Wikipedia</a> for a 32-bit hash you only need about 110,000 hashes to get a 75% chance of a collision.</p>
<p>Thus with 110,000 strings, getting a collision in all four stands at 75%**4, about 30% chance. That's with 110,000 strings. You have "several billion".</p>
<p>With several billion strings you are going to have a very close to 100% chance of a collision on a 32-bit hash function. With four 32-bit hash functions, that's still close to 100% chance of a collision.</p>
<p>Extending this to having 10 effective hash functions, this cannot increase the chance of collisions. Worst case scenario it has no effect. Here I think it does decrease the chance (difficult , but not enough to be useful.</p>
<!-- Node text goes above. Div tags should contain sig only -->
<div class="pmsig"><div class="pmsig-757127">
<small><small>
<tt>perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
</tt></small></small>
</div></div>
962802
962802