Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
I will do my very best to try and understand the formula

Perhaps this will help some.

Below is some fairly simple code that does the precise calculations of the odds for a collision in a single hash (and compares those calculations with the formula roboticus proposed). I came up with a simpler implementation than I expected to. I didn't even try to implement this at first (only a little because I expected it to be more cumbersome, but) mostly because I knew it would be impractical for computing odds for such a large number of insertions. It consumes O($inserts) for memory and O($inserts**2) for CPU.

$|= 1; my( $b )= ( @ARGV, 2**32 ); # Total number of bits in the hash. my $i= 1; # Number of insertions done so far. my @odds = 1; # $odds[$s] == Odds of their being $s+1 bi +ts set while( 1 ) { # Just hit Ctrl-C when you've seen enough my $exp= $b*( 1 - exp(-$i/$b) ); my $avg = 0; $avg += $_ for map $odds[$_]*($_+1), 0..$#odds; my $err = sprintf "%.6f%%", 100*($exp-$avg)/$avg; print "$i inserts, $err: avg=$avg exp=$exp bits set\n" if $i =~ /^\d0*$/; # Update @odds to in preparation for the next value of $i: for my $s ( reverse 0..$#odds ) { $odds[$s+1] += $odds[$s]*($b-$s-1)/$b; $odds[$s] *= ($s+1)/$b; } $i++; }

$i tracks the number of insertions done so far. $odds[$s] represents the odds of there being $s+1 bits set in the hash (after $i insertions). $avg is an average of these values of $s (1..@odds) weighted by the odds. But, more importantly, it is also the odds of getting a single-hash collision when inserting (after $i insertions) except multiplied by $bits ($b). I multiple it and 1-exp(-$i/$b) by $b to normalize to the expected number of set bits instead of the odds of a collision because humans have a much easier time identifying a number that is "close to 14" than a number that is "close to 14/2**32".

$odds[-1] turns out to exactly match (successively) the values from birthday problem. For low numbers of $inserts, this swamps the calculation of $avg (the other terms just don't add up to a significant addition), which is part of why I was computing it for some values in my first reply. (Since you asked about that privately.)

I have yet to refresh my memory of the exact power series expansion of exp($x), so what follows is actually half guesses, but I'm pretty confident of them based on vague memory and observed behavior.

For large $bits, 1-exp(-$inserts/$bits) ends up being close to 1/$bits because 1-exp(-$inserts/$bits) expands (well, "can be expanded") to a power series where 1/$bits is the first term and the next term depends on 1/$bits**2 which is so much smaller that it doesn't matter much (and nor do any of the subsequent terms, even when added together).

On the other hand, for large values of $inserts, 1-exp(-$inserts/$bits) is close to $avg because the formula for $avg matches the first $inserts terms of the power series expansion.

I hope the simple code makes it easy for you to see how these calculations match the odds I described above. But don't hesitate to ask questions if the correspondence doesn't seem clear to you. Running the code shows how my calculations match roboticus' formula. Looking up (one of) the power series expansions for computing exp($x) should match the values being computed for @odds, though there might be some manipulation required to make the match apparent (based on previous times I've done such work decades ago).

- tye        

In reply to Re^4: [OT] The statistics of hashing. (dissection) by tye
in thread [OT] The statistics of hashing. by BrowserUk

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and all is quiet...

    How do I use this? | Other CB clients
    Other Users?
    Others meditating upon the Monastery: (6)
    As of 2018-06-25 12:27 GMT
    Find Nodes?
      Voting Booth?
      Should cpanminus be part of the standard Perl release?

      Results (126 votes). Check out past polls.