No such thing as a small change PerlMonks

### Re: [OT] The statistics of hashing.

by roboticus (Chancellor)
 on Apr 02, 2012 at 13:57 UTC ( #963012=note: print w/replies, xml ) Need Help??

in reply to [OT] The statistics of hashing.

That problem sure was fun! I've got a handle on it, but I didn't do the math rigorously, but it agrees with my simulations. Anyway, here's what I think is going on:

1) The density of 1 bits is:

```                -x/N
d = 1 - e
[download]```

2) The probability of the next item colliding is:

```                   -x/N  h
p(hit) = (1 - e     )
[download]```

If that's true, then I expect that the number of hits is just the integral of p(hit) over x in (0..records). (I haven't actually done the integration, but I don't foresee any hitches.) I found an on-line symbolic integrator that you can use to determine it.

Anyway, if you want to play with it, here's the simulator code along with the sample output:

Note: In the code below:

```\$cnt_rex          x      The sample number
\$VecSizeBits      N      Number of bits in the vector
\$num_vecs         h      Number of vectors/hash functions
@vectors                 Content of vectors
@cnt_bits                Number of bits set to 1 in vector
@cnt_colbits             Number of collisions:  I've broken it out to
[0] = Total document collisions
[1] = Count of documents colliding with 1 vec
+tor
[2] = Count of docs colliding with 2 vectors
...
\$first_coll              Index of first collision...useless, should re
+move
\$first_coll_f            Index of first false positive (all vectors co
+llide)
[download]```
```\$ cat analyze_bloom.pl
#!/usr/bin/perl
#
# analyze_bloom.pl <NumBits> <NumVecs> <ln2(Samples)>
#
# Experiment with Bloom filter variant by BUk
#
use strict;
use warnings;
use autodie;
use Bit::Vector;
use Math::BigInt;

my \$VecSizeBits=shift or die "Missing # of bits!";
my \$num_vecs   =shift or die "missing # of vectors!";
my \$iterations =shift or die "missing iteration count!";
if (\$iterations<33) {
\$iterations = 1<<\$iterations;
}

my \$VecSize = 1 << \$VecSizeBits;
my \$VecMask = \$VecSize-1;

print "NumVecs=\$num_vecs\n";

my @col2numbits = (0, 1);
while (@col2numbits < 1<<\$num_vecs) {
@col2numbits = (@col2numbits, map { \$_+1 } @col2numbits);
}
my @cnt_colbits = (0) x (1+\$num_vecs);

my %report_cnts = map { (1<<\$_)=>0 } (8 .. 24);

my @vectors;
my @cnt_bits = (0) x \$num_vecs;
\$vectors[\$_] = Bit::Vector::new(0,\$VecSize) for 0 .. \$num_vecs-1;

{
my @HDR = (
q{                avg      avg   calc  },
q{BI  Samples   one bits pct 1s pct 1s },
q{-- ---------- -------- ------ ------ },
);
my \$t = int((\$num_vecs+1) * 8 - length(" # collisions "))/2;
\$HDR[0] .= "-"x\$t . " # collisions " . "-"x\$t;
\$HDR[1] .= "  all   ";
\$HDR[1] .= sprintf("   % 2u   ", \$_) for 1 .. \$num_vecs;
\$HDR[1] .= "   p(coll)";
\$HDR[2] .= ("-"x7 . " ") x (\$num_vecs+1) . "------------";
print join("\n", @HDR, "");
}

my \$first_coll=0;
my \$cnt_coll=0;
my \$first_coll_f=0;
my \$cnt_coll_f=0;
my \$cnt_rex=0;
while (++\$cnt_rex < \$iterations) {
my @h = map { int(rand(\$VecSize)) } 0 .. \$VecMask;
my \$bits=32;

my \$collision=0;
for my \$i (0 .. \$num_vecs-1) {
if (\$vectors[\$i]->bit_test(\$h[\$i])) {
\$collision |= 1<<\$i
}
else {
++\$cnt_bits[\$i]
}
\$vectors[\$i]->Bit_On(\$h[\$i]);
}
if (\$collision) {
++\$cnt_colbits[\$col2numbits[\$collision]];
++\$cnt_colbits[0];          # 'any' collision
++\$cnt_coll;
\$first_coll=\$cnt_rex        if !\$first_coll;
if (\$collision == \$VecMask) {
++\$cnt_coll_f;
\$first_coll_f=\$cnt_rex  if !\$first_coll_f;
}
}

if (exists \$report_cnts{\$cnt_rex}) {
my \$ttl_bits=0;
\$ttl_bits += \$_ for @cnt_bits;
my \$avg_bits = \$ttl_bits / \$num_vecs;
my \$avg_pct_full = 100.0 * \$ttl_bits / (\$num_vecs*\$VecSize);
my \$calc_pct_full = 100.0 * (1-exp(-\$cnt_rex/\$VecSize));
printf "% 2u %10u: % 7u %6.2f %6.2f",
\$VecSizeBits, \$cnt_rex, \$avg_bits, \$avg_pct_full, \$calc
+_pct_full;
print " ",
join(" ",
map {sprintf"% 7u",\$_} @cnt_colbits
);
my \$expected = (1-exp(-\$cnt_rex/\$VecSize))**\$num_vecs;
printf " %.10f", \$expected;
print "\n";
}
}

\$ ./analyze_bloom.pl 10  3  13
NumVecs=3
avg      avg   calc  --------- # collisions ---------
BI  Samples   one bits pct 1s pct 1s   all       1       2       3
+  p(coll)
-- ---------- -------- ------ ------ ------- ------- ------- ------- -
+-----------
10        256:     222  21.74  22.12      86      72      14       0 0
+.0108230772
10        512:     405  39.62  39.35     250     186      59       5 0
+.0609161842
10       1024:     647  63.22  63.21     699     351     265      83 0
+.2525804578
10       2048:     892  87.11  86.47    1711     494     677     540 0
+.6464623148
10       4096:    1012  98.83  98.17    3758     511    1000    2247 0
+.9460533270

\$ ./analyze_bloom.pl 12  5  15
NumVecs=5
avg      avg   calc  ----------------- # collisions --
+---------------
BI  Samples   one bits pct 1s pct 1s   all       1       2       3
+   4       5      p(coll)
-- ---------- -------- ------ ------ ------- ------- ------- ------- -
+------ ------- ------------
12        256:     248   6.06   6.06      37      36       1       0
+     0       0 0.0000008164
12        512:     479  11.70  11.75     138     114      23       1
+     0       0 0.0000223999
12       1024:     908  22.17  22.12     434     311     103      17
+     3       0 0.0005295634
12       2048:    1604  39.17  39.35    1296     634     447     170
+    44       1 0.0094309292
12       4096:    2572  62.81  63.21    3282     915    1038     788
+   443      98 0.1009251903
12       8192:    3542  86.47  86.47    7372     979    1392    1775
+  1968    1258 0.4833243641
12      16384:    4020  98.15  98.17   15564     980    1420    2052
+  3718    7394 0.9117155503
[download]```

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Replies are listed 'Best First'.
Re^2: [OT] The statistics of hashing.
by roboticus (Chancellor) on Apr 02, 2012 at 17:45 UTC

I'm done with the program now. I added a column to show the predicted number of false positives. The code is nearly unchanged. Basically, I moved:

```    my \$expected = (1-exp(-\$cnt_rex/\$VecSize))**\$num_vecs;
[download]```

outside of the reporting section, and then accumulate it:

```    \$cum_collisions += \$expected;
[download]```

It's not very fast, so I only ran it a couple of times, but it seems like it gives reasonable agreement:

```\$ ./analyze_bloom.pl 14 3 17
NumVecs=3
avg      avg   calc  --------- # collisions ---------
+            expected
BI  Samples   one bits pct 1s pct 1s   all       1       2       3
+  p(coll)   false pos
-- ---------- -------- ------ ------ ------- ------- ------- ------- -
+----------- -------------
14        256:     253   1.55   1.55       7       7       0       0 0
+.0000037264 0.0002414791
14        512:     504   3.08   3.08      22      22       0       0 0
+.0000291236 0.0037774699
14       1024:     993   6.06   6.06      89      87       2       0 0
+.0002224011 0.0581208314
14       2048:    1926  11.76  11.75     336     309      26       1 0
+.0016223627 0.8630370129
14       4096:    3637  22.20  22.12    1185    1005     170      10 0
+.0108230772 11.9418752510
14       8192:    6414  39.15  39.35    3986    2802    1022     162 0
+.0609161842 144.4751474482
14      16384:   10300  62.87  63.21   11216    5615    4166    1435 0
+.2525804578 1374.7071068207
14      32768:   14151  86.37  86.47   27361    7789   10655    8917 0
+.6464623148 8946.4018915720
14      65536:   16105  98.30  98.17   60117    8201   15657   36259 0
+.9460533270 36391.1792023071

\$ ./analyze_bloom.pl 16 4 18
NumVecs=4
avg      avg   calc  ------------- # collisions ------
+-------             expected
BI  Samples   one bits pct 1s pct 1s   all       1       2       3
+   4      p(coll)   false pos
-- ---------- -------- ------ ------ ------- ------- ------- ------- -
+------ ------------ ----------
16        256:     255   0.39   0.39       3       3       0       0
+     0 0.0000000002 0.0000000120
16        512:     510   0.78   0.78       5       5       0       0
+     0 0.0000000037 0.0000003784
16       1024:    1018   1.55   1.55      21      21       0       0
+     0 0.0000000578 0.0000119226
16       2048:    2021   3.08   3.08     107     106       1       0
+     0 0.0000008960 0.0003713063
16       4096:    3979   6.07   6.06     444     423      20       1
+     0 0.0000134746 0.0112771478
16       8192:    7749  11.82  11.75    1615    1466     141       8
+     0 0.0001906326 0.3256727521
16      16384:   14569  22.23  22.12    5843    4608    1066     158
+    11 0.0023940562 8.5228328538
16      32768:   25931  39.57  39.35   18293   11116    5475    1529
+   173 0.0239686508 185.0883611550
16      65536:   41690  63.61  63.21   49093   18714   17246   10354
+  2779 0.1596613002 2882.5123468408
16     131072:   56843  86.74  86.47  114278   21592   29488   36446
+ 26752 0.5589731543 26626.3779623356
[download]```

The updated code:

Both runs are still underway, I'll update this when they complete. Done.

Update: Added rest of the tables. As you can see from the two sample runs, there's good agreement between the actual false positive count and the expected count.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Re^2: [OT] The statistics of hashing.
by roboticus (Chancellor) on Apr 02, 2012 at 22:50 UTC

OK, I think I'm done with this diversion. I've built a program that will compute the expected number of collisions with no looping. Example output:

```\$ ./ana_2.pl 16 4 65536
N=65536, V=4, X=65536 integral(65536)=139415.765849051, integral(0)=13
+6533.333333333
Expected collisions: 2882.43251571807

\$ ./ana_2.pl 16 4 16384
N=65536, V=4, X=16384 integral(16384)=136541.854969116, integral(0)=13
+6533.333333333
Expected collisions: 8.52163578287582

\$ ./ana_2.pl 14 3 16384
N=16384, V=3, X=16384 integral(16384)=31411.9141476821, integral(0)=30
+037.3333333333
Expected collisions: 1374.58081434877

\$ ./ana_2.pl 16 10 32768
N=65536, V=10, X=32768 integral(32768)=191953.190301726, integral(0)=1
+91952.863492063
Expected collisions: 0.326809662627056
[download]```

The code is straightforward:

Notes:

1. The only testing I did is limited to the cases above. Be sure to run a few more before relying on it.
2. I entered the ex_1 and ex_2 routines, but didn't test 'em.
3. For the other numbers of vectors, just expand (1 - exp(-X/N))^V, and integrate it. The ex_1..n functions are coded straight from the integration.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Thank you roboticus.

I ran your algorithm against the results of my 60-hour run, and the correlation is simply stunning!

```#! perl -slw
use strict;

sub ex_10 {
my (\$N, \$X) = @_;
return       -\$N/10*exp(-10*\$X/\$N)
+  10*\$N/9 *exp( -9*\$X/\$N)
-  45*\$N/8 *exp( -8*\$X/\$N)
+ 120*\$N/7 *exp( -7*\$X/\$N)
-  35*\$N   *exp( -6*\$X/\$N)
+ 252*\$N/5 *exp( -5*\$X/\$N)
- 105*\$N/2 *exp( -4*\$X/\$N)
+  40*\$N   *exp( -3*\$X/\$N)
-  45*\$N/2 *exp( -2*\$X/\$N)
+  10*\$N   *exp( -1*\$X/\$N)
+ \$X;
}

my \$exp10_0 = ex_10( 2**32, 0 );

open I, '<', 'L25B32V10I32S1.posns' or die \$!;

while( <I> ) {
my( \$inserts, \$collisions ) = split;
my \$exp3 = ex_10( 2**32, \$inserts ) - \$exp10_0;
printf "Insert  %10d: Predicted  %6d Actual: %6d Delta: %+.3f%%\n"
+,
\$inserts, \$exp3, \$collisions, ( \$collisions - \$exp3 ) / \$exp3
+* 100;
}

__END__
C:\test>rbtcs-form-verify
Insert   779967210: Predicted       1 Actual:      1 Delta: -18.051%
Insert   782382025: Predicted       1 Actual:      2 Delta: +58.813%
Insert   830840395: Predicted       2 Actual:      3 Delta: +29.292%
Insert   882115420: Predicted       4 Actual:      4 Delta: -5.961%
Insert   883031614: Predicted       4 Actual:      5 Delta: +16.325%
Insert   897571477: Predicted       5 Actual:      6 Delta: +18.390%
Insert   923155269: Predicted       6 Actual:      7 Delta: +4.086%
Insert   948108745: Predicted       8 Actual:      8 Delta: -8.996%
Insert   954455113: Predicted       9 Actual:      9 Delta: -4.244%
Insert   967783959: Predicted      10 Actual:     10 Delta: -7.404%
Insert   988381482: Predicted      13 Actual:     11 Delta: -17.487%
Insert   992691311: Predicted      13 Actual:     12 Delta: -13.814%
Insert   995935158: Predicted      14 Actual:     13 Delta: -9.624%
Insert  1011301141: Predicted      16 Actual:     14 Delta: -16.457%
Insert  1013742872: Predicted      17 Actual:     15 Delta: -12.616%
Insert  1022258193: Predicted      18 Actual:     16 Delta: -14.242%
Insert  1031874023: Predicted      20 Actual:     17 Delta: -16.989%
Insert  1034026887: Predicted      20 Actual:     18 Delta: -13.909%
Insert  1036254774: Predicted      21 Actual:     19 Delta: -11.051%
Insert  1037064360: Predicted      21 Actual:     20 Delta: -7.093%
Insert  1037193945: Predicted      21 Actual:     21 Delta: -2.569%
Insert  1037309710: Predicted      21 Actual:     22 Delta: +1.957%
...( ~ 52000 ellided }
Insert  2375752842: Predicted   52209 Actual:  52277 Delta: +0.130%
Insert  2375755671: Predicted   52209 Actual:  52278 Delta: +0.131%
Insert  2375756509: Predicted   52209 Actual:  52279 Delta: +0.133%
Insert  2375760656: Predicted   52210 Actual:  52280 Delta: +0.133%
Insert  2375763928: Predicted   52211 Actual:  52281 Delta: +0.134%
Insert  2375785238: Predicted   52215 Actual:  52282 Delta: +0.128%
Insert  2375788721: Predicted   52215 Actual:  52283 Delta: +0.128%
Insert  2375789878: Predicted   52216 Actual:  52284 Delta: +0.130%
Insert  2375790896: Predicted   52216 Actual:  52285 Delta: +0.131%
Insert  2375798283: Predicted   52217 Actual:  52286 Delta: +0.131%

[download]```

And that still includes the possibility -- though I believe it to be remote -- that there are 1 or more actual dups in there.

My only regret now is that I wish I'd allowed the run to go to the 3/4 point instead of stopping it half way. The number of false positive would still have been easily manageable for the second pass:

```C:\test>rbtcs-bloom-probb 32 10 30e8
N=4294967296, V=10, X=30e8 integral(30e8)=12580199467.653, integral(0)
+=12579822861.8159
Expected collisions: 376605.837186813
[download]```

That is an order of magnitude less that my own crude attempt was predicting, hence why I stopped the run.

The only way I can repay you is to assure you that I will do my very best to try and understand the formula -- which I do not currently.

And of course, give you credit, when they world comes knocking at my door for my eminently patentable -- at least if you take Apple as your guide -- algorithm :)

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

I will do my very best to try and understand the formula

Perhaps this will help some.

Below is some fairly simple code that does the precise calculations of the odds for a collision in a single hash (and compares those calculations with the formula roboticus proposed). I came up with a simpler implementation than I expected to. I didn't even try to implement this at first (only a little because I expected it to be more cumbersome, but) mostly because I knew it would be impractical for computing odds for such a large number of insertions. It consumes O(\$inserts) for memory and O(\$inserts**2) for CPU.

```\$|= 1;
my( \$b )= ( @ARGV, 2**32 ); # Total number of bits in the hash.
my \$i= 1;                   # Number of insertions done so far.
my @odds = 1;               # \$odds[\$s] == Odds of their being \$s+1 bi
+ts set
while( 1 ) {                # Just hit Ctrl-C when you've seen enough
my \$exp= \$b*( 1 - exp(-\$i/\$b) );
my \$avg = 0;
\$avg += \$_ for map \$odds[\$_]*(\$_+1), 0..\$#odds;
my \$err = sprintf "%.6f%%", 100*(\$exp-\$avg)/\$avg;
print "\$i inserts, \$err: avg=\$avg exp=\$exp bits set\n"
if  \$i =~ /^\d0*\$/;
# Update @odds to in preparation for the next value of \$i:
for my \$s (  reverse 0..\$#odds  ) {
\$odds[\$s+1] += \$odds[\$s]*(\$b-\$s-1)/\$b;
\$odds[\$s] *= (\$s+1)/\$b;
}
\$i++;
}
[download]```

\$i tracks the number of insertions done so far. \$odds[\$s] represents the odds of there being \$s+1 bits set in the hash (after \$i insertions). \$avg is an average of these values of \$s (1..@odds) weighted by the odds. But, more importantly, it is also the odds of getting a single-hash collision when inserting (after \$i insertions) except multiplied by \$bits (\$b). I multiple it and 1-exp(-\$i/\$b) by \$b to normalize to the expected number of set bits instead of the odds of a collision because humans have a much easier time identifying a number that is "close to 14" than a number that is "close to 14/2**32".

\$odds[-1] turns out to exactly match (successively) the values from birthday problem. For low numbers of \$inserts, this swamps the calculation of \$avg (the other terms just don't add up to a significant addition), which is part of why I was computing it for some values in my first reply. (Since you asked about that privately.)

I have yet to refresh my memory of the exact power series expansion of exp(\$x), so what follows is actually half guesses, but I'm pretty confident of them based on vague memory and observed behavior.

For large \$bits, 1-exp(-\$inserts/\$bits) ends up being close to 1/\$bits because 1-exp(-\$inserts/\$bits) expands (well, "can be expanded") to a power series where 1/\$bits is the first term and the next term depends on 1/\$bits**2 which is so much smaller that it doesn't matter much (and nor do any of the subsequent terms, even when added together).

On the other hand, for large values of \$inserts, 1-exp(-\$inserts/\$bits) is close to \$avg because the formula for \$avg matches the first \$inserts terms of the power series expansion.

I hope the simple code makes it easy for you to see how these calculations match the odds I described above. But don't hesitate to ask questions if the correspondence doesn't seem clear to you. Running the code shows how my calculations match roboticus' formula. Looking up (one of) the power series expansions for computing exp(\$x) should match the values being computed for @odds, though there might be some manipulation required to make the match apparent (based on previous times I've done such work decades ago).

- tye

With the 'cancelling out' you've performed on the constants in your ex_*() subs, I couldn't see the pattern by which they were derived.

Undoing that, I now see that they come (directly or not) from Pascal's Triangle.

I guess at some point I should review some teaching materials to understand why those constants are used here, but for now it is enough to know that I can now write a generic ex_*() function (generator). Thanks.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

In the iterative solution, we're accumulating f(x)=(1-e^(-x/N))^h over the range of x=0 .. NumSamples. That's a rough form of computing the definite integral of the expression. Integrating over three variables (x, N, h) would be a pain, so I treated N and h as constants.

So first, we multiply out our f(x) expression to remove the exponent. So using h as 2, we get:

```
f(x) = 1 - 2e^(-x/N) + e^(-2x/N)

integ(f(x)) = integ( 1 - 2e^(-x/N) + e^(-2x/N) )
= integ( 1 ) - 2*(integ(e^(-x/N))) + integ(e^(-2x/N))

since:
integ(1) = x + C
integ(e^(bx)) = (1/b)e^(bx) + C

integ(f(x)) = [ x+C ] + [ (-2N)e^(-x/N) + C ] + [ (-N/2)e^(-2x/N) + C
+]
[download]```

Computing a definite integral over a range is simply integ(f(x)) evaluated at the upper limit less the value evaluated at the lower limit. This causes the C terms to cancel.

Pascal's triangle comes out because we've got (a+b)^n, and when we multiply it out, we get the binomial expansion which is where the coefficients come into play.

One point I should mention: You don't have to use 0 as the lower bound. If you wanted the number of collisions you'd experience from sample A to sample B, just evaluate integ(f(B))-integ(f(A)). By using A=0 we compute the number of collisions for the entire run.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Re^2: [OT] The statistics of hashing.
by roboticus (Chancellor) on Apr 02, 2012 at 14:07 UTC

Oh, something I forgot to mention: I tried using a constant number of bits but varying the vector size/quantity to see how things scaled. In other words, I compared:

vec size# vectors
10,000 1
5,000 2
3,333 3
2,500 4
2,000 5
1,00010

I found more smaller vectors works better until the number of samples matches the number if bits in the smaller vector. Plotting the functions:

```(1-exp(-x/1000))^10
(1-exp(-x/2000))^5
(1-exp(-x/2500))^4
(1-exp(-x/3333))^3
1-exp(-x/10000)
[download]```

using a graphing calculator shows that's where the curves cross:

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Re^2: [OT] The statistics of hashing. (accuracy)
by tye (Sage) on Apr 02, 2012 at 21:55 UTC

I should review my transcendental function definitions as power series so I can recognize these correlations in future. I knew how to precisely measure the odds but had no clue how to search to see if there were any way to collapse it to some transcendental function (the exact measurement is quite impractical for very large numbers of inserts).

But having 1-exp(-\$inserts/\$slots) shown to me, I was able to play with that approximation to see how accurate it is. For this application, it is actually an impressive fit.

If 1-exp(-1/\$slots) starts out very close to 1/\$slots (the correct starting value for the odds of a single-bit collision), then the accuracy never decreases as \$inserts grows. For small \$slots, 1-exp(-1/\$slots) is not very accurate but 1-exp(-\$inserts/\$slots) slowly becomes more and more accurate as \$inserts increases.

For \$slots == 2**32, 1-exp(-\$inserts/\$slots) starts out (at 1==\$inserts) so accurate that a 'double' can't even measure the error. So you can consider the calculations based on it "perfect" rather than an approximation. :)

- tye

Log In?
 Username: Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://963012]
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (5)
As of 2018-03-19 07:45 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
When I think of a mole I think of:

Results (236 votes). Check out past polls.

Notices?