Your skill will accomplishwhat the force of many cannot PerlMonks

### Comment on

 Need Help??

That problem sure was fun! I've got a handle on it, but I didn't do the math rigorously, but it agrees with my simulations. Anyway, here's what I think is going on:

1) The density of 1 bits is:

```                -x/N
d = 1 - e

2) The probability of the next item colliding is:

```                   -x/N  h
p(hit) = (1 - e     )

If that's true, then I expect that the number of hits is just the integral of p(hit) over x in (0..records). (I haven't actually done the integration, but I don't foresee any hitches.) I found an on-line symbolic integrator that you can use to determine it.

Anyway, if you want to play with it, here's the simulator code along with the sample output:

Note: In the code below:

```\$cnt_rex          x      The sample number
\$VecSizeBits      N      Number of bits in the vector
\$num_vecs         h      Number of vectors/hash functions
@vectors                 Content of vectors
@cnt_bits                Number of bits set to 1 in vector
@cnt_colbits             Number of collisions:  I've broken it out to
[0] = Total document collisions
[1] = Count of documents colliding with 1 vec
+tor
[2] = Count of docs colliding with 2 vectors
...
\$first_coll              Index of first collision...useless, should re
+move
\$first_coll_f            Index of first false positive (all vectors co
+llide)
```\$ cat analyze_bloom.pl
#!/usr/bin/perl
#
# analyze_bloom.pl <NumBits> <NumVecs> <ln2(Samples)>
#
# Experiment with Bloom filter variant by BUk
#
use strict;
use warnings;
use autodie;
use Bit::Vector;
use Math::BigInt;

my \$VecSizeBits=shift or die "Missing # of bits!";
my \$num_vecs   =shift or die "missing # of vectors!";
my \$iterations =shift or die "missing iteration count!";
if (\$iterations<33) {
\$iterations = 1<<\$iterations;
}

my \$VecSize = 1 << \$VecSizeBits;

print "NumVecs=\$num_vecs\n";

my @col2numbits = (0, 1);
while (@col2numbits < 1<<\$num_vecs) {
@col2numbits = (@col2numbits, map { \$_+1 } @col2numbits);
}
my @cnt_colbits = (0) x (1+\$num_vecs);

my %report_cnts = map { (1<<\$_)=>0 } (8 .. 24);

my @vectors;
my @cnt_bits = (0) x \$num_vecs;
\$vectors[\$_] = Bit::Vector::new(0,\$VecSize) for 0 .. \$num_vecs-1;

{
my @HDR = (
q{                avg      avg   calc  },
q{BI  Samples   one bits pct 1s pct 1s },
q{-- ---------- -------- ------ ------ },
);
my \$t = int((\$num_vecs+1) * 8 - length(" # collisions "))/2;
\$HDR[0] .= "-"x\$t . " # collisions " . "-"x\$t;
\$HDR[1] .= "  all   ";
\$HDR[1] .= sprintf("   % 2u   ", \$_) for 1 .. \$num_vecs;
\$HDR[1] .= "   p(coll)";
\$HDR[2] .= ("-"x7 . " ") x (\$num_vecs+1) . "------------";
print join("\n", @HDR, "");
}

my \$first_coll=0;
my \$cnt_coll=0;
my \$first_coll_f=0;
my \$cnt_coll_f=0;
my \$cnt_rex=0;
while (++\$cnt_rex < \$iterations) {
my @h = map { int(rand(\$VecSize)) } 0 .. \$VecMask;
my \$bits=32;

my \$collision=0;
for my \$i (0 .. \$num_vecs-1) {
if (\$vectors[\$i]->bit_test(\$h[\$i])) {
\$collision |= 1<<\$i
}
else {
++\$cnt_bits[\$i]
}
\$vectors[\$i]->Bit_On(\$h[\$i]);
}
if (\$collision) {
++\$cnt_colbits[\$col2numbits[\$collision]];
++\$cnt_colbits[0];          # 'any' collision
++\$cnt_coll;
\$first_coll=\$cnt_rex        if !\$first_coll;
++\$cnt_coll_f;
\$first_coll_f=\$cnt_rex  if !\$first_coll_f;
}
}

if (exists \$report_cnts{\$cnt_rex}) {
my \$ttl_bits=0;
\$ttl_bits += \$_ for @cnt_bits;
my \$avg_bits = \$ttl_bits / \$num_vecs;
my \$avg_pct_full = 100.0 * \$ttl_bits / (\$num_vecs*\$VecSize);
my \$calc_pct_full = 100.0 * (1-exp(-\$cnt_rex/\$VecSize));
printf "% 2u %10u: % 7u %6.2f %6.2f",
\$VecSizeBits, \$cnt_rex, \$avg_bits, \$avg_pct_full, \$calc
+_pct_full;
print " ",
join(" ",
map {sprintf"% 7u",\$_} @cnt_colbits
);
my \$expected = (1-exp(-\$cnt_rex/\$VecSize))**\$num_vecs;
printf " %.10f", \$expected;
print "\n";
}
}

\$ ./analyze_bloom.pl 10  3  13
NumVecs=3
avg      avg   calc  --------- # collisions ---------
BI  Samples   one bits pct 1s pct 1s   all       1       2       3
+  p(coll)
-- ---------- -------- ------ ------ ------- ------- ------- ------- -
+-----------
10        256:     222  21.74  22.12      86      72      14       0 0
+.0108230772
10        512:     405  39.62  39.35     250     186      59       5 0
+.0609161842
10       1024:     647  63.22  63.21     699     351     265      83 0
+.2525804578
10       2048:     892  87.11  86.47    1711     494     677     540 0
+.6464623148
10       4096:    1012  98.83  98.17    3758     511    1000    2247 0
+.9460533270

\$ ./analyze_bloom.pl 12  5  15
NumVecs=5
avg      avg   calc  ----------------- # collisions --
+---------------
BI  Samples   one bits pct 1s pct 1s   all       1       2       3
+   4       5      p(coll)
-- ---------- -------- ------ ------ ------- ------- ------- ------- -
+------ ------- ------------
12        256:     248   6.06   6.06      37      36       1       0
+     0       0 0.0000008164
12        512:     479  11.70  11.75     138     114      23       1
+     0       0 0.0000223999
12       1024:     908  22.17  22.12     434     311     103      17
+     3       0 0.0005295634
12       2048:    1604  39.17  39.35    1296     634     447     170
+    44       1 0.0094309292
12       4096:    2572  62.81  63.21    3282     915    1038     788
+   443      98 0.1009251903
12       8192:    3542  86.47  86.47    7372     979    1392    1775
+  1968    1258 0.4833243641
12      16384:    4020  98.15  98.17   15564     980    1420    2052
+  3718    7394 0.9117155503

...roboticus

When your only tool is a hammer, all problems look like your thumb.

In reply to Re: [OT] The statistics of hashing. by roboticus
in thread [OT] The statistics of hashing. by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

• Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
• Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
• Read Where should I post X? if you're not absolutely sure you're posting in the right place.
• Posts may use any of the Perl Monks Approved HTML tags:
a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
• You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
 For: Use: & & < < > > [ [ ] ]
• Link using PerlMonks shortcuts! What shortcuts can I use for linking?

Create A New User
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (7)
As of 2017-06-24 06:26 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
How many monitors do you use while coding?

Results (556 votes). Check out past polls.