in reply to
Re^2: [OT] The statistics of hashing.

in thread [OT] The statistics of hashing.

Thank you roboticus.

I ran your algorithm against the results of my 60-hour run, and the correlation is simply stunning!

`#! perl -slw
use strict;
sub ex_10 {
my ($N, $X) = @_;
return -$N/10*exp(-10*$X/$N)
+ 10*$N/9 *exp( -9*$X/$N)
- 45*$N/8 *exp( -8*$X/$N)
+ 120*$N/7 *exp( -7*$X/$N)
- 35*$N *exp( -6*$X/$N)
+ 252*$N/5 *exp( -5*$X/$N)
- 105*$N/2 *exp( -4*$X/$N)
+ 40*$N *exp( -3*$X/$N)
- 45*$N/2 *exp( -2*$X/$N)
+ 10*$N *exp( -1*$X/$N)
+ $X;
}
my $exp10_0 = ex_10( 2**32, 0 );
open I, '<', 'L25B32V10I32S1.posns' or die $!;
while( <I> ) {
my( $inserts, $collisions ) = split;
my $exp3 = ex_10( 2**32, $inserts ) - $exp10_0;
printf "Insert %10d: Predicted %6d Actual: %6d Delta: %+.3f%%\n"
+,
$inserts, $exp3, $collisions, ( $collisions - $exp3 ) / $exp3
+* 100;
}
__END__
C:\test>rbtcs-form-verify
Insert 779967210: Predicted 1 Actual: 1 Delta: -18.051%
Insert 782382025: Predicted 1 Actual: 2 Delta: +58.813%
Insert 830840395: Predicted 2 Actual: 3 Delta: +29.292%
Insert 882115420: Predicted 4 Actual: 4 Delta: -5.961%
Insert 883031614: Predicted 4 Actual: 5 Delta: +16.325%
Insert 897571477: Predicted 5 Actual: 6 Delta: +18.390%
Insert 923155269: Predicted 6 Actual: 7 Delta: +4.086%
Insert 948108745: Predicted 8 Actual: 8 Delta: -8.996%
Insert 954455113: Predicted 9 Actual: 9 Delta: -4.244%
Insert 967783959: Predicted 10 Actual: 10 Delta: -7.404%
Insert 988381482: Predicted 13 Actual: 11 Delta: -17.487%
Insert 992691311: Predicted 13 Actual: 12 Delta: -13.814%
Insert 995935158: Predicted 14 Actual: 13 Delta: -9.624%
Insert 1011301141: Predicted 16 Actual: 14 Delta: -16.457%
Insert 1013742872: Predicted 17 Actual: 15 Delta: -12.616%
Insert 1022258193: Predicted 18 Actual: 16 Delta: -14.242%
Insert 1031874023: Predicted 20 Actual: 17 Delta: -16.989%
Insert 1034026887: Predicted 20 Actual: 18 Delta: -13.909%
Insert 1036254774: Predicted 21 Actual: 19 Delta: -11.051%
Insert 1037064360: Predicted 21 Actual: 20 Delta: -7.093%
Insert 1037193945: Predicted 21 Actual: 21 Delta: -2.569%
Insert 1037309710: Predicted 21 Actual: 22 Delta: +1.957%
...( ~ 52000 ellided }
Insert 2375752842: Predicted 52209 Actual: 52277 Delta: +0.130%
Insert 2375755671: Predicted 52209 Actual: 52278 Delta: +0.131%
Insert 2375756509: Predicted 52209 Actual: 52279 Delta: +0.133%
Insert 2375760656: Predicted 52210 Actual: 52280 Delta: +0.133%
Insert 2375763928: Predicted 52211 Actual: 52281 Delta: +0.134%
Insert 2375785238: Predicted 52215 Actual: 52282 Delta: +0.128%
Insert 2375788721: Predicted 52215 Actual: 52283 Delta: +0.128%
Insert 2375789878: Predicted 52216 Actual: 52284 Delta: +0.130%
Insert 2375790896: Predicted 52216 Actual: 52285 Delta: +0.131%
Insert 2375798283: Predicted 52217 Actual: 52286 Delta: +0.131%
`

And that still includes the possibility -- though I believe it to be remote -- that there are 1 or more actual dups in there.

My only regret now is that I wish I'd allowed the run to go to the 3/4 point instead of stopping it half way. The number of false positive would still have been easily manageable for the second pass:

`C:\test>rbtcs-bloom-probb 32 10 30e8
N=4294967296, V=10, X=30e8 integral(30e8)=12580199467.653, integral(0)
+=12579822861.8159
Expected collisions: 376605.837186813
`

That is an order of magnitude less that my own crude attempt was predicting, hence why I stopped the run.

The only way I can repay you is to assure you that I will do my very best to try and understand the formula -- which I do not currently.

And of course, give you credit, when they world comes knocking at my door for my eminently patentable -- at least if you take Apple as your guide -- algorithm :)

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".