BrowserUk has asked for the
wisdom of the Perl Monks concerning the following question:
I'm testing a piece of Perl code (tenuous justification :)) that has billions of possibilities and is thus impractical to test exhaustively.
Therefore I want to test statistically, and to that end my question:
Given a (near) perfect PRNG, if I pick two sets of 1e6 random 32bit unsigned integers:
 How much overlap, (how many dups), should I expect between the two sets?
 With what standard deviation?
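One way to get a feel for the answer is a small Monte Carlo sketch (scaled down here to a 2**16 space and samples of 1,000 so it runs quickly; the counting logic is the same for 1e6 values in a 32-bit space). Note that the hashes deduplicate each sample, so this counts set overlap, which very slightly undercounts versus raw draws:

```perl
use strict;
use warnings;

# Scaled-down sketch: draw two samples of $n values from a space of
# $space possibilities and count how many values occur in both.
my $n     = 1_000;
my $space = 2**16;

my %first;
$first{ int(rand($space)) } = 1 for 1 .. $n;

my %second;
$second{ int(rand($space)) } = 1 for 1 .. $n;

# Overlap: keys of the second sample that also appear in the first.
my $dups = grep { $first{$_} } keys %second;
print "overlap: $dups\n";   # expectation is roughly $n**2 / $space, here ~15
```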
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: [OT] Statistics question. by roboticus (Canon) on Jan 30, 2013 at 05:10 UTC 
BrowserUk:
I'm pretty sure that this is the same math we played with in Re: [OT] The statistics of hashing. I don't know how to compute the standard deviation, though. I'll have to do a bit of reading and see what I can come up with. But judging from the results from that thread, I'd expect there to be minimal overlap between two sets of 1e6 values drawn from 4e9 possibilities.
Update: I found my ana_2.pl script, but since I'm on a 32bit machine, I couldn't run it with 2^32 bit vectors. (I really need to stand up a 64bit OS and perl one day.) But I ran it with a million samples in a pair of vectors of various sizes (2^24, 2^26, 2^28, 2^30 and 2^31) and it looks like collisions shouldn't be very frequent, judging from the progression:
$ ./ana_2.pl 24 2 1000000
N=16777216, V=2, X=1000000 integral(1000000)=25166956.740071, integral(0)=25165824
Expected collisions: 1132.74007095024
$ ./ana_2.pl 26 2 1000000
N=67108864, V=2, X=1000000 integral(1000000)=100663369.193409, integral(0)=100663296
Expected collisions: 73.1934093236923
$ ./ana_2.pl 28 2 1000000
N=268435456, V=2, X=1000000 integral(1000000)=402653188.613027, integral(0)=402653184
Expected collisions: 4.61302703619003
$ ./ana_2.pl 30 2 1000000
N=1073741824, V=2, X=1000000 integral(1000000)=1610612736.28892, integral(0)=1610612736
Expected collisions: 0.288918733596802
$ ./ana_2.pl 31 2 1000000
N=2147483648, V=2, X=1000000 integral(1000000)=3221225472.07225, integral(0)=3221225472
Expected collisions: 0.0722546577453613
...roboticus
When your only tool is a hammer, all problems look like your thumb.

I could see that there was some relationship between the two, but, despite my promise, I never got comfortable enough with the math from that thread to see how to apply it to this (unrelated) task.

BrowserUk:
I know what you mean. Prob & Stat are always confusing to me, and I always have to slog through some reading before I feel comfortable with writing anything about it.
I found out how to compute the standard deviation, but either I'm doing it wrong (most likely) or it's pretty useless because of loss of precision: for the expected value, we subtract one huge number from another and magically get the result. The same seems to occur with the standard deviation, but the numbers are *much* larger, so for interesting problem sizes the interesting bits get pushed off the right end of the float.
I don't have anything useful for the standard deviation yet...
Update: I like the idea of knowing the standard deviation, though, as it may make it possible to estimate the number of "positive matches" from your document & bloom filter project. ("Hmmm, we expect 200 false positives with a standard deviation of 20, but we're seeing 400, so we've probably got about 180 to 220ish matches.")
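That back-of-envelope correction can be sketched in a few lines. The 200 and 400 figures are just the hypothetical numbers from the example above, and the spread here uses the Poisson approximation (sqrt of the expectation) rather than an assumed value:

```perl
use strict;
use warnings;

my $expected_fp = 200;                 # expected false positives (hypothetical)
my $observed    = 400;                 # positives actually seen (hypothetical)
my $stddev      = sqrt($expected_fp);  # Poisson approximation: ~14.1

# Subtract the expected false positives to estimate the true matches.
my $true_matches = $observed - $expected_fp;
printf "estimated true matches: %d (give or take ~%.0f)\n",
       $true_matches, $stddev;
# prints: estimated true matches: 200 (give or take ~14)
```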
...roboticus



 
use strict;
use warnings;
use Statistics::Basic qw(stddev);

# generate a lot of random numbers, by the preferred way, i.e.:
my @listofnumbers1;
for (1..1000) {
    push @listofnumbers1, int(rand(100_000_000));
}
my $stddev1 = stddev(@listofnumbers1);

# repeat the process for a second list
my @listofnumbers2;
for (1..1000) {
    push @listofnumbers2, int(rand(100_000_000));
}
my $stddev2 = stddev(@listofnumbers2);
Re: [OT] Statistics question. by moritz (Cardinal) on Jan 30, 2013 at 09:18 UTC 
I'll do a small simplification in order to use a much simpler model: I assume that we have one list (duplicates allowed) and one set (no duplicates allowed).
Then for each member of the list, the probability of having a match in the set is P(1) = 1e6/2**32.
Since we've assumed a list, all the probabilities of having matches are independent, and the expectation value is simply 1e6 * P(1) = 1e6 * 1e6/2**32 = 232.83.
If the number of matches follows a Poisson distribution (and I suspect it does in this example), then the standard deviation is simply the square root of the expectation value, so about 15.3.
It is hard for me to estimate how big an error I've made by this simplification; I'll update the node if I get an idea of how to estimate it.
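moritz's estimate above reduces to a couple of lines of Perl; this is just the arithmetic from his post (1e6 samples, 2**32 possible values), nothing more:

```perl
use strict;
use warnings;

my $n     = 1e6;      # size of each random sample
my $space = 2**32;    # number of distinct 32-bit values

# Expected number of matches under the independence approximation:
my $expected = $n * $n / $space;

# Poisson approximation: the standard deviation is sqrt of the mean.
my $stddev = sqrt($expected);

printf "expected matches: %.2f, stddev: %.2f\n", $expected, $stddev;
# prints: expected matches: 232.83, stddev: 15.26
```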

C:\test>bitvec2 N=100
100
Mean: 230.95 stddev:14.75
I'm doing a run of 1000 samples, and if that shows no surprises, I'll be taking your figures as read and basing my testing upon it.
Thank you.
Update: The 1000 sample run marries well (good enough): C:\test>bitvec2 N=1000
1000
Mean: 233.39 stddev:15.79

