There's more than one way to do things PerlMonks

### Re: [OT] Statistics question.

by roboticus (Chancellor)
on Jan 30, 2013 at 05:10 UTC (note #1015962)

in reply to [OT] Statistics question.

I'm pretty sure that this is the same math we played with in Re: [OT] The statistics of hashing.. I don't know how to compute the standard deviation, though. I'll have to do a bit of reading and see what I can come up with. But judging from the results from that thread, I'd expect there to be minimal overlap between two sets of 1e6 bits in 4e9 possibilities.

Update: I found my ana_2.pl script, but since I'm on a 32-bit machine, I couldn't run it with 2^32 bit vectors. (I really need to stand up a 64-bit OS and perl one day.) But I ran it with a million samples in a pair of vectors of various sizes (2^24, 2^26, 2^28, 2^30 and 2^31) and it looks like collisions shouldn't be very frequent, judging from the progression:

```
$ ./ana_2.pl 24 2 1000000
N=16777216, V=2, X=1000000 integral(1000000)=25166956.740071, integral(0)=25165824
Expected collisions: 1132.74007095024
$ ./ana_2.pl 26 2 1000000
N=67108864, V=2, X=1000000 integral(1000000)=100663369.193409, integral(0)=100663296
Expected collisions: 73.1934093236923
$ ./ana_2.pl 28 2 1000000
N=268435456, V=2, X=1000000 integral(1000000)=402653188.613027, integral(0)=402653184
Expected collisions: 4.61302703619003
$ ./ana_2.pl 30 2 1000000
N=1073741824, V=2, X=1000000 integral(1000000)=1610612736.28892, integral(0)=1610612736
Expected collisions: 0.288918733596802
$ ./ana_2.pl 31 2 1000000
N=2147483648, V=2, X=1000000 integral(1000000)=3221225472.07225, integral(0)=3221225472
Expected collisions: 0.0722546577453613
```
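For what it's worth, the progression above is consistent with a rough closed-form estimate of `X**(V+1) / ((V+1) * N**V)`. (This is my own inference from the printed numbers, not a formula taken from ana_2.pl itself, so treat it as a back-of-the-envelope check only.)

```perl
use strict;
use warnings;

# Rough estimate (inferred, not ana_2.pl's exact integral):
#   expected collisions ~ X**(V+1) / ((V+1) * N**V)
my ($V, $X) = (2, 1_000_000);
for my $bits (24, 26, 28, 30, 31) {
    my $N = 2**$bits;
    printf "2^%2d: ~%g\n", $bits, $X**($V + 1) / (($V + 1) * $N**$V);
}
```

It slightly overestimates at the smaller vector sizes (~1184 vs. the printed 1132.7 at 2^24) but agrees to three figures by 2^30 (~0.2891 vs. 0.2889).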

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Replies are listed 'Best First'.
Re^2: [OT] Statistics question.
by BrowserUk (Pope) on Jan 30, 2013 at 12:02 UTC

I could see that there was some relationship between the two, but (despite my promise), I never got comfortable enough with the math from that thread to let me see how to apply it to this (unrelated) task.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

I know what you mean. Prob & Stat are always confusing to me, and I always have to slog through some reading before I feel comfortable with writing anything about it.

I found out how to compute the standard deviation, but either I'm doing it wrong (most likely), or it's pretty useless because of loss of precision. (For the expected value, we subtract one huge number from another and magically get the result. The same thing seems to happen with the standard deviation, but the numbers are *much* larger, so for interesting problem sizes the interesting bits get pushed off the right end of the float.)

I don't have anything useful for the standard deviation yet...

Update: I like the idea of knowing the standard deviation, though, as it may make it possible to estimate the number of "positive matches" from your document & bloom filter project. ("Hmmm, we expect 200 false positives with a standard deviation of 20, but we're seeing 400, so we've probably got about 180 to 220-ish matches.")
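The arithmetic in that parenthetical is just subtraction plus an uncertainty band; a trivial sketch (the figures 400, 200 and 20 are the hypothetical ones from the example above):

```perl
use strict;
use warnings;

# Hypothetical figures from the example above
my ($observed, $expected_fp, $sd) = (400, 200, 20);
my $est = $observed - $expected_fp;        # estimated true matches
printf "Estimated matches: %d (roughly %d to %d)\n",
    $est, $est - $sd, $est + $sd;          # 200 (roughly 180 to 220)
```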

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Are your *much* larger numbers perfect squares? If so, you can avoid the problem by factoring into the sum and difference of their square roots.
Bill
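Bill's point is easy to demonstrate: for `a*a - b*b`, the squares each need more bits than a double's 53-bit mantissa can hold, so the direct subtraction loses the low-order bits, while `(a - b) * (a + b)` works with small, exactly representable factors. (Illustrative numbers of my own choosing, not from the thread.)

```perl
use strict;
use warnings;

my $a = 1e8 + 1;
my $b = 1e8;
my $direct   = $a * $a - $b * $b;      # squares are ~1e16, beyond 53-bit exactness
my $factored = ($a - $b) * ($a + $b);  # small, exactly representable factors
printf "direct:   %.0f\n", $direct;    # 200000000 (low bit lost)
printf "factored: %.0f\n", $factored;  # 200000001 (exact)
```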
> Update: I like the idea of knowing the standard deviation, ...

For this problem -- testing a sparse bitvector implementation -- moritz' simplified calculation appears to be 'good enough' for my purpose. I'm not capable of assessing how applicable it would be to the math you came up with for my multi-vector hashing (i.e. weird bloom filter) project.

However, there is one way that these two projects may become connected: if my sparse bitvector implementation proves to be sufficiently speedy, I may recode the multi-vector hashing algorithm to use it, because: a) it would greatly reduce the memory requirement; b) it would open up the possibility of increasing the discrimination by using more than 10 vectors. (But that's a project for another day :)
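For anyone curious why memory drops when set bits are rare, here is a minimal hash-backed sketch of a sparse bitvector. This is emphatically not BrowserUk's implementation (which is presumably far more compact and faster); it just illustrates that storage scales with the number of set bits, not the vector length:

```perl
use strict;
use warnings;

package SparseBitVec;
sub new   { bless { bits => {} }, shift }    # only the set positions are stored
sub set   { $_[0]{bits}{ $_[1] } = 1 }
sub test  { exists $_[0]{bits}{ $_[1] } ? 1 : 0 }
sub count { scalar keys %{ $_[0]{bits} } }

package main;
my $v = SparseBitVec->new;
$v->set($_) for 3, 3, 2**32 - 1;             # huge indices cost nothing extra
print $v->test(3), $v->test(5), " ", $v->count, "\n";   # prints "10 2"
```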

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: [OT] Statistics question.
by pvaldes (Chaplain) on Jan 30, 2013 at 19:25 UTC

The overlap depends a lot on the size of the sample. Finding the standard deviation of each population is easy:

```perl
use strict;
use warnings;
use Statistics::Basic qw(stddev);

# generate a lot of random numbers, by your preferred method, e.g.:
my @listofnumbers1;
for (1 .. 1000) {
    push @listofnumbers1, int rand 100_000_000;
}
my $stddev1 = stddev(@listofnumbers1);

# repeat the process for the second population
my @listofnumbers2;
for (1 .. 1000) {
    push @listofnumbers2, int rand 100_000_000;
}
my $stddev2 = stddev(@listofnumbers2);
```

Node Type: note [id://1015962]