Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Re^5: [OT] The statistics of hashing. (formula)

by tye (Sage)
on Apr 01, 2012 at 21:32 UTC ( #962935=note: print w/replies, xml ) Need Help??


in reply to Re^4: [OT] The statistics of hashing. (4<10)
in thread [OT] The statistics of hashing.

If you could distill your write-up to a formula, that would be helpful.

I already did that for the most useful and most accurate calculation (as well as for other parts):

It is just $prod += $odds-$pro­d*$odds (and that last term doesn't matter for a very long time). Just to be clear, $odds is just $b1*$b2*$b3*$b4/2**(32*4) where $b1 is the number of bits set in the first vector.

Looking at what is not explicit there, the only thing I find moderately difficult to determine is that you can start with $prod initialized to 0. The value being calculated is the probability (a value between 0 and 1) that there has been at least one 4-hash collision (due to random chance) so far.

The upper bound on this value that I pre-calculated by discounting the impact of single-bit collisions is achieved by simply assuming $b1 through $b4 are all equal to the number of insertions done so far.

This is substantially accurate for quite a while. Making it more accurate is seriously difficult other than by computing run-specific odds as is done by the two formulae I quoted above.

I find it difficult to predict how tightly this upper bound is likely to fit against some run-specific odds when the number of inserts is a significant fraction of the number of bits per hash, as you are trying to do.

Your 10 hash-values via XOR is not something that I can predict the results of, other than that I expect it to lie somewhere between the two very remote extremes of "no better than 4 independent hash values" and "just as good as 10 fully independent hash values". Your results so far seem to be somewhat close to what I would expect if you have 10 fully independent hash values per string.

So my calculations completely ignored the "10 hashes" addition to portion of your description as I see no way to accurately address it.

If I wanted to gain additional buckets, then I would compute the additional hash values using a (moderately) cryptographically sound process. For example, you could compute md5($string) and md5('BrowserUk'.$string) to get 8 hashes that should be substantially independent (based on prior difficult analysis of the MD5 algorithm). Though, your experiment may prove that your strategy is sufficient.

- tye        

  • Comment on Re^5: [OT] The statistics of hashing. (formula)

Replies are listed 'Best First'.
Re^6: [OT] The statistics of hashing. (formula)
by BrowserUk (Pope) on Apr 01, 2012 at 21:56 UTC
    If you could distill your write-up to a formula, that would be helpful. -- I already did that

    I guess that your idea of a formula and mine are different.

    where $b1 is the number of bits set in the first vector.

    The "number of bits set" in each vector at any given insert is entirely dependent upon not just how many, but what values have already been inserted.

    Which means that to use your description to calculate the probabilities, I would need to iterate the entire thing and count the number of bits set in each vector at each iteration. And then those "calculated probabilities" would only be applicable to that particular sequence of inserts.

    At which point, I might just as well just record the actual numbers of false positives, rather than calculate them.

    You'll no doubt come back and tell me that I've completely misunderstood you (again), and that everything I need to know is right there if I would only read you correctly.

    Thank you for your attempts to assist me.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      You'll no doubt come back and tell me that I've completely misunderstood you (again)

      I never said that you completely misunderstood me. But I'll take your apparent lack of interest and leave it at that.

      - tye        

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://962935]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (5)
As of 2019-10-18 01:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Notices?