Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Re^3: [OT] The statistics of hashing. (4<10)

by tye (Sage)
on Apr 01, 2012 at 19:00 UTC ( #962910=note: print w/replies, xml ) Need Help??


in reply to Re^2: [OT] The statistics of hashing. (birthday)
in thread [OT] The statistics of hashing.

You appear to have not fully understood several parts of what I wrote. (Which is not an insult.)

And after 1.5 billion inserts, that calculation suggests that the odds of finding a value that doesn't match would be minuscule, and the "possible dups" count should be growing at almost the same rate as the new inserts are being tested.

That actually is not a valid conclusion from anything I wrote.

The odds of the next insert being a collision after about 1.5e9 inserts is about (1.5e9/2**32)**4 or under 1.5%, and 98.5% is not "miniscule" (and about 1/100th as fast is not "almost the same rate"). But that is also ignoring that there won't actually be 1.5e9 bits set. But you make no mention of how many bits are set which is how you'd get an accurate calculations so I'm curious if you perhaps didn't fully appreciate that point (as opposed to some other reason like it perhaps not being easy to query the number of bits set in your current run, etc.).

Probably much more important is that in some replies you hint that you are using 10 hashes not 4 hashes now.

I suspect that is the case in the numbers you show it your reply above, because you don't show collisions happening on the order of every 1e2 inserts. With 10 hashes, the above odds go to around 2.7e-5. And that is in the ballpark of what you report seeing.

If you need some help redoing the calculations for 10 hashes instead of 4, then please speak up. I was not exhaustive in my explanations nor expansions of formulae, so I would expect more clarification could be required. I can also understand you not taking the time to seriously study what I wrote yet given how the numbers seem so far off from your observations.

- tye        

  • Comment on Re^3: [OT] The statistics of hashing. (4<10)

Replies are listed 'Best First'.
Re^4: [OT] The statistics of hashing. (4<10)
by BrowserUk (Pope) on Apr 01, 2012 at 19:39 UTC
    You appear to have not fully understood several parts of what I wrote.

    You are right. In as much as, even re-reading your post in its entirety, I fail to understand what this table of odds represents, if not your calculation for the use of 4x32-bit hashes:

    Odds Inserts ---- ------- 1% ~28e6 10% ~45e6 50% ~65e6 90% ~83e6 99% ~96e6
    Probably much more important is that in some replies you hint that you are using 10 hashes not 4 hashes now.

    The (still) running program is using 10 hashes. But I am not yet convinced that the extra 6 hashes actually buys anything. See below.

    The program has currently covered just under half of the maximum 2**32 trials and has found only 8359 collisions. The last 10 of which were at positions:

    1933560856 1933561794 1933600874 1933619074 1933633177 1933665039 1933675171 1933675769 1933690902 1933708679

    I am reluctant to interrupt it before it reaches at least half way, and -- depending on the rate at which the rate of collisions increases, may let it run on to 3 billion. Hence, I cannot consider adding code to count the current bit set per vector as you described at this stage.

    10 hashes

    As described in my OP, I have (artificially) expanded the number of subhashes, by permuting the 4 natural 32-bit sub-divisions of the MD5 and XORing them to produce the other 6.

    But, as moritz suggests here, it is not at all clear that this process actually decreases the odds of collisions, because it doesn't increase the amount of information (he calls it entropy, but I'm uncomfortable with that term in this context), available.

    It is my intention to verify that during the second pass (to isolate how many of the collisions are actual dups and how many are false positives). At the same time, I will also perform the collision check using just the 4 natural hashes and that will tell me whether the permuted XORs buy me anything or not. But that will take another week or so.

    If you need some help redoing the calculations

    If you could distill your write-up to a formula, that would be helpful.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      If you could distill your write-up to a formula, that would be helpful.

      I already did that for the most useful and most accurate calculation (as well as for other parts):

      It is just $prod += $odds-$pro­d*$odds (and that last term doesn't matter for a very long time). Just to be clear, $odds is just $b1*$b2*$b3*$b4/2**(32*4) where $b1 is the number of bits set in the first vector.

      Looking at what is not explicit there, the only thing I find moderately difficult to determine is that you can start with $prod initialized to 0. The value being calculated is the probability (a value between 0 and 1) that there has been at least one 4-hash collision (due to random chance) so far.

      The upper bound on this value that I pre-calculated by discounting the impact of single-bit collisions is achieved by simply assuming $b1 through $b4 are all equal to the number of insertions done so far.

      This is substantially accurate for quite a while. Making it more accurate is seriously difficult other than by computing run-specific odds as is done by the two formulae I quoted above.

      I find it difficult to predict how tightly this upper bound is likely to fit against some run-specific odds when the number of inserts is a significant fraction of the number of bits per hash, as you are trying to do.

      Your 10 hash-values via XOR is not something that I can predict the results of, other than that I expect it to lie somewhere between the two very remote extremes of "no better than 4 independent hash values" and "just as good as 10 fully independent hash values". Your results so far seem to be somewhat close to what I would expect if you have 10 fully independent hash values per string.

      So my calculations completely ignored the "10 hashes" addition to portion of your description as I see no way to accurately address it.

      If I wanted to gain additional buckets, then I would compute the additional hash values using a (moderately) cryptographically sound process. For example, you could compute md5($string) and md5('BrowserUk'.$string) to get 8 hashes that should be substantially independent (based on prior difficult analysis of the MD5 algorithm). Though, your experiment may prove that your strategy is sufficient.

      - tye        

        If you could distill your write-up to a formula, that would be helpful. -- I already did that

        I guess that your idea of a formula and mine are different.

        where $b1 is the number of bits set in the first vector.

        The "number of bits set" in each vector at any given insert is entirely dependent upon not just how many, but what values have already been inserted.

        Which means that to use your description to calculate the probabilities, I would need to iterate the entire thing and count the number of bits set in each vector at each iteration. And then those "calculated probabilities" would only be applicable to that particular sequence of inserts.

        At which point, I might just as well just record the actual numbers of false positives, rather than calculate them.

        You'll no doubt come back and tell me that I've completely misunderstood you (again), and that everything I need to know is right there if I would only read you correctly.

        Thank you for your attempts to assist me.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://962910]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (4)
As of 2018-11-15 03:50 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My code is most likely broken because:
















    Results (180 votes). Check out past polls.

    Notices?