|Just another Perl shrine|
Re^4: [OT] The statistics of hashing. (4<10)by BrowserUk (Pope)
|on Apr 01, 2012 at 19:39 UTC||Need Help??|
You appear to have not fully understood several parts of what I wrote.
You are right. In as much as, even re-reading your post in its entirety, I fail to understand what this table of odds represents, if not your calculation for the use of 4x32-bit hashes:
Probably much more important is that in some replies you hint that you are using 10 hashes not 4 hashes now.
The (still) running program is using 10 hashes. But I am not yet convinced that the extra 6 hashes actually buys anything. See below.
The program has currently covered just under half of the maximum 2**32 trials and has found only 8359 collisions. The last 10 of which were at positions:
I am reluctant to interrupt it before it reaches at least half way, and -- depending on the rate at which the rate of collisions increases, may let it run on to 3 billion. Hence, I cannot consider adding code to count the current bit set per vector as you described at this stage.
As described in my OP, I have (artificially) expanded the number of subhashes, by permuting the 4 natural 32-bit sub-divisions of the MD5 and XORing them to produce the other 6.
But, as moritz suggests here, it is not at all clear that this process actually decreases the odds of collisions, because it doesn't increase the amount of information (he calls it entropy, but I'm uncomfortable with that term in this context), available.
It is my intention to verify that during the second pass (to isolate how many of the collisions are actual dups and how many are false positives). At the same time, I will also perform the collision check using just the 4 natural hashes and that will tell me whether the permuted XORs buy me anything or not. But that will take another week or so.
If you need some help redoing the calculations
If you could distill your write-up to a formula, that would be helpful.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.