Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re: [OT] The statistics of hashing.

by flexvault (Monsignor)
on Apr 01, 2012 at 13:52 UTC ( [id://962884]=note: print w/replies, xml ) Need Help??


in reply to [OT] The statistics of hashing.

BrowserUk,

This is an observation sometime from 2005-2008. My handle on PM is because of working on this project at the time I joined, but I have little to due with the project now. It was a project to backup Terabytes of encrypted files to geographically distant off-site computer centers using 'rsync' in close to 'real-time'. The client's goal was that 'rsync' not work with the real files, but only with encrypted files which sort of negates the value of using 'rsync'. While we looked at several methods for getting a value for naming the encrypted files, we ran tests of several different sets of 1 million encrypted files and the following table is an average.

Technique Duplicates MD5 5,701 crc32 26,323 SimCRC64 17 Simple 8 byte ^ of file (very fast +)

I can't say these results would be the same anywhere else, but when I started testing, I was sure that MD5 would be the most unique, and that's why we ran the tests several times with different file sets. I don't know if the encryption algorithm affected the results and don't know it the unencrypted files would give similiar results or not. I just didn't expect the results!

I only bring this up because I think you started out with the same assumption about MD5, but maybe there is a better starting point and maybe not.

Thank you

"Well done is better than well said." - Benjamin Franklin

Replies are listed 'Best First'.
Re^2: [OT] The statistics of hashing.
by BrowserUk (Patriarch) on Apr 01, 2012 at 16:23 UTC
    I only bring this up because I think you started out with the same assumption about MD5, ...

    That is a very good point.

    The distribution of any (perfect(*)) hashing algorithm will only be fully utilised if the inputs can take on all possible values for any given input length.

    For example, ASCII or ISO text, certain parts of the input range -- many of the control characters <32; and characters >126 -- will not appear in the inputs, so for any given length, the range of inputs to the hashing algorithm is biased. It stand to reason that the outputs will similarly biased and the full range of the hash values will not be produced.

    It is also the case that for any given input language, there are some character combinations that will never appear in normal text -- eg. ZXZXZXZXZXZXZXZXZXZ -- hence the inputs, and thus outputs will be further biased.

    For this particular part of the exercise, I'm more concerned with the collision detection mechanism than the vagaries of the hashing.

    But thank you for raising the issue. I may well contact you later for the details of your SimCRC64 algorithm.

    (*)Note: I am aware that MD5 is probably far from perfect.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      BrowserUk,

        ...ASCII or ISO text, certain parts of the input range -- many of the control characters <32; and characters >126 -- will not appear...
      I hadn't thought about it but since the files were encrypted, more character sequences might appear.

      As far as the 'SimCRC64' algorithm, it was just a simple CRC routine. I don't have the actual code, but it was something like this.

      my $crc = crc64(\$output); ## Add crc check to output data sub crc64 ## Add crc check to output data { my $data = shift; my $crc = ""; my $len = length($$data); my $pos + = 0; while ( $pos < $len ) { $crc = $crc ^ ( substr($$data,$pos,8) ); $pos += 8; } return ( $crc ); }

      It was part of the test to see what the worst case would be. As I previously said, I didn't expect the results.

      Thank you

      "Well done is better than well said." - Benjamin Franklin

        I don't have the actual code, but it was something like this.

        Thanks for posting that.

        However, for my purposes -- deterministically generating a randomly distributed value -- it doesn't work well. With the MD5, every single bit change in the input generates an average of 64-bits change in the output -- which is near perfect.

        But this crc algorithm -- which was probably great for your purposes -- only generates a single bit change in the output for each single bit change in the input:


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://962884]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (3)
As of 2024-04-19 15:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found