Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Pure perl Jenkins 32 bit Hash

by huck (Prior)
on Nov 16, 2017 at 13:00 UTC ( [id://1203588]=perlquestion: print w/replies, xml ) Need Help??

huck has asked for the wisdom of the Perl Monks concerning the following question:

Below is a pure perl version of the lookup2 hash by Bob Jenkins as talked about at A Hash Function for Hash Table Lookup.

Edit: This is an updated version, in particular the mix4 mentioned at Re: Pure perl Jenkins 32 bit Hash is now called mix4x and may be discarded. Also there is code to select the proper version based on $Config{ivsize} and a test call.

{ # has use integer/bytes use integer; use bytes; # http://www.perlmonks.org/?node_id=315881 # http://burtleburtle.net/bob/c/lookup2.c # http://burtleburtle.net/bob/hash/doobs.html # http://search.cpan.org/~shlomif/Digest-JHash/lib/Digest/JHash.pm # http://cpansearch.perl.org/src/SHLOMIF/Digest-JHash-0.10/JHash.xs */ + use constant GOLDEN_RATIO => 0x9e3779b9; use constant A => 0; use constant B => 1; use constant C => 2; use constant FFFFFFFF => 0xffffffff; use constant KEY => 0; use constant INITHASH => 1; sub mix4 ($$$) { # 32bit version # per http://www.perlmonks.org/?node_id=1203705 this is a revised 32bi +t under 'use integer'; $_[A] -= $_[B]; $_[A] -= $_[C]; { no integer; $_[A] ^= ($_[C]>>13) +; } $_[B] -= $_[C]; $_[B] -= $_[A]; { no integer; $_[B] ^= ($_[A]<< 8) +; } $_[C] -= $_[A]; $_[C] -= $_[B]; { no integer; $_[C] ^= ($_[B]>>13) +; } $_[A] -= $_[B]; $_[A] -= $_[C]; { no integer; $_[A] ^= ($_[C]>>12) +; } $_[B] -= $_[C]; $_[B] -= $_[A]; { no integer; $_[B] ^= ($_[A]<<16) +; } $_[C] -= $_[A]; $_[C] -= $_[B]; { no integer; $_[C] ^= ($_[B]>> 5) +; } $_[A] -= $_[B]; $_[A] -= $_[C]; { no integer; $_[A] ^= ($_[C]>> 3) +; } $_[B] -= $_[C]; $_[B] -= $_[A]; { no integer; $_[B] ^= ($_[A]<<10) +; } $_[C] -= $_[A]; $_[C] -= $_[B]; { no integer; $_[C] ^= ($_[B]>>15) +; } } sub mix4x ($$$) { # per http://www.perlmonks.org/?node_id=1203705 this is wrong $_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] ^= ($_[C]>>13); $_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] ^= ($_[A]<< 8); $_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] ^= ($_[B]>>13); $_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] ^= ($_[C]>>12); $_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] ^= ($_[A]<<16); $_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] ^= ($_[B]>> 5); $_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] ^= ($_[C]>> 3); $_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] ^= ($_[A]<<10); $_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] ^= ($_[B]>>15); } sub mix8 ($$$) { # 64bit version $_[A] &= FFFFFFFF; $_[B] &= FFFFFFFF; $_[C] &= FFFFFFFF; $_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] = ( $_[A] ^ ($_[C]>>13) ) & + FFFFFFFF; $_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] = ( $_[B] ^ ($_[A]<< 8) ) & + FFFFFFFF; $_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] = ( $_[C] ^ ($_[B]>>13) ) & + FFFFFFFF; $_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] = ( $_[A] ^ ($_[C]>>12) ) & + FFFFFFFF; $_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] = ( $_[B] ^ ($_[A]<<16) ) & + FFFFFFFF; $_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] = ( $_[C] ^ ($_[B]>> 5) ) & + FFFFFFFF; $_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] = ( $_[A] ^ ($_[C]>> 3) ) & + FFFFFFFF; $_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] = ( $_[B] ^ ($_[A]<<10) ) & + FFFFFFFF; $_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] = ( $_[C] ^ ($_[B]>>15) ) & + FFFFFFFF; } sub jhash_pp_hex { my ($a, $b, $c) = ( GOLDEN_RATIO, GOLDEN_RATIO, $_[INITHASH] ); my ($p, $length) = (0, length $_[KEY]); my $len=$length; my($x,$y,$z); while ($len>=12) { ($x,$y,$z) = unpack 'LLL', substr($_[KEY], $p, 12); $a+=$x;$b+=$y;$c+=$z; mix($a, $b, $c); $p += 12; $len-=12; } # even if len==0 we need another round to mix in the length ($x,$y,$z) = unpack 'LLL', substr($_[KEY] . (chr(0)x12), $p, 1 +2); $z<<=8; # /* the first byte of c is reserved for the length * +/ $z+=$length; $a+=$x;$b+=$y;$c+=$z; mix($a, $b, $c); my $hex = unpack("H*", pack("N", $c)); return $hex; } # jhash_pp_hex use Config; if ( $Config{ivsize} == 4 ) { *main::mix=*main::mix4; } else { *main::mix=*main::mix8; } } # has use integer/bytes print jhash_pp_hex('abcdef',0)."\n";

I had a situation where i could not use Digest::JHash because i did not have access to a compiler.

I needed to hash filenames for a filetracking program using a mysql database. Rather than carry the filename in 4+ tables i use a unique fnid assigned in a fnid table that holds the filename in a text bucket.

As you cannot really index a text field i was originally using md5 on the filename as a key. I knew there can be collisions but that to select a single filename SELECT fnid FROM fnid WHERE fnmd5=? and fn=? would do ok if fnmd5 was a index.

But md5 in hex is 32 chars, and that was real big, bigger than i needed. So i searched for 32 bit hashs, found the jenkins variants and found Re: Fast string hash in portable perl? [DO NOT USE!] by BrowserUk so i decided to try that. My disappointment will be described in a separate reply thread about the history below [history] Pure perl Jenkins 32 bit Hash.

This also calls into question what to do about Digest::JHash, again i will use a separate reply thread [Digest::JHash problem] Pure perl Jenkins 32 bit Hash to talk about that.

Overall i am happy with this. I realize there are newer jenkins and other 32bit hashs, but this will do. It reduced the phpmyadmin reported size of the fnid table by 50%. It is fast enough for my needs, it seems to be the hash perl uses for its hashs.

Replies are listed 'Best First'.
[history] Pure perl Jenkins 32 bit Hash
by huck (Prior) on Nov 16, 2017 at 13:01 UTC

    As mentioned in the OP i needed a 32bit hash and searched around. I decided id show how i ended up at my solution as well.

    At the bottom of this post are 4 source files. Assume the perl program is called hashtest1.pl

    When i found Re: Fast string hash in portable perl? [DO NOT USE!] i decided to try it. I was rather dismayed, Jenkins was supposed to be better than this. (Hits represents the number of filenames in the bucket. )

[Digest::JHash problem] Pure perl Jenkins 32 bit Hash
by huck (Prior) on Nov 16, 2017 at 13:02 UTC

    So it is clear that Digest::JHash has a problem in that it doesnt faithfully reproduce lookup2.c like it thinks it does. I think i should report my findings to them.

    But should they "fix" it? If they do, the new version while now returning the results lookup2.c does will no longer always return the same results 0.10 did. This would be a huge problem to users of the module, some of their hashs have now changed.

    In thinking about it, i dont think the negative rather than positive additions of the chars makes it "less hashy", just differnt from lookup2.c.

    What would you do if faced with this dilemma?

Re: Pure perl Jenkins 32 bit Hash
by Anonymous Monk on Nov 17, 2017 at 19:52 UTC

    Are you sure the 32-bit mix4 is correct? Doing low-level unsigned arithmetic can be awkward in perl.

    ... $a -= $b; $a -= $c; no integer; $a ^= ($c >> 13); use integer; ...

      Are you sure the 32-bit mix4 is correct?

      Guess not

      $Config{api_versionstring}:5.14.0 $Config{use64bitint}: $Config{ivsize}:4 $Config{byteorder}:1234 $Config{i16size}:2 $Config{i32size}:4 $Config{i64size}:8 $Config{osname}:MSWin32 $Config{osvers}:5.2 $ENV{PROCESSOR_ARCHITECTURE}:X86 orig : 11111111000000001111111111111111 ff00ffff use integer:>>16: 11111111111111111111111100000000 ffffff00 no integer :>>16: 00000000000000001111111100000000 0000ff00 orig : 00001111000000001111111111111111 0f00ffff use integer:>>16: 00000000000000000000111100000000 00000f00 no integer :>>16: 00000000000000000000111100000000 00000f00

      Doing low-level unsigned arithmetic can be awkward in perl.

      Seems so, this needs more investigation and i dont have a C compiler for that 32bit machine

      Thanks for the heads up, I will keep looking deeper

      Added:

      Re: How to do 'unsigned shift right' in perl? Note that both "<<" and ">>" in Perl are implemented directly using "<<" and ">>" in C. If use integer (see Integer Arithmetic) is in force then signed C integers are used, else unsigned C integers are used.

      Added2:

      Wow

      Ive got the "fix", just like you said, but understanding why it works and "-hash hash3" still doesnt work on 32bit and how mixing IV,UV and possibly NV arithmetic between mix and hash3 is causing the difference hasnt quite gelled yet. I'll be back! Devel::Peek and perlguts will rescue me!

      Doing low-level unsigned arithmetic can be awkward in perl.

      yupo

      ok so i developed an overly debugged version of mix and the revised mix4 and ran it on the 32 bit box

      # this is inside a 'use integer;' block sub mix4 ($$$) { # per http://www.perlmonks.org/?node_id=1203705 this is a revised 32bi +t under 'use integer'; $_[A] -= $_[B]; $_[A] -= $_[C]; { no integer; $_[A] ^= ($_[C]>>13) +; } $_[B] -= $_[C]; $_[B] -= $_[A]; { no integer; $_[B] ^= ($_[A]<< 8) +; } $_[C] -= $_[A]; $_[C] -= $_[B]; { no integer; $_[C] ^= ($_[B]>>13) +; } $_[A] -= $_[B]; $_[A] -= $_[C]; { no integer; $_[A] ^= ($_[C]>>12) +; } $_[B] -= $_[C]; $_[B] -= $_[A]; { no integer; $_[B] ^= ($_[A]<<16) +; } $_[C] -= $_[A]; $_[C] -= $_[B]; { no integer; $_[C] ^= ($_[B]>> 5) +; } $_[A] -= $_[B]; $_[A] -= $_[C]; { no integer; $_[A] ^= ($_[C]>> 3) +; } $_[B] -= $_[C]; $_[B] -= $_[A]; { no integer; $_[B] ^= ($_[A]<<10) +; } $_[C] -= $_[A]; $_[C] -= $_[B]; { no integer; $_[C] ^= ($_[B]>>15) +; } }

      This is mix

      # there is no 'use integer;' here; sub mix ($$$) { $_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] ^= ($_[C]>>13); $_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] ^= ($_[A]<< 8); $_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] ^= ($_[B]>>13); $_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] ^= ($_[C]>>12); $_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] ^= ($_[A]<<16); $_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] ^= ($_[B]>> 5); $_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] ^= ($_[C]>> 3); $_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] ^= ($_[A]<<10); $_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] ^= ($_[B]>>15); }
      and the results are very interesting

      during the first modification of C mix4exp shows (it is best to look at the following using download version without wrap)

      a1-x c>13 : A : 59743 0000e95f b1-x a< 8 : B : - -162 +9499560 9edfcf58 c1-a : C : - + - -59741 ffff16a3 SV = IV(0xc5f738) at 0xc5f73c REFCNT = 1 FLAGS = (PADMY,IOK,pIOK) IV = -59741 c1-b : C : - + - 1629439819 611f474b SV = IV(0xc5f738) at 0xc5f73c REFCNT = 1 FLAGS = (PADMY,IOK,pIOK) IV = 1629439819 c1-x b>13 : orig 1 : 01100 +001000111110100011101001011 611f474b : orig 2 : 10011 +110110111111100111101011000 9edfcf58 : >>13 : 00000 +000000001001111011011111110 0004f6fe : res : 01100 +001000110111011000110110101 611bb1b5 c1-x b>13 : C : - + - 1629204917 611bb1b5 while mixexp shows a1-x c>13 : A : 59743 0000e95f b1-x a< 8 : B : - -162 +9499560 9edfcf58 c1-a : C : - + - -59741 ffff16a3 SV = IV(0xc5f418) at 0xc5f41c REFCNT = 1 FLAGS = (PADMY,IOK,pIOK) IV = -59741 c1-b : C : - + - -2147483648 80000000 SV = PVNV(0xc7355c) at 0xc5f41c REFCNT = 1 FLAGS = (PADMY,NOK,pNOK) IV = -59741 NV = -2665527477 PV = 0 c1-x b>13 : orig 1 : 10000 +000000000000000000000000000 80000000 : orig 2 : 10011 +110110111111100111101011000 9edfcf58 : >>13 : 00000 +000000001001111011011111110 0004f6fe : res : 10000 +000000001001111011011111110 8004f6fe c1-x b>13 : C : - + - -2147158274 8004f6fe
      The Dump calls are Dump($_[C]);

      Notice that in mix4exp the Dump shows C stays IV, and shows the proper result of mod(32) arithmetic with A and B .

      But notice how in mixexp the Dump shows that after -=A C has stayed IV, with the proper result, but that after -=B C overflowed the IV creating a NV and when it comes to the shift portion it is an overflow indicator (0x80000000) that gets shifted. Of course everything goes to L in a handbasket after that.

      Doing low-level unsigned arithmetic can be awkward in perl.

      yupo

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1203588]
Approved by herveus
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (5)
As of 2024-04-19 23:16 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found