If it were not for pfaut's unpack suggestion, I almsot certainly would be using it, but the loss of performance of the perl implimentation relative to the C switch version means pushing as much of the work into perl as I can has dividends.
The original C code is here.
My current best perl implementation of the while thing is the hash2() in the test code below. hash() being my original brute force conversion. The two version appear to function the same as each other, with the unpack version coming out about %50 faster.
However, the number of first pass collisions worries me a little. It could be just be a function of the limited range of inputs, but its higher than I expected. The other possibility is that both implementations are equally wrong. If you (or anyone) has a few moments to compare the perl and the C versions and point out any obvious differences I'd be grateful.
Also, if anyone has any idea's or pointers for methods of testing the goodness of this type of hashing function I'd really like to hear/see them. Unfortunately, the vast majority of the hits on google relate to hashing fuctions used for cryptography which is a completly different ballpark. I'm slowly whittling them down and wading through them, but haven't found anything applicable yet.
My test code
#! perl -slw
use strict;
use constant GOLDEN_RATIO => 0x9e3779b9;
use constant A => 0;
use constant B => 1;
use constant C => 2;
sub mix ($$$) {
$_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] ^= ($_[C]>>13);
$_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] ^= ($_[A]<< 8);
$_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] ^= ($_[B]>>13);
$_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] ^= ($_[C]>>12);
$_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] ^= ($_[A]<<16);
$_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] ^= ($_[B]>> 5);
$_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] ^= ($_[C]>> 3);
$_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] ^= ($_[A]<<10);
$_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] ^= ($_[B]>>15);
}
sub hash {
my ($key, $init) = @_;
my @k = unpack 'C*', $key;
my $len = length $key;
my ($a, $b, $c) = (GOLDEN_RATIO, GOLDEN_RATIO, $init);
while($len >=12) {
$a += ($k[0] + ($k[1]<<8) + ($k[2]<<16) + ($k[3]<24));
$b += ($k[4] + ($k[5]<<8) + ($k[6]<<16) + ($k[7]<24));
$a += ($k[8] + ($k[9]<<8) + ($k[10]<<16) + ($k[11]<24));
mix($a,$b,$c);
splice(@k, 0, 11); $len -= 12;
}
$c += length $key;
goto "LAST".$len;
LAST11: $c+= $k[10] <<24;
LAST10: $c+= $k[9] <<16;
LAST9: $c+= $k[8] <<8;
LAST8: $b+= $k[7] <<24;
LAST7: $b+= $k[6] <<16;
LAST6: $b+= $k[5] <<8;
LAST5: $b+= $k[4];
LAST4: $a+= $k[3] <<24;
LAST3: $a+= $k[2] <<16;
LAST2: $a+= $k[1] <<8;
LAST1: $a+= $k[0];
LAST0:
mix($a,$b,$c);
return $c;
}
sub hash2 {
my ($key,$init) = @_;
my($a,$b,$c) = (GOLDEN_RATIO, GOLDEN_RATIO, $init);
my ($p, $len) = (0, length $key);
do {
my($x,$y,$z) = unpack 'V3', substr($key, $p, 12);
mix($a+=($x||0), $b+=($y||0), $c+=($z||$len));
$p+=12;
} while $p <= $len;
return $c;
}
my %freq;
my $n=0;
$n++, push @{$freq{hash( $_, 0)}}, $_ for 'AAAA' .. 'ZZZZ';
print $n, ':', scalar keys %freq;
undef(%freq);
$n=0;
$n++, push @{$freq{hash2( $_, 0)}}, $_ for 'AAAA' .. 'ZZZZ';
print $n, ':', scalar keys %freq;
Benchmark results
c:\test>hash
456976 values resulted in 233710 unique hashes & a maximum of 34 first
+ pass collisions
456976 values resulted in 233710 unique hashes & a maximum of 34 first
+ pass collisions
1 trial of hash (195.751s total)
1 trial of hash2 (133.032s total)
Examine what is said, not who speaks.
The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.