Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Fast string hash in portable perl?

by jhanna (Scribe)
on Dec 19, 2003 at 19:16 UTC ( #315876=perlquestion: print w/ replies, xml ) Need Help??
jhanna has asked for the wisdom of the Perl Monks concerning the following question:

Hi. I manage ASSP (assp.sf.net), a gpl spam filter entirely in perl. I'm looking for a pure perl string hashing subroutine that is as fast as possible with decent hash distribution. Pass a string in and get a 32bit int back. No need for cryptographic security. No need for the same results on different operating systems.

I've thought of unpacking the string as 32bit uints and xor-ing them. That would be fast, but i'm not sure it has ideal distribution over the solution space.

Perl has it's own string hasher built in -- is there an under-the-hood way to get it to process my strings and get ints out? That should be fast and work cross platform.

Or what's your favorite 32bit hash algorythm?

John

Comment on Fast string hash in portable perl?
Re: Fast string hash in portable perl?
by duff (Vicar) on Dec 19, 2003 at 19:25 UTC

    There's the B::hash() routine which returns a string representation of the hexadecimal representation of the number generated by perl's hashing function. perldoc B

    Is that good enough?

      I think this will be the fastest just because it is actually using compiled code in the perl bin to perform the hash. The only issue is that perls hash keygen does allow collisions and they are going to be much more frequent than lets say an md5 generator. it all depends what you are using it for.


      -Waswas
      This is exactly what i was looking for. I need to spend more time in the perldoc B. :-) I'll futz with it a bit to make sure it has few enough collisions, but I expect it will do nicely.

      Thank you kindly!

      j

Re: Fast string hash in portable perl?
by Corion (Pope) on Dec 19, 2003 at 19:25 UTC

    I believe if you're hell-bent on pure Perl, there is no faster thing than the CRC32 for "sufficiently good" hashing. Enough work has gone into CRC32 to make it fast, as it has been in use for a long time. Oh - there even is a module for it - CRC32

    But why not simply use Digest::MD5 ? I think it's core nowadays, and there even is a (slow) pure Perl version available ...

    Update:Added link to the CRC32 module

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web

      CRC32 is not a good choice for hashing strings for lookup purposes. There are too many collosion hotspots. And MD5 isn't a 32-bit value.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      Hooray!

        Not 32 bit? Then just run the B::hash on the MD5! Just kidding folks, just kidding!

        You are exactly right on CRC32. I can't think of any case where CRC32 is really a good choice.

      PurePerl generally means things that don't require a a C compiler to install, so that Win32 can easily use it. CRC32 has an XS component.

      ------
      We are the carpenters and bricklayers of the Information Age.

      Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

        How is that an issue?
Re: Fast string hash in portable perl?
by BrowserUk (Pope) on Dec 19, 2003 at 19:41 UTC

    Here's one I prepared ealier:) I went through this about 6 months back and found this hashing function on the web which seems to be a fairly good one.

    I posted here (and here D'oh!) in an attempt to get the quickest implementation of this in pure perl, and this was my chosen implementation based on pfaut's offering.

    use constant GOLDEN_RATIO => 0x9e3779b9; use constant A => 0; use constant B => 1; use constant C => 2; sub mix ($$$) { $_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] ^= ($_[C]>>13); $_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] ^= ($_[A]<< 8); $_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] ^= ($_[B]>>13); $_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] ^= ($_[C]>>12); $_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] ^= ($_[A]<<16); $_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] ^= ($_[B]>> 5); $_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] ^= ($_[C]>> 3); $_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] ^= ($_[A]<<10); $_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] ^= ($_[B]>>15); } use constant KEY => 0; use constant INITHASH => 1; sub hash { use integer; my ($a, $b, $c) = ( GOLDEN_RATIO, GOLDEN_RATIO, $_[INITHASH] ); my ($p, $len) = (0, length $_[KEY]); do { my($x,$y,$z) = unpack 'LLL', substr($_[KEY] . (chr(0)x11), $p, + 12); $z||='0'; $z<<=8; mix($a += $x, $b += $y, $c += $z); $p += 12; } while $p <= $len; return $c; }

    You'd best do your own testing of this before you consider using it, to convince yourself it is a faithful implementation of the original.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Hooray!

      We have this hashing function (Jenkins Hash) as an XS extension, yes you have to compile it (we have compiled on Win32 and Linux without issues) but the result is of course much faster. You can get a copy from http://tachyon.perlmonk.org/Digest-JHash-0.02.tar.gz

      We did a fair bit of dispersion testing on it and found it to be pretty damn good using URLs as the input data which is what we wanted it for. The dispersion peaks at somewhere around 90%+ of slots from memory.

      Maybe we should post Digest::JHash and Digest::JHahs::PurePerl to CPAN?. Here it is as XS (Digest::JHash.XS):

      #include "EXTERN.h" #include "perl.h" #include "XSUB.h" /* Jenkins Hash http://burtleburtle.net/bob/hash/doobs.html */ const int DEBUG = 0; #define MIX(a,b,c) \ { \ a -= b; a -= c; a ^= (c>>13); \ b -= c; b -= a; b ^= (a<<8); \ c -= a; c -= b; c ^= (b>>13); \ a -= b; a -= c; a ^= (c>>12); \ b -= c; b -= a; b ^= (a<<16); \ c -= a; c -= b; c ^= (b>>5); \ a -= b; a -= c; a ^= (c>>3); \ b -= c; b -= a; b ^= (a<<10); \ c -= a; c -= b; c ^= (b>>15); \ } unsigned long jhash( SV* str ) { STRLEN rawlen; char* p; unsigned long a, b, c, len, length; /* extract the string data and string length from the perl scalar +*/ p = (char*)SvPV(str, rawlen); length = len = (unsigned long)rawlen; /* Test for undef or null string case and return 0 */ if ( length == 0 ) { DEBUG && printf( "Recieved a null or undef string!\n" ); return 0; } DEBUG && printf( "Received string '%.*s'.\n", (int)len, p ); a = b = 0x9e3779b9; /* golden ratio suggested by Jenkins */ c = 0; while (len >= 12) { a += (p[0] + (((unsigned long)p[1])<<8) + (((unsigned long)p[2 +])<<16) + (((unsigned long)p[3])<<24)); b += (p[4] + (((unsigned long)p[5])<<8) + (((unsigned long)p[6 +])<<16) + (((unsigned long)p[7])<<24)); c += (p[8] + (((unsigned long)p[9])<<8) + (((unsigned long)p[1 +0])<<16) + (((unsigned long)p[11])<<24)); MIX(a, b, c); p += 12; len -= 12; } c += length; switch(len) { case 11: c+=((unsigned int)p[10]<<24); case 10: c+=((unsigned int)p[9]<<16); case 9: c+=((unsigned int)p[8]<<8); case 8: b+=((unsigned int)p[7]<<24); case 7: b+=((unsigned int)p[6]<<16); case 6: b+=((unsigned int)p[5]<<8); case 5: b+=((unsigned int)p[4]); case 4: a+=((unsigned int)p[3]<<24); case 3: a+=((unsigned int)p[2]<<16); case 2: a+=((unsigned int)p[1]<<8); case 1: a+=((unsigned int)p[0]); } MIX(a, b, c); DEBUG && printf( "Hash value is %d.\n", (int)(c) ); return(c); } MODULE = Digest::JHash PACKAGE = Digest::JHash unsigned long jhash(str) SV* str # Digest::JHash.pm package Digest::JHash; use strict; require Exporter; require DynaLoader; our @ISA = qw(Exporter DynaLoader); our @EXPORT_OK = qw( jhash ); our $VERSION = '0.02'; bootstrap Digest::JHash $VERSION; 1; =head1 NAME Digest::JHash - Perl extension for JHash Hashing Algoritm =head1 SYNOPSIS use Digest::JHash qw(jhash); $digest = jhash($data); # note that calling jhash() directly like this is the fastest way: $digest = Digest::JHash::jhash($data); =head1 DESCRIPTION The C<Digest::JHash> module allows you to use the fast JHash hashing a +lgorithm developed by Bob Jenkins from within Perl programs. The algorithm take +s as input a message of arbitrary length and produces as output a 32-bit "message digest" of the input in the form of an unsigned long integer. Call it a low calorie version of MD5 if you like. See http://burtleburtle.net/bob/hash/doobs.html for more information. =head1 FUNCTIONS =over 4 =item jhash($data) This function will calculate the JHash digest of the "message" in $dat +a and return a 32 bit integer result (an unsigned long in the C) =back =head1 EXPORTS None by default but you can have the jhash() function if you ask nicel +y. See below for reasons not to use Exporter (it is slower than a direct +call) =head1 SPEED NOTE If speed is a major issue it is roughly twice as fast to do call the j +hash() function like Digest::JHash::jhash('message') than it is to import the jhash() method using Exporter so you can call it as simply jhash('mess +age'). There is a short script that demonstrates the speed of differecnt call +ing methods (direct, OO and Imported) in misc/oo_vs_func.pl =head1 AUTHORS The JHash implementation was written by Bob Jenkins (C<bob_jenkins@burtleburtle.net>). This perl extension was written by Andrew Towers (C<mariofrog@bigpond.com>). A few mods were added by James Freeman (C<james.freeman@id3.org.uk>). =head1 SEE ALSO http://burtleburtle.net/bob/hash/doobs.html =cut

      cheers

      tachyon

        Thanks tachyon, that's great.

        I did compile the original C code as a C program and use it for comparisons with the perl version when I was working on this, though that was some time ago. The C was obviously much faster.

        Unfortunately, despite my best efforts, I still live in the "nothing I must compile myself" perl world.

        I do have a version of 5.8.1 that I compiled with Borland that works, but every time I've tried to use modules that require compilation in conjunction with that build, I spend hours or days trying to hack the build into working. I'm ashamed to say that I gave up and went back to being a AS dependant.

        Thanks to the number of kind folks around that make binary versions of those modules that AS haven't gotten around to yet, I rarely encounter a module that I cannot install and use in short order, given the appropriate googling.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        Hooray!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://315876]
Approved by calin
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (4)
As of 2014-09-02 00:41 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite cookbook is:










    Results (18 votes), past polls