Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

32bit/64bit hash function: Use perls internal hash function?

by sectokia (Pilgrim)
on Apr 08, 2022 at 07:29 UTC ( #11142822=perlquestion: print w/replies, xml ) Need Help??

sectokia has asked for the wisdom of the Perl Monks concerning the following question:

I am looking for a very fast hash function that gives 32bit and/or 64bit values. At the moment I just truncating MD5.

I figured that perl already does this internally, but I couldn't figure out how to call the hash function inside a perl script. Is it possible?

  • Comment on 32bit/64bit hash function: Use perls internal hash function?

Replies are listed 'Best First'.
Re: 32bit/64bit hash function: Use perls internal hash function?
by haukex (Archbishop) on Apr 08, 2022 at 08:02 UTC

    Crypt::Checksum::CRC32 (among others) is implemented in XS, and unlike Perl's internal hash function, which may change from release to release and build to build*, is standardized. Also, when talking about speed, it's important to be specific and to benchmark. How fast is your current implementation? How fast do you need it to be? Have you profiled to find your bottlenecks? What kind of data are you checksumming (what size)? And so on.

    * Update: Not only that, since hashing is randomized, it (clarification: in this case the checksum, not the function) will change from run to run, as I showed here.

Re: 32bit/64bit hash function: Use perls internal hash function?
by Tux (Canon) on Apr 08, 2022 at 09:29 UTC

    Maybe perl's unpack is enough for you:

    In addition to fields allowed in pack, you may prefix a field with a <number> to indicate that you want a <number>-bit checksum of the items instead of the items themselves. Default is a 16-bit checksum. The checksum is calculated by summing numeric values of expanded values (for string fields the sum of ord ($char) is taken; for bit fields the sum of zeroes and ones).

    For example, the following computes the same number as the System V sum program:

    my $checksum = do { local $/; # slurp! unpack ("%32W*", readline) % 65535; };

    Enjoy, Have FUN! H.Merijn
Re: 32bit/64bit hash function: Use perls internal hash function?
by eyepopslikeamosquito (Archbishop) on Apr 08, 2022 at 10:12 UTC

    You've just asked two desperate-looking performance-related questions in quick succession, so I feel obliged to point out (as described in detail at On Code Optimization) that you are taking the wrong (and depressingly common) approach.

    Before asking for advice, please provide us with some background on why performance matters for your problem ... then further clarify by presenting us with a benchmark program that we can run (for a simple example, using Perl's core Benchmark module, see Fastest way to lookup a point in a set).

    In other words, "Donít Optimize Code -- Benchmark It" (Damian Conway).

      My situations is this, any suggestions appreciated, probably someone has done this before:

      In some situations perl hashes are bloated in memory usage: Example: store 100million 64 byte keys with 32bit values. If you do this perl uses >50GiB RAM! If you try and do 1billion keys... forget it.

      Now I don't need to retrieve the keys or store them in memory (I am given them as input and just want the value back from the key).

      So what I am doing is: Hash the key to 2x32bit. Use one 32bit hash as a lookup to an array of 4billion entries. In the array I store the 32bit value and the another 32bits of the hash for collision detection/handling. Result is I can store 100million items in 32GiB RAM. In fact I can store 500million and even push to 2billion items in 32GiB ram. Its only when I approach 4B will re-hashing for collision detecting become a problem.

      Obviously to get perls performance, I need a faster hash than md5, benchmarks were showing it was very slow compared to perls internal hash.

      The values I fetch are then used to seek to a file/positions on disk, to fetch small amounts of data (<64bytes). I was finding the fastest I could achieve was about 20% of the random IOPS of the SSD, with reads being 8KB. By changing to sysread I have been able to completely eliminate this problem 5x faster and now get IOPS same as SSD benchmark results.

        Note that recent work by Nicholas Clark noticed that one of the trade-offs in perl's hash implementation was a bad fit for certain applications like this, and developed a change that would improve it. For an example application of my own (with a few 100 million hash entries) I found this reduced my memory usage by 20% and gave substantial speed improvements - maybe not enough of a saving for your case, but should at least move the threshold at which you need to consider alternatives.

        It currently looks likely that that will be available only as a build-time option for perl-5.36, and possibly as a flag you can turn on for individual hashes in 5.38 - the original plan was to turn it on always, but it turned out to have negative implications for certain other usage patterns. Here's the perldelta entry for the original plan: https://github.com/Perl/perl5/commit/fa92924b30.

        Hugo

        Two (possibly off base, but given your description . . .) ideas that come to mind:

        • use a Bloom Filter (edit: probably not relevant)
        • use DB_File and let it build a btree (backed on disk) for you rather than you manually (basically) doing similar

        Edit: tweaked; upon rereading I don't get that your dataset's necessarily sparse (in that you need to quickly check for membership or not) so the bloom suggestion's off base probably.

        The cake is a lie.
        The cake is a lie.
        The cake is a lie.

        If you could illustrate your problem with an smallish example benchmark program that would be good and should lead to better answers.

        Without seeing any code, we are forced to guess. I am guessing your problem is similar to a thread I mentioned earlier, where some folks suggested using an external database (such as SQLite) instead of a gigantic Perl hash. From that thread, this quote from the infamous BrowserUk sounds similar to the problem you're facing:

        I've run Perl's hashes up to 30 billion keys/2 terabytes (ram) and they are 1 to 2 orders of magnitude faster, and ~1/3rd the size of storing the same data (64-bit integers) in an sqlite memory-based DB. And the performance difference increases as the size grows. Part of the difference is that however fast the C/C++ DB code is, calling into it from Perl, adds a layer of unavoidable overhead that Perl's built-in hashes do not have.

        Have you tried using a database (such as SQLite) instead of a gigantic Perl hash? If so, what was the performance difference?

        Though I can't personally endorse Judy Arrays because I've never used them, they worked for BrowserUk, albeit with a warning that the Judy code is horribly complex and hard to understand if something goes wrong.

Re: 32bit/64bit hash function: Use perls internal hash function?
by haukex (Archbishop) on Apr 08, 2022 at 09:49 UTC

    To answer your question directly, PERL_HASH is exposed by B:

    $ perl -MB=hash -le 'print hash("foo")' 0xad6987c4 $ perl -MB=hash -le 'print hash("foo")' 0x5fd458f7

    But as I said, you may want to consider using a standardized function, which when implemented in XS should be "nearly as fast". Also, keep in mind that in hashes, collisions are acceptable - short checksums will do that, and since you haven't told us anything about the problem you're trying to solve, we can't know if that is appropriate in your case.

      Sadly, B from hash() is useless for high speed stuff, because it appears to use a printf function to make its output string, which cripples its speed. Seems to be used just for debugging hash values, and not as interface to get a 32bit hash.
        it appears to use a printf function to make its output string

        That is true, but its implementation in XS is pretty simple and can be adapted to suit your needs. Given the performance constraints you've described, I think you're going to have to venture in the direction of C/XS anyway.

        Update: In fact, it turns out to be pretty easy!

        use warnings; use strict; use B 'hash'; use Inline C => <<'_C_'; U32 myhash(SV* sv) { STRLEN len; U32 hash = 0; const char *s = SvPVbyte(sv, len); PERL_HASH(hash, s, len); return hash; } _C_ print hash("foo"), "\n"; printf "%#x\n", myhash("foo"); __END__ 0x6611676e 0x6611676e
      Thanks! I am a complete noob when it comes to XS, need to really learn it.
        use strict; use warnings; use feature 'say'; use B 'hash'; use Crypt::xxHash 'xxhash3_64bits'; use Digest::xxH64 'xx64'; use Benchmark 'cmpthese'; use Inline C => <<'_C_'; U32 myhash(SV* sv) { STRLEN len; U32 hash = 0; const char *s = SvPVbyte(sv, len); PERL_HASH(hash, s, len); return hash; } _C_ srand 1234; my $s = pack 'C*', map rand 256, 1 .. 64; cmpthese -2, { hash => sub { my $x = hash( $s )}, myhash => sub { my $x = myhash( $s )}, xxhash => sub { my $x = xxhash3_64bits( $s, 0 )}, xx64 => sub { my $x = xx64( $s )}, }; __END__ Rate hash myhash xxhash xx64 hash 1944302/s -- -52% -54% -84% myhash 4088577/s 110% -- -3% -66% xxhash 4233986/s 118% 4% -- -65% xx64 11994386/s 517% 193% 183% -- This is perl 5, version 32, subversion 1 (v5.32.1) built for MSWin32-x +64-multi-thread

        Try xxHash? The Digest::xxH64 is not on CPAN (but linked to from home i.e. officially 'endorsed'(?):)), Crypt::xxHash needs a fix to install in Windows, and Digest::xxHash (not in example above) is slower and therefore perhaps not of much interest in context of 'B::hash is too slow'.

        As already mentioned, the Judy::HS provides both hashing and sparse storage already built-in under-the-hood. So maybe manually-done hashing is not what you need. I have 'played' (i.e. not in serious 'production') with Judy (but not with Judy::HS) to store and access huge sparse data, and, yes, speed is comparable to Perl hashes with significantly less RAM appetites.

        Another option to consider: Math::GSL::SparseMatrix (and GSL being solid and renowned, etc.). As above, I 'played' with 64-bit-addressed sparse single-row (or was it single-column?) vector. Slower than Judy, yet installs without hassle in Windows, theoretically can address 128-bit sparse space (because of 2D) and can store data shorter than 64-bit integers i.e. needs even less RAM in that case.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11142822]
Approved by marto
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others meditating upon the Monastery: (6)
As of 2023-11-29 09:15 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?