Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Re^2: 32bit/64bit hash function: Use perls internal hash function?

by sectokia (Pilgrim)
on Apr 09, 2022 at 01:53 UTC ( [id://11142863] : note . print w/replies, xml ) Need Help??


in reply to Re: 32bit/64bit hash function: Use perls internal hash function?
in thread 32bit/64bit hash function: Use perls internal hash function?

My situations is this, any suggestions appreciated, probably someone has done this before:

In some situations perl hashes are bloated in memory usage: Example: store 100million 64 byte keys with 32bit values. If you do this perl uses >50GiB RAM! If you try and do 1billion keys... forget it.

Now I don't need to retrieve the keys or store them in memory (I am given them as input and just want the value back from the key).

So what I am doing is: Hash the key to 2x32bit. Use one 32bit hash as a lookup to an array of 4billion entries. In the array I store the 32bit value and the another 32bits of the hash for collision detection/handling. Result is I can store 100million items in 32GiB RAM. In fact I can store 500million and even push to 2billion items in 32GiB ram. Its only when I approach 4B will re-hashing for collision detecting become a problem.

Obviously to get perls performance, I need a faster hash than md5, benchmarks were showing it was very slow compared to perls internal hash.

The values I fetch are then used to seek to a file/positions on disk, to fetch small amounts of data (<64bytes). I was finding the fastest I could achieve was about 20% of the random IOPS of the SSD, with reads being 8KB. By changing to sysread I have been able to completely eliminate this problem 5x faster and now get IOPS same as SSD benchmark results.

  • Comment on Re^2: 32bit/64bit hash function: Use perls internal hash function?

Replies are listed 'Best First'.
Re^3: 32bit/64bit hash function: Use perls internal hash function?
by hv (Prior) on Apr 09, 2022 at 12:04 UTC

    Note that recent work by Nicholas Clark noticed that one of the trade-offs in perl's hash implementation was a bad fit for certain applications like this, and developed a change that would improve it. For an example application of my own (with a few 100 million hash entries) I found this reduced my memory usage by 20% and gave substantial speed improvements - maybe not enough of a saving for your case, but should at least move the threshold at which you need to consider alternatives.

    It currently looks likely that that will be available only as a build-time option for perl-5.36, and possibly as a flag you can turn on for individual hashes in 5.38 - the original plan was to turn it on always, but it turned out to have negative implications for certain other usage patterns. Here's the perldelta entry for the original plan: https://github.com/Perl/perl5/commit/fa92924b30.

    Hugo

Re^3: 32bit/64bit hash function: Use perls internal hash function?
by Fletch (Bishop) on Apr 09, 2022 at 02:34 UTC

    Two (possibly off base, but given your description . . .) ideas that come to mind:

    • use a Bloom Filter (edit: probably not relevant)
    • use DB_File and let it build a btree (backed on disk) for you rather than you manually (basically) doing similar

    Edit: tweaked; upon rereading I don't get that your dataset's necessarily sparse (in that you need to quickly check for membership or not) so the bloom suggestion's off base probably.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      Or (shameless plug) a real database using Tie::Hash::DBD, though that might be slow.


      Enjoy, Have FUN! H.Merijn
Re^3: 32bit/64bit hash function: Use perls internal hash function?
by eyepopslikeamosquito (Archbishop) on Apr 09, 2022 at 02:36 UTC

    If you could illustrate your problem with an smallish example benchmark program that would be good and should lead to better answers.

    Without seeing any code, we are forced to guess. I am guessing your problem is similar to a thread I mentioned earlier, where some folks suggested using an external database (such as SQLite) instead of a gigantic Perl hash. From that thread, this quote from the infamous BrowserUk sounds similar to the problem you're facing:

    I've run Perl's hashes up to 30 billion keys/2 terabytes (ram) and they are 1 to 2 orders of magnitude faster, and ~1/3rd the size of storing the same data (64-bit integers) in an sqlite memory-based DB. And the performance difference increases as the size grows. Part of the difference is that however fast the C/C++ DB code is, calling into it from Perl, adds a layer of unavoidable overhead that Perl's built-in hashes do not have.

    Have you tried using a database (such as SQLite) instead of a gigantic Perl hash? If so, what was the performance difference?

    Though I can't personally endorse Judy Arrays because I've never used them, they worked for BrowserUk, albeit with a warning that the Judy code is horribly complex and hard to understand if something goes wrong.

      In terms of database performance: Basically I have ~120m+ records, about 10m per day added, and 10m deleted. And the need to fetch (via UDP request) a record by its identifer and respond within <5mS, with a peak around 100 to 150 requests per second.

      No database we have tried consistently achieves this. Most of them can do it for a while, then slow down, or inconsistently achieve it. MSSQL can go for 10+ seconds without answering a basic query if you are inserting and deleting heavily at the same time. Same with mssql. Even Redis will suddenly lurch for hundreds of milliseconds especially when large numbers of records all expire at the same time, including when you have it running under linux as RT with top FIFO scheduling. The simple fact seems to be almost no databases don't have branches that cause large amount of I/O and internal work to be done before they get back to doing their job of answering your query.

      In comparison this hash perl cache thing, has a perfect record because it never has to do anything else. So long as it has priority to disk and CPU, it works flawlessly. I can ingest over 100,000 records per second, while still servicing 5,000 requests per second on a modest PC. It achieves <1mS 100% of the time.

      In terms of why I want a faster hasher: The number of hashes per insert is 1 when the hash table is empty, averages 2 when the hash table is half full, and obviously ramps up to infinity when the hash table is full. By moving to a faster hash algorithm, I am able to move the hash table size down toward the total number of records (because as you do that number of hashes for an insert goes up) which wastes less RAM but increases CPU.

      In terms of the 8KB vs 512b IO fetch: This allows the number of records per second I can fetch to approach the storage/SSD's IOPS.