Better Hash Tables?

by QM (Parson)
on Feb 15, 2025 at 16:07 UTC

QM has asked for the wisdom of the Perl Monks concerning the following question:

Do we know anything more useful about this?

Undergraduate Upends a 40-Year-Old Data Science Conjecture

Paper on arxiv

Since Perl uses linked lists when collisions occur, most (all?) of this new work seems irrelevant?

What are the methods used in other languages / implementations?

In my brief research, it seems the only marginally questionable thing Perl does is not resize based on collisions. In the worst case all entries can end up in the same bucket, yet a resize isn't triggered until the load factor reaches 75%?
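
For illustration, here is a toy sketch of that chaining-plus-load-factor scheme. It is not Perl's actual hv.c implementation; the ToyHash class, the stand-in hash function, and the 0.75 threshold are just placeholders for the behaviour described above.

# Toy separate-chaining hash table with a load-factor resize (illustration only).
use strict;
use warnings;

package ToyHash {

    sub new {
        my ($class) = @_;
        return bless { buckets => [ map { [] } 1 .. 8 ], size => 0 }, $class;
    }

    sub _index {
        my ($self, $key) = @_;
        my $h = 0;                                    # stand-in hash function
        $h = ($h * 33 + ord $_) & 0xFFFFFFFF for split //, $key;
        return $h % scalar @{ $self->{buckets} };
    }

    sub set {
        my ($self, $key, $value) = @_;
        my $chain = $self->{buckets}[ $self->_index($key) ];
        for my $pair (@$chain) {                      # walk this bucket's chain
            if ($pair->[0] eq $key) { $pair->[1] = $value; return }
        }
        push @$chain, [ $key, $value ];
        $self->{size}++;
        # Resize is driven by the overall load factor, not by chain length.
        $self->_grow if $self->{size} / @{ $self->{buckets} } > 0.75;
    }

    sub get {
        my ($self, $key) = @_;
        for my $pair (@{ $self->{buckets}[ $self->_index($key) ] }) {
            return $pair->[1] if $pair->[0] eq $key;
        }
        return undef;
    }

    sub _grow {                                       # double and re-insert
        my ($self) = @_;
        my @pairs = map { @$_ } @{ $self->{buckets} };
        $self->{buckets} = [ map { [] } 1 .. 2 * @{ $self->{buckets} } ];
        $self->{size}    = 0;
        $self->set(@$_) for @pairs;
    }
}

my $h = ToyHash->new;
$h->set( apple => 3 );
print $h->get('apple'), "\n";    # 3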

-QM
--
Quantum Mechanics: The dreams stuff is made of

Replies are listed 'Best First'.
Re: Better Hash Tables?
by cavac (Prior) on Feb 18, 2025 at 15:45 UTC
      Both Rust and Go have ports of Google's Swiss Tables, which use less memory and are faster because they use AES and SSE instructions.
      Thanks for the links.

      There's a lot going unsaid in that article (not surprised, we're not the intended audience).

      From my further research, Perl resizes based on load factor, typically >75% (depends on the implementation). However, collisions are handled with a linked list, so in the worst case, a single bucket would hold all of the entries.

      The article and paper talk about "open-addressing". An open-addressing hash table is a data structure that uses probing to resolve collisions when storing keys in a fixed-length array. It's also known as closed hashing. (Nothing like confusing terminology.)

      Apparently, Python uses open-addressing with probing. The loading factor threshold for resizing is 2/3.
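
      For concreteness, here is a minimal open-addressing sketch using plain linear probing. The function names and hash are made up, and CPython's dict actually uses a perturb-based probe sequence in dictobject.c rather than linear probing; this only shows the general idea of probing a fixed-size array.

# Open addressing with linear probing (illustration only, not CPython's scheme).
use strict;
use warnings;

my @slots = (undef) x 16;            # fixed-size array; undef means "empty"

sub _hash {                          # stand-in hash function
    my ($key) = @_;
    my $h = 0;
    $h = ($h * 33 + ord $_) & 0xFFFFFFFF for split //, $key;
    return $h;
}

sub probe_insert {
    my ($key, $value) = @_;
    my $i = _hash($key) % @slots;
    for (1 .. @slots) {
        if (!defined $slots[$i] || $slots[$i][0] eq $key) {
            $slots[$i] = [ $key, $value ];
            return $i;
        }
        $i = ($i + 1) % @slots;      # collision: probe the next slot
    }
    die "table full -- a real implementation would have resized before this";
}

sub probe_lookup {
    my ($key) = @_;
    my $i = _hash($key) % @slots;
    for (1 .. @slots) {
        return undef unless defined $slots[$i];         # empty slot: not present
        return $slots[$i][1] if $slots[$i][0] eq $key;
        $i = ($i + 1) % @slots;
    }
    return undef;
}

probe_insert( apple => 3 );
print probe_lookup('apple'), "\n";   # 3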

      Since Perl uses linked lists to "resolve" collisions, this paper doesn't appear to apply.

      There was this weird fact that popped up: Instead of just doubling, Python picks the next power of two to optimize performance. I mean, it doesn't start with a power-of-two size?
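
      Presumably the "next power of two" part is just the usual rounding trick, something like the snippet below; next_pow2 is a made-up helper here, not CPython's actual resize code.

sub next_pow2 {
    my ($n) = @_;
    my $p = 1;
    $p <<= 1 while $p < $n;          # smallest power of two >= $n
    return $p;
}
print next_pow2(600), "\n";          # 1024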

      There was nothing in the article that helped me understand why the paper makes anything better, except hand-wavy declarations that it's true.

      I also started to read the linked paper. Didn't get very far; I didn't have the time to wade in, learn the terminology, and read the other papers it references.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

Re: Better Hash Tables?
by Anonymous Monk on Mar 05, 2025 at 04:56 UTC

    IMO this doesn't apply to perl at all, because the whole concept of the paper is based on having a fixed-size array in which you are trying to find an unoccupied slot to insert into. This is really more useful in memory-constrained environments and in hardware / logic design.

    A recent example of this I faced: I had 600 million key:value records on disk. You cannot load these into a Perl hash because of the memory requirements. However, you can store each record's disk offset in an array, where the array index is determined by hashing the key. This lets the hash table fit within a few gigabytes of RAM. To fetch an item you hash the key to get the index, look up the disk offset stored there, read the key:value record from disk, and confirm the key.

    The only question is: how big should the array be? If it's close to 600 million you are endlessly dealing with hash collisions, but if you make it bigger, say 1200 million, half your RAM is wasted. When you do get a collision, you have no option but to re-hash, which gives a new, effectively random position. To fetch, you try the first candidate, load it from disk, and see if the key matches; if not, hash again and fetch the next candidate, since there must have been a collision.
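
    Roughly, that scheme could look like the sketch below. The names, the scaled-down slot count, the 64-attempt cap, and the tab-separated record format are all made up for illustration; at 8 bytes per offset, ~600 million slots would be about 5 GB of RAM, which is where the "few gigabytes" comes from. Requires a 64-bit Perl for the "Q" pack template.

# In-RAM packed array of 8-byte disk offsets, indexed by hashing the key;
# the actual key:value records live on disk (illustration only).
use strict;
use warnings;
use Digest::MD5 qw(md5);

my $SLOTS = 1 << 20;                 # scaled down for the sketch
my $table = "\0" x ($SLOTS * 8);     # packed array of 8-byte offsets

sub slot_of {                        # candidate slot; re-hash on collision by
    my ($key, $attempt) = @_;        # folding the attempt number into the hash
    return unpack("Q", md5("$attempt:$key")) % $SLOTS;
}

sub store_offset {                   # remember where the record starts on disk
    my ($key, $disk_offset) = @_;    # (offset 0 is reserved to mean "empty")
    for my $attempt (0 .. 63) {
        my $i   = slot_of($key, $attempt);
        my $cur = unpack("Q", substr($table, $i * 8, 8));
        if ($cur == 0) {
            substr($table, $i * 8, 8) = pack("Q", $disk_offset);
            return;
        }
        # slot occupied: collision, try the next candidate
    }
    die "too many collisions -- the table is too small";
}

sub fetch {                          # hash, jump to disk, confirm the key
    my ($fh, $key) = @_;
    for my $attempt (0 .. 63) {
        my $off = unpack("Q", substr($table, slot_of($key, $attempt) * 8, 8));
        return undef if $off == 0;                   # never stored
        seek $fh, $off, 0;
        my $line = <$fh>;
        chomp $line;
        my ($k, $v) = split /\t/, $line, 2;          # "key<TAB>value" records
        return $v if $k eq $key;                     # key confirmed
    }
    return undef;
}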

    What this paper shows is that there is a better method than simply trying a new random hash, one that can boost performance when the array size stays close to 600 million.

    It would be useful in situations like mine, but I don't see how it's useful to perl, which has dynamically allocated, variable-size linked lists rather than a fixed array.

Re: Better Hash Tables?
by Jenda (Abbot) on Feb 18, 2025 at 21:39 UTC
    Fullness can be described in terms of an overall percentage — this table is 50% full, that one’s 90% — but researchers often deal with much fuller tables. So instead, they may use a whole number, denoted by x, to specify how close the hash table is to 100% full. If x is 100, then the table is 99% full. If x is 1,000, the table is 99.9% full. This measure of fullness offers a convenient way to evaluate how long it should take to perform actions like queries or insertions.

    Seems once again researchers are researching exactly the wrong thing. I mean if the table is 99% full, you should have increased its size a long long time ago. If it's even more full, you are an ... you are clearly not interested in doing anything practical, but rather seem to be working towards a degree in theoretical something or other.

    While it's possible that there's something of some practical use in that paper (so far I've only read the article), I would not hold my breath. The topic being already pathologically overstuffed hashes, there might be something in it for the sanely proportioned ones too, but...

    Jenda
    1984 was supposed to be a warning,
    not a manual!

      Seems once again researchers are researching exactly the wrong thing.

      You haven't read the linked paper, have you?

      Greetings,
      🐻

      $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

        If you read my comment, you can see me stating openly that I only react to the article, not to the paper. :-)

        I'm not through the paper yet, and I admit I would prefer if they stated the algorithm (also) in (pseudo)code rather than (just) as a mix of English and maths, but so far I really do not think we will see this in perl any time soon. The algorithm looks fairly complicated, and while I have no reason to dispute the asymptotic behaviour, I would expect the real-world performance for hashes that are not excessively full to be rather bad. I do believe that for real-world applications it will be better to allocate more memory for the table than to complicate the code.

        Jenda
        1984 was supposed to be a warning,
        not a manual!

Re: Better Hash Tables?
by Anonymous Monk on Feb 19, 2025 at 22:31 UTC
    Is there really no code with benchmarks to go along with the paper?

    "Beware of bugs in the above code; I have only proved it correct, not tried it." -- Donald Knuth

      No code, no benchmarks. This seems to be a purely theoretical paper, from what little I can gather (I pretty much gave up reading it).

      So we basically don't know if this algorithm is actually faster on modern processor architectures than the old one and under which circumstances. In most algos these days, you have to optimize for a time/space tradeoff. For example:

      • A sort algorithm might be faster if it takes more steps, but keeps the data it is currently working on in L1 cache.
      • A database takes lots of extra processing steps to generate an index of a table and keep it up to date in RAM. But for searching, it can quickly-ish scan a few gigabytes in RAM instead of reading terabytes of data from a spinning disk.
      • GPU data processing is weird. If-then-else statements get turned into complicated mathematical operations (slow, and needing a special compiler) so the GPU can iterate over multiple values at once (SIMD), ideally processing the whole dataset in a single pass without jumps and conditionals.
      • On older 8/16 bit processors, developers used (big for the time) lookup tables for some mathematical operations like sin() or sqrt() whenever they could spare the memory, because it was faster to look up the result than to calculate it. On modern processors, it's very much the opposite.

      Without testing with real world hardware and data, it's hard, if not nearly impossible, to tell how well an algorithm will perform and what tradeoffs can be made under which circumstances to optimize it.

      PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP
      Also check out my sister's artwork and my weekly webcomics
Re: Better Hash Tables?
by etj (Priest) on Feb 20, 2025 at 19:14 UTC
    Summarised: "I haven't read (much of) the actual paper that was published, but somehow I still have loudly-expressed opinions on the topic!"
