QM has asked for the wisdom of the Perl Monks concerning the following question:
Do we know anything more useful about this?
Undergraduate Upends a 40-Year-Old Data Science Conjecture
Paper on arXiv
Since Perl uses linked lists when collisions occur, most (all?) of this new work seems irrelevant?
What are the methods used in other languages / implementations?
In my brief research, it seems the only marginally questionable thing Perl does is not resize based on collisions. It seems that, in the worst case, all entries can go in the same bucket, but a resize isn't triggered until a 75% load factor is reached?
-QM
--
Quantum Mechanics: The dreams stuff is made of
Re: Better Hash Tables?
by cavac (Prior) on Feb 18, 2025 at 15:45 UTC
This is certainly worth a look. I'm not an expert by any means when it comes to this stuff. But in general, it is certainly prudent to be very cautious when changing a crucial algorithm like hash tables. It may be faster in most cases, but is it also robust enough to withstand attacks? Is it guaranteed to always return the data that it should?
Even better performance than we already have would be nice, if this algo applies to Perl. But it would need very thorough testing in a wide range of settings and applications, with an easy way to switch between the current and the new algo within the same perl version.
Both Rust and Go have ports of Google's Swiss Tables, which use less memory and are faster because they make use of AES and SSE instructions.
Thanks for the links.
There's a lot going unsaid in that article (not surprised, we're not the intended audience).
From my further research, Perl resizes based on load factor, typically >75% (depends on the implementation). However, collisions are handled with a linked list, so in the worst case, a single bucket would hold all of the entries.
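For anyone who wants to poke at this from Perl code: on reasonably recent perls, core Hash::Util exposes bucket-inspection helpers (bucket_ratio, num_buckets, used_buckets; they may not be available on older versions). A minimal sketch that watches the bucket array grow as keys are added:

    use strict;
    use warnings;
    use Hash::Util qw(bucket_ratio num_buckets used_buckets);

    my %h;
    for my $i (1 .. 64) {
        $h{"key$i"} = $i;
        # Peek at the used/total bucket ratio at a few sizes to see when
        # perl decides to grow the bucket array.
        if ($i == 8 || $i == 16 || $i == 32 || $i == 64) {
            printf "%2d keys: ratio=%s (%d of %d buckets used)\n",
                $i, bucket_ratio(%h), used_buckets(%h), num_buckets(%h);
        }
    }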
The article and paper talk about "open-addressing". An open-addressing hash table is a data structure that uses probing to resolve collisions when storing keys in a fixed-length array. It's also known as closed hashing. (Nothing like confusing terminology.)
Apparently, Python uses open addressing with probing. The load factor threshold for resizing is 2/3.
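To make the open-addressing idea concrete, here is a toy sketch in Perl. This is not how perl's hashes or CPython's dicts actually work; the linear probing, the crude checksum hash, and the 2/3 resize threshold are just illustrative choices (the paper is precisely about smarter probing strategies than this):

    use strict;
    use warnings;

    # Toy open-addressing table: fixed-size parallel arrays for keys and
    # values, linear probing on collision, and a grow-and-reinsert resize
    # once the load factor passes 2/3. Purely illustrative.
    package ToyOAHash;

    sub new {
        my ($class, $size) = @_;
        $size ||= 8;
        return bless {
            keys => [ (undef) x $size ],
            vals => [ (undef) x $size ],
            used => 0,
            size => $size,
        }, $class;
    }

    sub _slot {    # find the slot holding $key, or the first empty slot
        my ($self, $key) = @_;
        my $i = unpack("%32W*", $key) % $self->{size};   # crude checksum hash
        while (defined $self->{keys}[$i] && $self->{keys}[$i] ne $key) {
            $i = ($i + 1) % $self->{size};               # linear probe
        }
        return $i;
    }

    sub set {
        my ($self, $key, $val) = @_;
        if (($self->{used} + 1) / $self->{size} > 2 / 3) {   # too full: double and reinsert
            my ($old_k, $old_v) = ($self->{keys}, $self->{vals});
            $self->{size} *= 2;
            $self->{keys} = [ (undef) x $self->{size} ];
            $self->{vals} = [ (undef) x $self->{size} ];
            $self->{used} = 0;
            for my $j (0 .. $#$old_k) {
                $self->set($old_k->[$j], $old_v->[$j]) if defined $old_k->[$j];
            }
        }
        my $i = $self->_slot($key);
        $self->{used}++ unless defined $self->{keys}[$i];
        $self->{keys}[$i] = $key;
        $self->{vals}[$i] = $val;
    }

    sub get {
        my ($self, $key) = @_;
        my $i = $self->_slot($key);
        return defined $self->{keys}[$i] ? $self->{vals}[$i] : undef;
    }

    package main;
    my $t = ToyOAHash->new;
    $t->set($_, length $_) for qw(foo bar baz quux);
    print $t->get('baz'), "\n";    # prints 3

The thing to notice is that everything lives in fixed-size arrays: a collision means probing for another slot, and the only way out of a crowded table is to allocate a bigger one and reinsert everything.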
Since Perl uses linked lists to "resolve" collisions, this paper doesn't appear to apply.
There was this weird fact that popped up: Instead of just doubling, Python picks the next power of two to optimize performance. I mean, it doesn't start with a power-of-two size?
There was nothing in the article that helped me understand why the paper makes anything better, except hand-wavy declarations that it's true.
I also started to read the linked paper. I didn't get very far; I didn't have the time to wade in, learn the terminology, and read the other papers.
-QM
--
Quantum Mechanics: The dreams stuff is made of
Re: Better Hash Tables?
by Anonymous Monk on Mar 05, 2025 at 04:56 UTC
IMO this doesn't apply to perl at all, because the whole concept of this paper is based on having a fixed-size array in which you are trying to find an unoccupied slot to push into. This is really more useful in memory-constrained environments and hardware / logic design.
A recent example of this I faced: I had 600 million key:value records on disk. You cannot load these into a Perl hash because of the memory requirements. However, you can store the disk offset of each record in an array, where the index into the array is determined by hashing the key. This lets the hash table fit within a few gigabytes of RAM. To fetch an item, you hash the key to get the index, look up the disk offset stored there, then read the key:value record from disk and confirm the key matches.
The only question is: how big should the array be? If it's close to 600 million, you are endlessly dealing with hash collisions. But if you make it bigger, say 1200 million, half your RAM is wasted. If you have a hash collision, you have no option but to re-hash, which yields a new, effectively random position. To fetch, you have to try the first candidate, load it from disk, and see if the key matches; if not, hash again and fetch the next candidate, since there must have been a collision.
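For the curious, a rough Perl sketch of that scheme is below. The names and numbers are mine, not the parent's actual code: the offsets live in one big packed string (8 bytes per slot) rather than a Perl array so the index stays compact, all-0xFF marks an empty slot, a collision is handled by re-hashing the key with an attempt counter (the "new random position" described above), and it assumes a 64-bit perl for unpack "Q".

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    my $SLOTS = 1_000_000;            # would be ~1e9 for 600 million records
    my $EMPTY = "\xFF" x 8;           # sentinel meaning "slot unused"
    my $table = $EMPTY x $SLOTS;      # one packed string, 8 bytes per slot

    sub slot_for {                    # key + attempt counter -> slot index
        my ($key, $attempt) = @_;
        my ($n) = unpack "Q", md5("$attempt:$key");   # needs a 64-bit perl
        return $n % $SLOTS;
    }

    sub store_offset {
        my ($key, $disk_offset) = @_;
        for my $attempt (0 .. 99) {
            my $slot = slot_for($key, $attempt);
            if (substr($table, $slot * 8, 8) eq $EMPTY) {
                substr($table, $slot * 8, 8) = pack "Q", $disk_offset;
                return $slot;
            }
            # slot taken: bump the attempt counter and re-hash
        }
        die "too many collisions - the table needs to be bigger\n";
    }

    # Lookup walks the same attempt sequence: unpack the stored offset,
    # seek to it in the data file, read the record, and check that its key
    # really matches; if it doesn't, that slot belonged to a colliding key,
    # so bump the attempt counter and try the next candidate.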
What this paper shows is that, when the array size is close to 600 million, there is a better method than simply re-hashing to a random new position, and it can boost performance.
It would be useful in situations like mine, but I don't see how it's useful to perl, which hangs a dynamically sized linked list off each bucket.
Re: Better Hash Tables?
by Jenda (Abbot) on Feb 18, 2025 at 21:39 UTC
From the article:
"Fullness can be described in terms of an overall percentage — this table is 50% full, that one’s 90% — but researchers often deal with much fuller tables. So instead, they may use a whole number, denoted by x, to specify how close the hash table is to 100% full. If x is 100, then the table is 99% full. If x is 1,000, the table is 99.9% full. This measure of fullness offers a convenient way to evaluate how long it should take to perform actions like queries or insertions."
Seems once again researchers are researching exactly the wrong thing. I mean, if the table is 99% full, you should have increased its size a long, long time ago. If it's even more full, you are an ... you are clearly not interested in doing anything practical, but rather seem to be working towards a degree in theoretical something-or-other.
While it's possible that there's something of some practical use in that paper (so far I have only read the article), I would not hold my breath. If the topic weren't already pathologically overstuffed hashes, then there might be something in it for the sanely proportioned ones too, but...
Jenda
1984 was supposed to be a warning,
not a manual!
If you read my comment, you can see me stating openly that I only react to the article, not to the paper. :-)
I'm not through the paper yet, and I admit I would prefer it if they stated the algorithm (also) using (pseudo)code rather than (just) a mix of English and maths, but so far I really do not think we will see this in perl any time soon. The algorithm looks fairly complicated, and while I have no reason to dispute the asymptotic behavior, I would expect the real-world performance for hashes that are not excessively full to be rather bad. I do believe that for real-world applications it will be better to allocate more memory for the table than to complicate the code.
Jenda
1984 was supposed to be a warning,
not a manual!
Re: Better Hash Tables?
by Anonymous Monk on Feb 19, 2025 at 22:31 UTC
Is there really no code with benchmarks to go along with the paper?
"Beware of bugs in the above code; I have only proved it correct, not tried it."
--
Donald Knuth
No code, no benchmarks. This seems to be a purely theoretical paper, from what little I can gather (I pretty much gave up reading it).
So we basically don't know whether this algorithm is actually faster on modern processor architectures than the old one, or under which circumstances. In most algos these days, you have to optimize for a time/space tradeoff. For example:
- A sort algorithm might be faster if it takes more steps, but keeps the data it is currently working on in L1 cache.
- A database takes lots of extra processing steps to generate an index of a table and keep it up to date in RAM. But for searching, it can quickly-ish scan a few gigabytes in RAM instead of reading terabytes of data from a spinning disk.
- GPU data processing is weird. If-then-else statements get turned into complicated mathematical operations (slow, and they need a special compiler) so the GPU can iterate over multiple values at once (SIMD), ideally processing the whole dataset in a single pass without jumps and conditionals.
- On older 8/16-bit processors, developers used (big for the time) lookup tables for some mathematical operations like sin() or sqrt() whenever they could spare the memory, because it was faster to look up the result than to calculate it. On modern processors, it's very much the opposite (a rough sketch of that kind of benchmark is below).
Without testing with real-world hardware and data, it's hard, if not impossible, to tell how well an algorithm will perform and what tradeoffs can be made under which circumstances to optimize it.
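As a small illustration of that last point, the sin() lookup table from the list above is easy to benchmark with the core Benchmark module. Which side wins, and by how much, depends entirely on the CPU, the perl build and the table size, which is rather the point:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Old-school lookup table: sin() precomputed at 0.001-radian steps.
    my @sin_table = map { sin($_ / 1000) } 0 .. 6283;

    my $x    = 1.234;
    my $sink = 0;

    # Compare direct computation against the table lookup; each sub runs
    # for at least 2 CPU seconds and cmpthese prints a comparison chart.
    cmpthese(-2, {
        direct => sub { $sink = sin($x)                    for 1 .. 1000 },
        table  => sub { $sink = $sin_table[int($x * 1000)] for 1 .. 1000 },
    });

On the old 8-bit machines the table won by a mile; on a modern box, don't be surprised if it doesn't.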
Re: Better Hash Tables?
by etj (Priest) on Feb 20, 2025 at 19:14 UTC
Summarised: "I haven't read (much of) the actual paper that was published, but somehow I still have loudly-expressed opinions on the topic!"