http://www.perlmonks.org?node_id=307382

I read a post today saying that hash search performance is O(1), which is wrong.

I am posting this short meditation to correct that expectation. (I don't know how many times this false analysis has appeared in the past.)

O(1) would mean that the worst-case performance is not a function of the number of elements in the hash (n); in other words, that the performance is constant no matter how many elements you store in the hash.

That's not true.

The worst-case hash search performance is O(n). This happens when all the hash elements happen to resolve to the same hash key. In that case, a hash is no different from a list.

If the number of valid hash keys is m, and you have n elements in the hash, the average search performance is o(n/2m), since the average queue length under a valid hash key is then n/m. I assume the search algorithm for a queue simply walks it from beginning to end, so on average you have to go through half the length of the queue.
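
To make the worst case concrete, here is a toy chained hash in Perl (a hand-rolled sketch, not perl's internal implementation; the fixed bucket count and the deliberately broken hash function are assumptions for the demo). When every key resolves to the same bucket, a lookup degenerates into a linear scan of all n elements:

  use strict;
  use warnings;

  # A tiny chained hash: an array of $m buckets, each bucket holding
  # an array ref of [key, value] pairs.
  my $m       = 8;
  my @buckets = map { [] } 1 .. $m;

  # Deliberately degenerate hash function: every key lands in bucket 0,
  # so all n elements end up in a single queue (the worst case above).
  sub bad_hash { return 0 }

  sub insert {
      my ($key, $value) = @_;
      push @{ $buckets[ bad_hash($key) ] }, [ $key, $value ];
  }

  sub lookup {
      my ($key) = @_;
      # Walk the queue from the beginning: O(queue length) comparisons,
      # which is O(n) when everything collides.
      for my $pair ( @{ $buckets[ bad_hash($key) ] } ) {
          return $pair->[1] if $pair->[0] eq $key;
      }
      return undef;
  }

  insert("key$_", $_) for 1 .. 1000;
  print lookup("key1000"), "\n";    # scans past 999 entries first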


Re: A short meditation about hash search performance
by liz (Monsignor) on Nov 15, 2003 at 21:37 UTC
    Perldelta 5.8.2 says:

    The hash randomisation introduced with 5.8.1 has been amended. It transpired that although the implementation introduced in 5.8.1 was source compatible with 5.8.0, it was not binary compatible in certain cases. 5.8.2 contains an improved implementation which is both source and binary compatible with both 5.8.0 and 5.8.1, and remains robust against the form of attack which prompted the change for 5.8.1.

    What it doesn't say is that an adaptive algorithm has been implemented that will cause a re-hashing of keys if any list (of identical hash keys) becomes too long. At least, that's what I remember from following the discussion from a distance. You might want to check out the p5p archives for specifics.
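
    A rough sketch of that idea, for the curious (this is not the code in hv.c; the threshold, the doubling policy, and the seeded-MD5 stand-in for perl's hash function are all inventions of this sketch): when any chain grows past a limit, double the bucket count, pick a new seed, and redistribute the keys.

      use strict;
      use warnings;
      use Digest::MD5 qw(md5);

      my $MAX_CHAIN = 8;                    # invented threshold
      my $m         = 8;                    # current bucket count
      my $seed      = 0;
      my @buckets   = map { [] } 1 .. $m;

      sub bucket_of {
          my ($key) = @_;
          # first 32 bits of a seeded MD5, reduced to a bucket index
          return unpack('N', md5("$seed:$key")) % $m;
      }

      sub insert {
          my ($key, $value) = @_;
          my $chain = $buckets[ bucket_of($key) ];
          push @$chain, [ $key, $value ];
          rehash() if @$chain > $MAX_CHAIN;
      }

      sub rehash {
          my @pairs = map { @$_ } @buckets;  # collect every [key, value]
          $m   *= 2;                         # more buckets ...
          $seed = int rand 2**31;            # ... and a new seed, so keys
                                             # that collided before get
                                             # spread out again
          @buckets = map { [] } 1 .. $m;
          push @{ $buckets[ bucket_of($_->[0]) ] }, $_ for @pairs;
      }

      insert("key$_", $_) for 1 .. 2000;
      my ($longest) = sort { $b <=> $a } map { scalar @$_ } @buckets;
      printf "%d buckets, longest chain %d\n", $m, $longest;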

    Liz

      Thanks, liz, for the additional information.

      This surely shortens the longer queue(s), if it kicks in at the right time. So what it says is that the chance of running into the worst case I described is probably reduced.

      However, this does not affect the analysis of average performance.

      And O(1) is still not reachable, unless each element resolves to a unique key ;-) (If that were the case, the document liz provided would not be there, as the queue length would always be 1 and there would be no need to shorten it. The fact that such information is there clearly indicates the opposite.)

      Update:

      Having read liz's reply and her update, especially her update: yes, I agree that Perl must only kick in the rehash based on some carefully calculated justification, considering the cost of the re-hash.

      The interesting and mysterious part is what that justification is... (in a private chat, liz pointed me to hv.c and HV_MAX_LENGTH_BEFORE_SPLIT)

        So what it says is that the chance of running into the worst case I described is probably reduced.

        Indeed. The impetus for the random key hashing scheme was the potential for a DoS attack when a fixed key hashing scheme was used. So 5.8.1 introduced a random seed for hashing keys. However, for long-running perl processes (think mod_perl), it was conceivable that the hash seed could be "guessed" from the performance of the program on various inputs. Since there was a binary compatibility issue as well, schemes were tried out to fix both.

        Once people realized this was really a general performance issue, it started to make sense to make the algorithm self-adapting, depending on the length of the lists of identical hash keys.

        Abigail-II did a lot of benchmarking on it. Maybe Abigail-II would like to elaborate?

        Liz

        Update:
        (If that were the case, the document liz provided would not be there, as the queue length would always be 1 ...

        A same-hash-key list length of 1 for all hash keys would be optimal if there were no other "costs" involved. However, re-hashing the existing keys is not something to be done lightly, especially if the number of existing keys is high. So you need to find the best possible trade-off between same-hash-key list length and re-hashing. In that respect, the ideal same-hash-key list length is not 1!

        And O(1) is still not reachable, unless each element resolves to a unique key ;-)

        Man, this is *so* wrong. First of all, the statement above is not true for hashes in general. Even if a billion elements hash to the same key, you have to search at most a billion elements. And a billion differs from 1 only by a constant - so that's O(1). Second, it's especially not true in 5.8.2, because it will increase the hash size (which leads to a different hash function) when the chains get too large.

        Next time, could you please get your facts straight before posting FUD?

        Abigail

        Once again you have posted a meditation in which you have made claims about Perl performance which differ vastly from reality. Again, a little bit of research on your part would have revealed the re-hashing algorithm in place to deal with hash collisions. My suggestion is that you read through the Perl source tree before you post about perceived issues or dogma relating to Perl performance.

Re: A short meditation about hash search performance
by tilly (Archbishop) on Nov 17, 2003 at 16:40 UTC
    You appear to have a linguistic confusion about big-O notation, followed up by a common misconception of what O(1) means.

    A function f(n) is O(1) if there are constants K and N such that for all n>N, f(n)<K. Plenty of functions other than straightforward constants meet that definition.

    An algorithm is not big-O of anything. Only functions are. When it comes to hashes, the following three statements can be correct:

    1. The average hash search performance is O(1).
    2. The worst case hash search performance is O(n).
    3. The average hash search performance is O(n).
    The first statement is why people use hashes. On average - and usually we are average - they are fast. The second is what you were pointing out as a correction explaining why the first is wrong. It doesn't correct it; it is an entirely distinct point. The third statement is true because big-O notation is only about the existence of upper bounds. I point it out to mention that common use of big-O notation among hackers is distinctly different from what you will find in Knuth and other official references.
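
    If you want to see the first statement in practice, a quick Benchmark sketch along these lines (the sizes and the key format are arbitrary) should show the per-lookup rate staying in the same ballpark as the hash grows:

      use strict;
      use warnings;
      use Benchmark qw(countit);

      # Count how many single-key lookups fit into one CPU second for
      # hashes of very different sizes; "average O(1)" in hacker usage
      # means these numbers stay roughly flat.
      for my $n (1_000, 100_000, 1_000_000) {
          my %h;
          $h{"key$_"} = $_ for 1 .. $n;
          my $t = countit(1, sub { my $v = $h{ "key" . (1 + int rand $n) } });
          printf "n = %9d: about %d lookups/sec\n", $n, $t->iters;
      }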

    I should note that technically speaking, Perl's hash performance isn't big-O of anything. Perl's hashing algorithms break down for datasets that do not fit in current memory and cannot be addressed by 32 or 64-bit pointers (depending on the platform). I would have to look, but I think that the hash function won't scale beyond a billion or so buckets. Good luck finding someone who cares though.

    Silly trivia. Following a pointer is not really O(1). The amount of work taken depends on how long the pointer is, and therefore the size of the possible address space. People only notice this when they are on a platform where they have the choice of working with two different sizes of data representation, like 16 vs 32 bit. Or 32 vs 64 bit. Going to the larger size brings with it a necessary speed hit.

      Man, you were so hard on him. I think he meant the functions behind the algorithms have O(log2(n)) asymptotic performance. Big-O notation has many uses in other fields, but in computing it is used with its basic meaning: to remark on the asymptotic character of the enclosed expression for the function being referred to. Man, it seemed you were dying to get the book off the shelf and spit out some theoretical speech. I think you got the point the first time, didn't you? Peace
        There are so many things to respond to here that I have no idea where to begin. So I'll list them randomly:
        • Where do you get the O(log2(n)) from? With a flat memory model, hashing algorithms are not O(log2(n)).
        • Re-read my post and you'll see that I explicitly acknowledge the fact that hackers do not exactly use big-O notation in the way that Knuth did. Why do you think that I said otherwise?
        • Re-read the root post and you'll discover that pg both misunderstood big-O notation as used by Knuth, and as used by hackers. As far as most people are concerned, a hash lookup is O(1). That he thought otherwise was due to a misunderstanding on his part about what big-O notation means.
        • Re-read the root post and you'll find lots of incorrect attempted pedantry. When someone tries to get pedantic, I think it is fair and reasonable to be pedantic back.
Re: A short meditation about hash search performance
by Anonymous Monk on Nov 15, 2003 at 21:58 UTC

    The expected or average search time is O(1) under reasonable assumptions; the worst case is O(n) for the fully degenerate case. Insertion and deletion can be done in O(1) worst-case time. Most people just state that basic hash operations can be done in O(1) average time.

      Well, if you can't do queries in O(1) time, you can't do deletes in O(1) time (because to delete something, you first need to find it), and you can only do inserts in O(1) if you accept duplicates - which Perl hashes don't.

      Abigail

        Perhaps I need to re-review the analysis of CLR-1990*, because they clearly state on page 223:
        The worst-case running time for insertion is O(1). For searching, the worst-case running time is proportional to the length of the list: we shall analyze this more closely below. Deletion of an element x can be accomplished in O(1) time if the lists are doubly linked.

        Note, I wasn't referring to perl's hashes in particular, just hashes in general.

        * Cormen, Leiserson, Rivest 1990: Introduction to Algorithms. MIT Press, Cambridge Massachusetts.
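
        The O(1) deletion in the quoted passage assumes you already hold a reference to the element; unlinking it from a doubly linked chain is then just a couple of pointer updates, independent of the chain length. A minimal sketch, with nodes as plain hashrefs:

          use strict;
          use warnings;

          # Deleting a node you already have in hand from a doubly linked
          # list is two pointer updates - O(1) however long the list is.
          sub unlink_node {
              my ($node) = @_;
              $node->{prev}{next} = $node->{next} if $node->{prev};
              $node->{next}{prev} = $node->{prev} if $node->{next};
              @{$node}{qw(prev next)} = (undef, undef);
          }

          # Build a three-node chain n1 <-> n2 <-> n3, then drop n2.
          my ($n1, $n2, $n3) =
              map { { key => $_, prev => undef, next => undef } } qw(a b c);
          ($n1->{next}, $n2->{prev}) = ($n2, $n1);
          ($n2->{next}, $n3->{prev}) = ($n3, $n2);

          unlink_node($n2);
          print $n1->{next}{key}, "\n";    # prints "c"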

Re: A short meditation about hash search performance
by Courage (Parson) on Nov 16, 2003 at 13:27 UTC
    The best search performance that could be reached is O(log(N)), by using a balanced tree.
    You could implement this as a Perl module, although it would be much slower than the internal O(N) implementation.
    Also do not forget that adding to the search structure also counts, and I think adding to a hash costs O(1).
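
    For comparison, here is what that O(log(N)) behaviour looks like, sketched with a binary search over a sorted list of pairs rather than a real balanced tree (the tree bookkeeping is left out to keep it short; each probe halves the remaining range):

      use strict;
      use warnings;

      # Sorted [key, value] pairs; zero-padding keeps string order
      # identical to numeric order.
      my @sorted = map { [ sprintf("key%06d", $_), $_ ] } 1 .. 100_000;

      sub bsearch {
          my ($key) = @_;
          my ($lo, $hi) = (0, $#sorted);
          while ($lo <= $hi) {
              my $mid = int( ($lo + $hi) / 2 );
              my $cmp = $sorted[$mid][0] cmp $key;
              return $sorted[$mid][1] if $cmp == 0;
              if ($cmp < 0) { $lo = $mid + 1 } else { $hi = $mid - 1 }
          }
          return undef;
      }

      print bsearch("key012345"), "\n";    # ~17 probes, not 100_000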

    It is known that the qsort commonly used in the C library is also O(N*N) in the worst case and something like O(N*log(N)) in real life.
    Also, the sorting method changed to mergesort in perl-5.8.0; it was qsort in earlier perls.

    Best regards,
    Courage, the Cowardly Dog

Re: A short meditation about hash search performance
by Jenda (Abbot) on Sep 10, 2007 at 16:16 UTC

    I believe you mean "the number of buckets", not the number of valid hash keys ... there is a catch though ... the m usually depends on n! How exactly depends, but I believe in the Perl builtin implementation it's 2*n rounded to the nearest power of 2. Which means your o(n/2m) becomes o(n/4n), which becomes o(1/4) and ... well ... constants are irrelevant in the o() notation, thus it's o(1). QED.

    The average case of hash lookup is o(1) even with the naive implementation; the worst case is O(n) for the naive implementation, but there are ways around that. They make the implementation more complex and insertion more costly, but they are possible: you can rehash once the list in a bucket gets too long, or you can use binary trees instead of lists in the buckets.
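
    If you want to watch m track n on your own perl, something like this should do it on a reasonably recent perl (it assumes Hash::Util exports num_buckets, which recent perls do; on the 5.8-era perls of this thread the same information came out of scalar(%h) as a "used/total" string instead):

      use strict;
      use warnings;
      use Hash::Util qw(num_buckets);

      # As n grows, perl keeps growing the number of buckets m, so the
      # average chain length n/m stays bounded by a small constant -
      # which is why the average lookup is O(1).
      my %h;
      for my $n (10, 100, 1_000, 10_000, 100_000) {
          $h{"key$_"} = $_ for keys(%h) + 1 .. $n;
          my $m = num_buckets(%h);
          printf "n = %6d  buckets = %6d  n/m = %.2f\n", $n, $m, $n / $m;
      }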

Re: A short meditation about hash search performance
by toro (Beadle) on Jul 15, 2015 at 23:03 UTC
    "Hash search performance is Θ(1)". ←Is this correct?