Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

A short meditation about hash search performance

by pg (Canon)
on Nov 15, 2003 at 21:13 UTC ( #307382=perlmeditation: print w/ replies, xml ) Need Help??

I read a post today saying that hash search performance is O(1), which is wrong.

I am posting this short meditation to correct the expectation. (I don't know how many times this false analysis have appeared in the past)

O(1) means that the worst case performance is not a function of the number of elements in the hash (n). Or the performance is a constant regardless how many elements you store in the hash.

That's not true.

The worst case hash search performance is O(n). This happens when all the hash elements happen to resolve the same hash key. In this case, a hash is no difference than a list.

If the number of valid hash keys is m, and you have n elements in a hash, the average search performance is o(n/2m), as the average queue length under a valid hash key in this case is n/m. I assume the search algorithm for the queue is simply go from the beginning to the end, so in average you have to go thru half length of the queue.

Comment on A short meditation about hash search performance
Re: A short meditation about hash search performance
by liz (Monsignor) on Nov 15, 2003 at 21:37 UTC
    Perldelta 5.8.2 says:

    The hash randomisation introduced with 5.8.1 has been amended. It transpired that although the implementation introduced in 5.8.1 was source compatible with 5.8.0, it was not binary compatible in certain cases. 5.8.2 contains an improved implementation which is both source and binary compatible with both 5.8.0 and 5.8.1, and remains robust against the form of attack which prompted the change for 5.8.1.

    What it doesn't say, is that an adaptive algorithm has been implemented that will cause a re-hashing of keys if any list (of identical hash keys) becomes too long. At least, that's what I remember from following the discussion from a distance. You might want to check out the p5p archives for specifics.

    Liz

      Thanks liz for the add-on information.

      This surely shortens the length of the longer queue(s), if it kicks in at the right time. So what it says is that the chance run into the worst analysis I given, is probably reduced.

      However this does not affect the analysis on average performance.

      And still O(1) is not reachable, unless each element resolve a unique key ;-) (If that's the case, the document liz provided shall not be there, as the queue length would always be 1, and there is no need to shorten it. The fact there is such piece of info there, clearly indicates the opposite.)

      Update:

      Have read liz's reply and her update, especially her update, yes, I agree that Perl must only kick in the rehash base on certain carefully calculated justification, considering the cost of the re-hash.

      The interesting and myterous part is what that justification is...(in a private chatting, liz pointed me to hv.c and HV_MAX_LENGTH_BEFORE_SPLIT)

        So what it says is that the chance run into the worst analysis I given, is probably reduced.

        Indeed. The impetus for the random key hashing scheme, was the potential for a DOS attack when a fixed key hashing scheme was used. So 5.8.1 introduced a random seed for hashing keys. However, for long running perl processes (think mod_perl), it was thinkable that the hash seed was "guessable" from performance of the program on various inputs. Since there was a binary compatibility issue as well, schemes were tried out to fix both.

        Once people realized you're really talking about a general performance issue, it started to make sense to make the algorithm self-adapting depending on the length of the lists of identical hash keys.

        Abigail-II did a lot of benchmarking on it. Maybe Abigail-II would like to elaborate?

        Liz

        Update:
        (If that's the case, the document liz provided shall not be there, as the queue length would always be 1 ...

        A same hash key list length of 1 for all hash keys, would be optimal if there were no other "costs" involved. However, the re-hashing of existing keys is not something to be done lightly, especially if the number of existing keys is high. So you need to find the best possible combination of same hash key list length and re-hashing. In that respect, the ideal same hash key list length is not 1!

        Once again you have posted a meditation in which you have made claims about Perl performance which differs vastly from reality. Again, a little bit of research on your part would have revealed the re-hashing algorithm in place to deal with hash collisions. My suggestion for you is to read through the Perl source tree, before you post about perceived issues or dogma relating to Perl performance.

        And still O(1) is not reachable, unless each element resolve a unique key ;-)

        Man, this is *so* wrong. First of all, the above statement is not for hashes in general. Even if a billion elements hash to the same key, you at most have to search a billion elements. And a billion differs from 1 only by a constant - so that's O(1). Second, it's especially not true in 5.8.2 because it will increase the hash size (which leads to a different hash function) when the chains get too large.

        Next time, could you please get your facts straight before posting FUD?

        Abigail

Re: A short meditation about hash search performance
by Anonymous Monk on Nov 15, 2003 at 21:58 UTC

    The epected or average search time is O(1) under reasonable assumptions, the worst case is O(n) for the fully degenerate case. Insertion and deletion can be done in O(1) worst case time. Most people just state that basic hash operations can be done in O(1) average time.

      Well, if you can't do queries in O(1) time, you can't do deletes in O(1) time (because to delete something, you first need to find it), and you can only do inserts in O(1) if you accept duplicates - which Perl hashes don't.

      Abigail

        Perhaps I need to re-review the analysis of CLR-1990*, because they clearly state on page 223:
        The worst-case running time for insertion is O(1). For searching, the worst-case running time is proportional to the length of the list: we shall analyze this more closely below. Deletion of an element x can be accomplished in O(1) time if the lists are doubly linked.

        Note, I wasn't referring to perl's hashes in particular, just hashes in general.

        * Cormen, Leiserson, Rivest 1990: Introduction to Algorithms. MIT Press, Cambridge Massachusetts.

Re: A short meditation about hash search performance
by Courage (Parson) on Nov 16, 2003 at 13:27 UTC
    Best search performance could be reached as O(log(N)) by using balanced tree.
    You could implement this as a perl module, although this will be much slower than internal implementation O(N).
    Also do not forget that adding into search list also counts, and I think adding into a hash would cost O(1)

    It is know that commonly used in C library qsort is also O(N*N) in worst case and something like O(N*LOG(N)) in real life.
    Also sorting method changed to mergesort in perl-5.8.0 and it was qsort for earlier perls.

    Best regards,
    Courage, the Cowardly Dog

Re: A short meditation about hash search performance
by tilly (Archbishop) on Nov 17, 2003 at 16:40 UTC
    You appear to have a linguistic confusion about big-O notation, followed up by a common misconception of what O(1) means.

    A function f(n) is O(1) if there are constants K and N such that for all n>N, f(n)<K. Plenty of functions other than straightforward constants meet that definition.

    An algorithm is not big-O of anything. Only functions are. When it comes to hashes, the following three statements can be correct:

    1. The average hash search performance is O(1).
    2. The worst case hash search performance is O(n).
    3. The average hash search performance is O(n).
    The first statement is why people use hashes. On average - and usually we are average - they are fast. The second is what you were pointing out as a correction explaining why the first is wrong. It doesn't correct it, it is an entirely distinct point. The third statement is true because big-O notation is only about the existence of upper bounds. I point it out to mention that common use of big-O notation among hackers is distinctly different from what you will find in Knuth and other official references.

    I should note that technically speaking, Perl's hash performance isn't big-O of anything. Perl's hashing algorithms break down for datasets that do not fit in current memory and cannot be addressed by 32 or 64-bit pointers (depending on the platform). I would have to look, but I think that the hash function won't scale beyond a billion or so buckets. Good luck finding someone who cares though.

    Silly trivia. Following a pointer is not really O(1). The amount of work taken depends on how long the pointer is, and therefore the size of the possible address space. People only notice this when they are on a platform where they have the choice of working with two different sizes of data representation, like 16 vs 32 bit. Or 32 vs 64 bit. Going to the larger size brings with it a necessary speed hit.

      Man you were so hard, I think he meant the functions behind the algorithms responded to O(log2(n)) asymptotic performance. The big o notation has many uses in other fields but in computing is used with its basic meaning, that's to remark the asymptotic character of the expression enclosed for the function being referred. Man it seemed you were dying to get the book out of the shelf and spit out some theoretical speech. I think you got the point first time, didn't you? Peace
        There are so many things to respond to here that I have no idea where to begin. So I'll list them randomly:
        • Where do you get the O(log2(n)) from? With a flat memory model, hashing algorithms are not O(log2(n)).
        • Re-read my post and you'll see that I explicitly acknowledge the fact that hackers do not exactly use big-O notation in the way that Knuth did. Why do you think that he said otherwise?
        • Re-read the root post and you'll discover that pg both misunderstood big-O notation as used by Knuth, and as used by hackers. As far as most people are concerned, a hash lookup is O(1). That he thought otherwise was due to a misunderstanding on his part about what big-O notation means.
        • Re-read the root post and you'll find lots of incorrect attempted pedantry. When someone tries to get pedantic, I think it is fair and reasonable to be pedantic back.
Re: A short meditation about hash search performance
by Jenda (Abbot) on Sep 10, 2007 at 16:16 UTC

    I believe you mean "the number of buckets" not the number of valid hash keys ... there is a catch though ... the m usually depends on n! How exactly, depends, but I beleive in the Perl builtin implementation its 2*n rounded to the nearest power of 2. Which means your o(n/2m) becomes o(n/4n) wich becomes o(1/4) and ... well ... constants are irrelevant in the o() notation thus it's o(1). QED.

    The average case of hash lookup even with the naive implementation is o(1), the worst case is O(n) for the naive implementation, but there are ways around that. They make the implementation more complex and insertion more costly, but they are possible. You can rehash once the list in a bucket gets too long or you can use binary trees instead of lists in the buckets.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://307382]
Approved by antirice
Front-paged by broquaint
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (5)
As of 2014-09-20 22:21 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (163 votes), past polls