PerlMonks  

Re: Re: A short meditation about hash search performance

by pg (Canon)
on Nov 15, 2003 at 21:47 UTC (#307387)


in reply to Re: A short meditation about hash search performance
in thread A short meditation about hash search performance

Thanks liz for the add-on information.

This surely shortens the longer queue(s), if it kicks in at the right time. So what it says is that the chance of running into the worst case in the analysis I gave is probably reduced.

However this does not affect the analysis on average performance.

And still, O(1) is not reachable unless each element resolves to a unique key ;-) (If that's the case, the document liz provided shall not be there, as the queue length would always be 1, and there is no need to shorten it. The fact that such a piece of information is there clearly indicates the opposite.)

Update:

I have read liz's reply and especially her update, and yes, I agree that Perl must only kick in the rehash based on some carefully calculated justification, considering the cost of the rehash.

The interesting and mysterious part is what that justification is... (In a private chat, liz pointed me to hv.c and HV_MAX_LENGTH_BEFORE_SPLIT.)


Re: Re: Re: A short meditation about hash search performance
by liz (Monsignor) on Nov 15, 2003 at 21:57 UTC
    So what it says is that the chance of running into the worst case in the analysis I gave is probably reduced.

    Indeed. The impetus for the random key hashing scheme was the potential for a DoS attack when a fixed key hashing scheme was used. So 5.8.1 introduced a random seed for hashing keys. However, for long-running perl processes (think mod_perl), it was conceivable that the hash seed could be guessed from the performance of the program on various inputs. Since there was a binary compatibility issue as well, schemes were tried out to fix both.

    Once people realized you're really talking about a general performance issue, it started to make sense to make the algorithm self-adapting depending on the length of the lists of identical hash keys.

    Abigail-II did a lot of benchmarking on it. Maybe Abigail-II would like to elaborate?

    Liz

    Update:
    (If that's the case, the document liz provided shall not be there, as the queue length would always be 1 ...

    A same hash key list length of 1 for all hash keys, would be optimal if there were no other "costs" involved. However, the re-hashing of existing keys is not something to be done lightly, especially if the number of existing keys is high. So you need to find the best possible combination of same hash key list length and re-hashing. In that respect, the ideal same hash key list length is not 1!

      Abigail-II did a lot of benchmarking on it. Maybe Abigail-II would like to elaborate?
      The benchmark was fairly simple: take about a million different words, insert them in a hash, measure how long it takes and what the average chain length is. They are all common words: a combination of various English wordlists I once grabbed from a puzzle site, and a long list of Dutch words. No specially prepared input. The average chain length was 1.27 on 5.8.0, 5.8.1, 5.8.2-RC1 and 5.8.2-RC2. The only interesting thing was the time: it took about 17.5 seconds on 5.8.0, 5.8.1 and 5.8.2-RC2, and almost 4 seconds less on 5.8.2-RC1.
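A rough way to reproduce the shape of this kind of measurement is sketched below in Python. It is not Perl's hv.c and not the benchmark that was actually run: `average_chain_length`, the bucket count, and the synthetic keys are illustrative assumptions (the original word lists are not available here), and "average chain length" is assumed to mean the average over non-empty buckets. The figure it prints is therefore not comparable to the 1.27 quoted above.

```python
import time

def average_chain_length(keys, num_buckets):
    """Average length of the non-empty chains after inserting all keys."""
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        chain = buckets[hash(key) % num_buckets]
        if key not in chain:          # ignore duplicate keys
            chain.append(key)
    used = [len(chain) for chain in buckets if chain]
    return sum(used) / len(used)

# Stand-in input (assumption): synthetic "words" instead of real word lists.
words = [f"word{i}" for i in range(100_000)]

start = time.perf_counter()
avg = average_chain_length(words, num_buckets=131_072)
elapsed = time.perf_counter() - start
print(f"average chain length {avg:.2f}, {elapsed:.3f} s")
```

Running the same script under different interpreters or hash functions gives the two numbers the benchmark compares: the average chain length and the wall-clock time.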
      A same hash key list length of 1 for all hash keys, would be optimal if there were no other "costs" involved. However, the re-hashing of existing keys is not something to be done lightly, especially if the number of existing keys is high. So you need to find the best possible combination of same hash key list length and re-hashing. In that respect, the ideal same hash key list length is not 1!
      I do not agree with the latter conclusion. The best possible combination of max chain length and re-hashing depends on the ratio of the number of inserts to the number of queries (for the sake of simplicity, let's not consider deletes). The lower this ratio is (that is, the more queries you have), the more time you can spend on inserts to get better overall performance. That is, if you have enough queries, it pays to have a max chain length of 1.

      Abigail

Re: Re: Re: A short meditation about hash search performance
by Anonymous Monk on Nov 15, 2003 at 22:47 UTC

    Once again you have posted a meditation in which you have made claims about Perl performance that differ vastly from reality. Again, a little bit of research on your part would have revealed the re-hashing algorithm in place to deal with hash collisions. My suggestion for you is to read through the Perl source tree before you post about perceived issues or dogma relating to Perl performance.

Re: A short meditation about hash search performance
by Abigail-II (Bishop) on Nov 16, 2003 at 02:40 UTC
    And still, O(1) is not reachable unless each element resolves to a unique key ;-)

    Man, this is *so* wrong. First of all, the above statement is not true for hashes in general. Even if a billion elements hash to the same key, you at most have to search a billion elements. And a billion differs from 1 only by a constant - so that's O(1). Second, it's especially not true in 5.8.2, because it will increase the hash size (which leads to a different hash function) when the chains get too large.

    Next time, could you please get your facts straight before posting FUD?

    Abigail

      "And a billion differs from 1 only by a constant - so that's O(1)"

      You obviously don't understand what O(1) means.

      Say we have an array of 1 billion elements. Let's look at two different search algorithms:

      1. Search from beginning to end, going thru each element one by one, until you hit what you are searching for. In the worst case (the element is at the end of the array), you have to hit 1 billion elements, but according to you, that's O(1). I say it is O(n). We never put a restriction saying that an array can at most contain 1 billion elements (so the size of an array in general is not a constant, although it is a constant for a given array at one given observation point.)
      2. Do a binary search (on a sorted array). In the worst case, you have to hit about log2(1 billion) ~ 30 elements. I call this O(log2(n)); according to you it is also O(1).

      As everyone knows, the performance of those two approaches is very different, but according to your theory, they are both O(1)! The math here is so off! Well... I certainly don't mind if you insist on your idea, but please don't confuse the general public.
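The gap between the two approaches can be counted directly. The following is a minimal Python sketch, not code from the thread: the function names are hypothetical, and the sizes are kept well below 1 billion so the linear scan finishes quickly. It counts how many elements each search examines in its worst case.

```python
def linear_search_steps(items, target):
    """Scan left to right; return how many elements were examined."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            break
    return steps

def binary_search_steps(sorted_items, target):
    """Halve the search range each time; return how many elements were examined."""
    lo, hi = 0, len(sorted_items) - 1
    steps = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        steps += 1
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps

for n in (1_000, 1_000_000):
    data = range(n)   # already sorted, and indexable without building a list
    last = n - 1      # the last element is the worst case for the linear scan
    print(n, linear_search_steps(data, last), binary_search_steps(data, last))
```

The linear count grows in step with n, while the binary count grows by roughly one for each doubling of n.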

      What you said would be right if we put a restriction saying that a hash can contain at most 1 billion elements, as O(1 billion) has the same complexity as O(1), even though 1 billion is much bigger than 1.

      However, O(n) is more complex than O(1 billion); even compared with O(1 billion ** 1 billion), O(n) is still more complex. Why? Because n is a variable, which can grow without bound. 1 billion ** 1 billion is huge, but n is unbounded, and eventually it will pass 1 billion ** 1 billion. In our context, please remember that the size of a hash is a variable (one that can potentially grow without bound), and your analysis has to reflect this fact. Don't confuse it with the size of a given hash at a given time.

        All~

        I am just referring to the two posts immediately above this, but I must point out that pg is correct. Despite what the points on either node may say...

        The size of a hashtable is a variable (usually n), and the pathological case of inserting everything into the same bucket provides O(n) access for a simple hashtable.
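The pathological case can be made concrete with a small Python sketch. `ChainedTable` and `worst_lookup` are hypothetical names, and the fixed bucket count of 1024 is a simplifying assumption; this is not how Perl's hashes are implemented. It compares the worst lookup cost when every key lands in one bucket against the cost when keys are spread out.

```python
class ChainedTable:
    """Minimal separate-chaining table with a fixed number of buckets."""

    def __init__(self, num_buckets, hash_fn):
        self.buckets = [[] for _ in range(num_buckets)]
        self.hash_fn = hash_fn

    def insert(self, key, value):
        chain = self.buckets[self.hash_fn(key) % len(self.buckets)]
        for entry in chain:
            if entry[0] == key:
                entry[1] = value      # key already present: overwrite
                return
        chain.append([key, value])

    def lookup_cost(self, key):
        """Number of key comparisons needed to find key (or to give up)."""
        chain = self.buckets[self.hash_fn(key) % len(self.buckets)]
        for comparisons, entry in enumerate(chain, start=1):
            if entry[0] == key:
                return comparisons
        return len(chain)

def worst_lookup(n, hash_fn):
    """Insert the keys 0..n-1 and return the most expensive lookup among them."""
    table = ChainedTable(num_buckets=1024, hash_fn=hash_fn)
    for k in range(n):
        table.insert(k, k)
    return max(table.lookup_cost(k) for k in range(n))

for n in (100, 1_000, 2_000):
    # lambda k: 0 sends every key to bucket 0; hash spreads small ints evenly
    print(n, worst_lookup(n, hash_fn=lambda k: 0), worst_lookup(n, hash_fn=hash))
```

With the constant hash function the worst lookup walks the whole chain, so its cost equals n.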

        The only way in which Abigail would be correct is if there were a guarantee that the overflow chain would NEVER exceed one billion entries.

        It is possible that the rehashing will prevent overflow chains from growing too large, but then one must consider the cost of rehashing the table. While that cost is not paid every time, it is likely a very large cost, and thus must be amortized across all calls to insert.

        In general, one could get O(1) access to a hash by ensuring that the overflow chains reach at most a constant length, but this will require rehashing when chains get too long. This would cause hash insertions to be greater than O(1).
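A minimal Python sketch of this trade-off follows. It is an illustration under stated assumptions, not Perl's actual algorithm: `SplittingTable` is a hypothetical class, the default limit of 14 simply echoes the figure quoted elsewhere in this thread for 5.8.2, and the table doubles at most once per insert, so keys that keep colliding after a doubling could still exceed the limit. The `moves` counter records the rehashing work so that it can be averaged over all inserts.

```python
class SplittingTable:
    """Chained table that doubles its bucket count when a chain passes max_chain."""

    def __init__(self, max_chain=14, num_buckets=8):
        self.max_chain = max_chain
        self.buckets = [[] for _ in range(num_buckets)]
        self.moves = 0  # keys re-inserted during rehashes

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key):
        chain = self._chain(key)
        if key in chain:
            return
        chain.append(key)
        if len(chain) > self.max_chain:
            self._grow()

    def _grow(self):
        """Double the bucket array and re-insert every existing key."""
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        for chain in old:
            for key in chain:
                self._chain(key).append(key)
                self.moves += 1

    def longest_chain(self):
        return max(len(chain) for chain in self.buckets)

table = SplittingTable()
n = 100_000
for k in range(n):
    table.insert(k)
print(table.longest_chain(), len(table.buckets), table.moves / n)
```

An individual insert that triggers a doubling touches every key in the table, but because the table doubles each time, the total number of moves stays proportional to the number of inserts for well-spread keys like these.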

        At heart it is a question of trading one cost for another...

        Boots
        ---
        Computer science is merely the post-Turing decline of formal systems theory.
        --???
        You obviously don't understand what O(1) means.
        Let's see. The definition of big O is:
        f(n) = O (g (n)) iff there are an M > 0 and a c > 0 such that for all m > M, 0 <= f(m) <= c * g (m). [1] [2] [3]
        I don't have any problem understanding it. In layman's terms, it means that a function f of n is in the order of g of n if, and only if, there's a constant such that, if n gets large enough, the value of f is at most the value of g times said constant.
        Search from beginning to end, going thru each element one by one, until you hit what you are searching for. In the worst case (the element is at the end of the array), you have to hit 1 billion elements, but according to you, that's O(1). I say it is O(n). We never put a restriction saying that an array can at most contain 1 billion elements (so the size of an array in general is not a constant, although it is a constant for a given array at one given observation point.)
        Hello? We never put a restriction on the size? Come again. What do you call:
        And still, O(1) is not reachable unless each element resolves to a unique key ;-)
        That's a restriction of 1. You started out by putting restrictions on it, claiming that only if there's a restriction of a size of 1, the search algorithm is O (1). I on the other hand pointed out that as long as there is a restriction on the limit of the chain, it doesn't matter what the restriction is, 1, 14 (for 5.8.2), or a billion. If there's a restriction on the size, even with a linear search it's O (1). Here's a proof:
        Suppose the chain is limited to length K, where K is a constant, independent of the number of keys in the hash. Searching for a key is a two-step process: first we need to find the bucket the key hashes to, then we need to find the key in the associated chain. Finding the right bucket takes constant time. Traversing the chain takes at most K * e time, for some constant e. So, searching for the element takes at most:
                                       e * K + O (1),  e >= 0
             {definition of O()}  <=   e * K + d * 1,  e >= 0, d >= 0
             {arithmetic}         ==  (e * K + d) * 1, e >= 0, d >= 0
             {c == e * K + d}     ==   c * 1
             {c > 0}              ==   O (1).
                                                                q.e.d.   
        
        I won't deny the performance will be rather lousy, but it's still O (1). Which proves that big-Oh doesn't say everything.
        [1] Cormen, Leiserson, and Rivest: Introduction to Algorithms. MIT Press, 1990. p. 26.
        [2] Knuth: The Art of Computer Programming, Volume 1: Fundamental Algorithms. Third Edition. Addison-Wesley, 1997. p. 107.
        [3] Sedgewick and Flajolet: Analysis of Algorithms. Addison-Wesley, 1996. p. 4.

        Abigail
