Re: A short meditation about hash search performance by liz (Monsignor) on Nov 15, 2003 at 21:37 UTC 
Perldelta 5.8.2 says:
The hash randomisation introduced with 5.8.1 has been amended. It
transpired that although the implementation introduced in 5.8.1 was source
compatible with 5.8.0, it was not binary compatible in certain cases. 5.8.2
contains an improved implementation which is both source and binary
compatible with both 5.8.0 and 5.8.1, and remains robust against the form of
attack which prompted the change for 5.8.1.
What it doesn't say is that an adaptive algorithm has been implemented that will cause a rehashing of keys if any list (of identical hash keys) becomes too long. At least, that's what I remember from following the discussion from a distance. You might want to check out the p5p archives for specifics.
Liz

Thanks, liz, for the additional information.
This surely shortens the length of the longer queue(s), if it kicks in at the right time. So what it says is that the chance of running into the worst case I analyzed is probably reduced.
However, this does not affect the analysis of average performance.
And still, O(1) is not reachable unless each element resolves to a unique key ;) (If that's the case, the document liz provided would not be there, as the queue length would always be 1, and there would be no need to shorten it. The fact that such a piece of info is there clearly indicates the opposite.)
Update:
I have read liz's reply and her update. Especially after her update, yes, I agree that Perl must only kick in the rehash based on some carefully calculated justification, considering the cost of the rehash.
The interesting and mysterious part is what that justification is... (in a private chat, liz pointed me to hv.c and HV_MAX_LENGTH_BEFORE_SPLIT).
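To make the idea concrete, here is a toy sketch (my own illustration, with a made-up threshold and a crude hash function, nothing like the real hv.c code): a chained hash that doubles its bucket count, and therefore rehashes every existing key, as soon as any one chain grows past the threshold.

    use strict;
    use warnings;

    # Toy chained hash: an array of buckets, each bucket an array of
    # [key, value] pairs.  $MAX_CHAIN is a made-up stand-in for
    # HV_MAX_LENGTH_BEFORE_SPLIT.
    my $MAX_CHAIN = 8;
    my @buckets   = map { [] } 1 .. 8;

    sub crude_hash {                        # deliberately simple hash function
        my $h = 0;
        $h = ($h * 33 + $_) % 2**32 for unpack "C*", shift;
        return $h;
    }

    sub toy_store {
        my ($key, $value) = @_;
        my $chain = $buckets[ crude_hash($key) % @buckets ];
        for my $pair (@$chain) {            # this linear scan of one chain is
            if ($pair->[0] eq $key) {       # the part that costs O(chain length)
                $pair->[1] = $value;
                return;
            }
        }
        push @$chain, [ $key, $value ];
        rehash() if @$chain > $MAX_CHAIN;   # one chain too long? grow and rehash
    }

    sub rehash {
        my @old  = @buckets;
        @buckets = map { [] } 1 .. 2 * @old;    # double the number of buckets
        for my $pair (map { @$_ } @old) {       # move every existing key to
            push @{ $buckets[ crude_hash($pair->[0]) % @buckets ] }, $pair;
        }
    }

    toy_store("key$_", $_) for 1 .. 1000;
    printf "%d buckets for 1000 keys\n", scalar @buckets;

The rehash is the expensive part, which is why you only want to pay for it once a chain has demonstrably become a problem.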

So what it says is that the chance of running into the worst case I analyzed is probably reduced.
Indeed. The impetus for the random key hashing scheme was the potential for a DOS attack when a fixed key hashing scheme was used. So 5.8.1 introduced a random seed for hashing keys. However, for long-running perl processes (think mod_perl), it was conceivable that the hash seed was "guessable" from the performance of the program on various inputs. Since there was a binary compatibility issue as well, schemes were tried out to fix both.
Once people realized we were really talking about a general performance issue, it started to make sense to make the algorithm self-adapting, depending on the length of the lists of identical hash keys.
Abigail-II did a lot of benchmarking on it. Maybe Abigail-II would like to elaborate?
Liz
Update:
(If that's the case, the document liz provided would not be there, as the queue length would always be 1 ...
A same-hash-key list length of 1 for all hash keys would be optimal if there were no other "costs" involved. However, rehashing the existing keys is not something to be done lightly, especially if the number of existing keys is high. So you need to find the best possible trade-off between same-hash-key list length and rehashing. In that respect, the ideal same-hash-key list length is not 1!
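A back-of-the-envelope illustration (my numbers, not anything measured from the implementation): if you only rehash when the number of buckets doubles, then growing to n keys moves at most 1 + 2 + 4 + ... + n/2 < n keys in total across all the rehashes, i.e. a constant amount of extra work per insert on average. Trying to force every chain down to length 1 would mean rehashing far more often, for very little gain in lookup time.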


Once again you have posted a meditation in which you have made claims about Perl performance which differ vastly from reality. Again, a little bit of research on your part would have revealed the rehashing algorithm in place to deal with hash collisions. My suggestion for you is to read through the Perl source tree before you post about perceived issues or dogma relating to Perl performance.

And still, O(1) is not reachable unless each element resolves to a unique key ;)
Man, this is *so* wrong. First of all, the above statement is not true for hashes in general: even if a billion elements hash to the same key, you at most have to search a billion elements, and a billion differs from 1 only by a constant, so that's O(1). Second, it's especially not true in 5.8.2, because it will increase the hash size (which leads to a different hash function) when the chains get too large.
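You can even watch this from Perl space: on perls of this vintage a non-empty hash in scalar context reports "used/allocated" buckets (newer perls just return the key count there), so a quick sketch like this shows the bucket count climbing as keys are added:

    use strict;
    use warnings;

    # Rough sketch: watch the number of buckets grow as keys are added.
    my %h;
    for my $n (10, 100, 1_000, 10_000, 100_000) {
        $h{$_} = 1 for 1 .. $n;
        printf "%7d keys -> %s\n", $n, scalar(%h);
    }

The denominator doubling is the bucket count growing; with more buckets, keys that used to share a chain get spread out again.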
Next time, could you please get your facts straight before posting FUD?
Abigail




Re: A short meditation about hash search performance by Anonymous Monk on Nov 15, 2003 at 21:58 UTC 
The expected or average search time is O(1) under reasonable assumptions; the worst case is O(n) for the fully degenerate case. Insertion and deletion can be done in O(1) worst-case time. Most people just state that basic hash operations can be done in O(1) average time.

Well, if you can't do queries in O(1) time, you can't do deletes in O(1) time (because to delete something, you first need to find it), and you can only do inserts in O(1) if you accept duplicates, which Perl hashes don't.
Abigail

Perhaps I need to re-review the analysis of CLR1990*, because they clearly state on page 223:
The worst-case running time for insertion is O(1). For searching, the worst-case running time is proportional to the length of the list: we shall analyze this more closely below. Deletion of an element x can be accomplished in O(1) time if the lists are doubly linked.
Note, I wasn't referring to perl's hashes in particular, just hashes in general.
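For what it's worth, the O(1) deletion there is deletion given a pointer to the element x itself (finding x by key is the searching case), and splicing a node out of a doubly linked chain is constant time no matter how long the chain is. A rough sketch of that splice (my illustration, not the book's pseudocode):

    use strict;
    use warnings;

    # A chain node is { key => ..., value => ..., prev => ..., next => ... }.
    # Given a reference to the node itself, unlinking is O(1); the caller is
    # assumed to fix up the bucket's head pointer if this was the first node.
    sub unlink_node {
        my ($node) = @_;
        $node->{prev}{next} = $node->{next} if $node->{prev};
        $node->{next}{prev} = $node->{prev} if $node->{next};
        @{$node}{qw(prev next)} = (undef, undef);
        return $node;
    }

    # Tiny demo: a three-node chain a <-> b <-> c, then drop b.
    my ($n1, $n2, $n3) =
        map { +{ key => $_, value => uc($_), prev => undef, next => undef } } qw(a b c);
    ($n1->{next}, $n2->{prev}) = ($n2, $n1);
    ($n2->{next}, $n3->{prev}) = ($n3, $n2);
    unlink_node($n2);
    print $n1->{next}{key}, "\n";    # prints "c": b is gone in O(1)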
* Cormen, Leiserson, Rivest 1990: Introduction to Algorithms. MIT Press, Cambridge, Massachusetts.


Re: A short meditation about hash search performance by Courage (Parson) on Nov 16, 2003 at 13:27 UTC 
The best guaranteed search performance, O(log(N)), could be reached by using a balanced tree. You could implement this as a Perl module, although in practice it would be much slower than the internal implementation, even with the latter's O(N) worst case.
Also, do not forget that adding to the search structure also counts, and I think adding to a hash costs O(1).
It is known that the qsort commonly used in the C library is also O(N*N) in the worst case and something like O(N*log(N)) in real life.
Also, the sorting method changed to mergesort in perl 5.8.0; it was qsort for earlier perls.
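If you want to play with that, 5.8 also ships a sort pragma whose (unofficial, underscore-prefixed) subpragmas let you ask for one algorithm or the other; a small sketch (note that much later perls ignore these hints):

    use strict;
    use warnings;
    use sort '_mergesort';      # ask for mergesort: O(N*log(N)) worst case, stable
    # use sort '_quicksort';    # or quicksort: fast on average, O(N*N) worst case

    my @data   = map { int rand 1_000_000 } 1 .. 50_000;
    my @sorted = sort { $a <=> $b } @data;
    print "smallest: $sorted[0], largest: $sorted[-1]\n";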
Best regards, Courage, the Cowardly Dog
Re: A short meditation about hash search performance by tilly (Archbishop) on Nov 17, 2003 at 16:40 UTC 
You appear to have a linguistic confusion about big-O notation, followed up by a common misconception of what O(1) means.
A function f(n) is O(1) if there are constants K and N such that for all n > N, f(n) < K. Plenty of functions other than straightforward constants meet that definition; f(n) = 5 + 1/n is O(1), for instance, without being a constant.
An algorithm is not big-O of anything. Only functions are. When it comes to hashes, the following three statements can be correct:
- The average hash search performance is O(1).
- The worst case hash search performance is O(n).
- The average hash search performance is O(n).
The first statement is why people use hashes. On average (and usually we are average) they are fast. The second is what you were pointing out as a correction explaining why the first is wrong. It doesn't correct it; it is an entirely distinct point. The third statement is true because big-O notation is only about the existence of upper bounds. I point it out to mention that common use of big-O notation among hackers is distinctly different from what you will find in Knuth and other official references.
I should note that, technically speaking, Perl's hash performance isn't big-O of anything. Perl's hashing algorithms break down for datasets that do not fit in current memory and cannot be addressed by 32- or 64-bit pointers (depending on the platform). I would have to look, but I think that the hash function won't scale beyond a billion or so buckets. Good luck finding someone who cares, though.
Silly trivia: following a pointer is not really O(1). The amount of work taken depends on how long the pointer is, and therefore on the size of the possible address space. People only notice this when they are on a platform where they have the choice of working with two different sizes of data representation, like 16 vs 32 bit, or 32 vs 64 bit. Going to the larger size brings with it a necessary speed hit.

Man, you were so hard on him. I think he meant that the functions behind the algorithms exhibit O(log2(n)) asymptotic performance. Big-O notation has many uses in other fields, but in computing it is used with its basic meaning: to remark on the asymptotic character of the enclosed expression for the function being referred to.
Man, it seemed you were dying to get the book off the shelf and spit out some theoretical speech. I think you got the point the first time, didn't you?
Peace

There are so many things to respond to here that I have no idea where to begin. So I'll list them randomly:
- Where do you get the O(log2(n)) from? With a flat memory model, hashing algorithms are not O(log2(n)).
- Reread my post and you'll see that I explicitly acknowledge the fact that hackers do not exactly use big-O notation in the way that Knuth did. Why do you think that I said otherwise?
- Reread the root post and you'll discover that pg both misunderstood big-O notation as used by Knuth, and as used by hackers. As far as most people are concerned, a hash lookup is O(1). That he thought otherwise was due to a misunderstanding on his part about what big-O notation means.
- Reread the root post and you'll find lots of incorrect attempted pedantry. When someone tries to get pedantic, I think it is fair and reasonable to be pedantic back.
Re: A short meditation about hash search performance by Jenda (Abbot) on Sep 10, 2007 at 16:16 UTC 
I believe you mean "the number of buckets", not the number of valid hash keys ... there is a catch though ... the m usually depends on n! How exactly depends, but I believe in the Perl builtin implementation it's 2*n rounded to the nearest power of 2. Which means your O(n/2m) becomes O(n/4n), which becomes O(1/4) and ... well ... constants are irrelevant in the O() notation, thus it's O(1). QED.
The average case of hash lookup, even with the naive implementation, is O(1); the worst case is O(n) for the naive implementation, but there are ways around that. They make the implementation more complex and insertion more costly, but they are possible. You can rehash once the list in a bucket gets too long, or you can use binary trees instead of lists in the buckets.
