http://www.perlmonks.org?node_id=275955


in reply to Re: Re: Re: Re: Slowness when inserting into pre-extended array
in thread Slowness when inserting into pre-extended array

The aspect of "authoritative" that I was talking about was your giving enough technical information, of a kind that is hard for others to judge, that it looked like you must know a lot about what you are talking about. It is independent of age. Of course I am doing the same thing, throwing out lots of technical information, but hopefully I actually understand it and will be able to answer any follow-up questions until you are comfortable that you understand it as well.

As for what is better than hashing, define "better". For key/value pairs in a flat memory model, hashing gives the best average performance of any simple data structure that I know of. That is why people use it. If an obviously superior alternative existed, then everyone would use that and nobody would have heard of hashing. But hashing has bad worst-case performance, doesn't keep things in order (often a requirement), and blows your cache badly. If you care about worst-case performance, about having things come out in order, or about memory cache effects, then alternatives can beat hashing. (My understanding from their docs is that Judy arrays can beat hashes because they do a lot of work to pay attention to memory cache effects. Similarly, for large data structures backed on disk, a BTREE beats hashing because you hit disk far less.)
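A trivial sketch of the "doesn't keep things in order" point, with made-up data: a Perl hash gives cheap lookups, but ordered traversal costs a sort every time, which is exactly what an ordered structure (B-tree, Judy array, sorted list) would give you for free.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %age = ( alice => 31, bob => 27, carol => 45, dave => 19 );

    # Average-case O(1) lookup is the hash's strength...
    print "bob is $age{bob}\n";

    # ...but iteration order is effectively arbitrary,
    print "unordered: ", join( " ", keys %age ), "\n";

    # so keys in order means sorting them every time (O(n log n)).
    print "ordered:   ", join( " ", sort keys %age ), "\n";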

And finally I assure you that I am not wrong about point 3. A hashing algorithm that is working well appears to throw keys into buckets randomly. The resulting distribution of numbers of keys per bucket is a Poisson distribution, with lambda depending on the ratio of numbers of keys to buckets. If you add too many keys, Perl increases the number of buckets so that lambda stays in a fixed range.
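To see where the Poisson distribution comes from, here is a small simulation (the bucket count, key count, and the uniform random "hash" are made-up parameters for the model, not Perl's actual internals) that throws keys into buckets at random and compares the observed occupancy counts to the Poisson prediction with lambda = keys/buckets:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Toy model: throw $keys keys into $buckets buckets uniformly at random
    # and compare the observed occupancy counts to the Poisson prediction.
    my $buckets = 10_000;
    my $keys    = 10_000;              # lambda = keys/buckets = 1
    my $lambda  = $keys / $buckets;

    my @count = (0) x $buckets;
    $count[ int rand $buckets ]++ for 1 .. $keys;

    my %occupancy;                     # how many buckets hold 0, 1, 2, ... keys
    $occupancy{$_}++ for @count;

    for my $n ( sort { $a <=> $b } keys %occupancy ) {
        # Poisson probability of a bucket holding exactly $n keys
        my $p = exp(-$lambda) * $lambda**$n / factorial($n);
        printf "%d keys/bucket: observed %5d, Poisson predicts %5.0f\n",
            $n, $occupancy{$n}, $p * $buckets;
    }

    sub factorial {
        my $n = shift;
        my $f = 1;
        $f *= $_ for 2 .. $n;
        return $f;
    }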

The fly in the ointment is the phrase "appears to throw keys into buckets randomly". As we all know, computers don't do "random" very well, and besides, when we see the same key later we had better look in the same bucket. There is therefore no way to guarantee that you won't have lots of keys in one bucket and the rest of the buckets empty. The odds against this happening by accident are astronomical, but it is not impossible.

And that is the worst case scenario - that the hashing algorithm throws all keys into one bucket. Which happens (almost) never by accident, but which a malicious attacker can try to make happen intentionally. And if someone does that, then it does Perl absolutely no good to provide the hash with a million buckets to use - everything is going into just one and the well-designed hash implementation degrades into scanning a huge linked list.
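To make the worst case concrete, here is a sketch with a deliberately naive, made-up hash function (sum of character codes modulo the bucket count); the "attacker" picks anagrams, so every key has the same character sum, lands in the same bucket, and lookups degrade into a linear scan of that one bucket's chain:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $buckets = 8;

    # A deliberately naive hash: sum of character codes mod bucket count.
    sub toy_hash {
        my ($key) = @_;
        my $sum = 0;
        $sum += ord $_ for split //, $key;
        return $sum % $buckets;
    }

    # An "attacker" picks keys that are anagrams of each other, so they
    # all have the same character sum and therefore the same bucket.
    my @evil = qw(abc acb bac bca cab cba);

    my @bucket;                        # each bucket holds a list of keys
    push @{ $bucket[ toy_hash($_) ] }, $_ for @evil;

    for my $i ( 0 .. $buckets - 1 ) {
        my $n = $bucket[$i] ? scalar @{ $bucket[$i] } : 0;
        print "bucket $i: $n keys\n";  # one bucket gets everything
    }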

Hence the defence of randomizing the hashing algorithm itself so that nobody can predict what hashing algorithm they are dealing with, and therefore what set of keys will give it problems...
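Here is a sketch of that defence against the same toy attack, not Perl's real algorithm: a Jenkins-style mix whose starting state is a secret per-run seed, so which bucket a key lands in depends on a value the attacker never sees, and a collision list precomputed against one seed is unlikely to pile up under another. (Perl 5.8.1 does something in this spirit; see PERL_HASH_SEED in perlrun.)

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $buckets = 8;

    # Toy "randomized" hash: a Jenkins-style mix seeded with a per-run
    # secret. Change the seed and the bucket assignments change too.
    sub seeded_hash {
        my ( $key, $seed ) = @_;
        my $h = $seed;
        for my $c ( map { ord } split //, $key ) {
            $h = ( $h + $c ) & 0xFFFFFFFF;
            $h = ( $h + ( $h << 10 ) ) & 0xFFFFFFFF;
            $h ^= $h >> 6;
        }
        $h = ( $h + ( $h << 3 ) ) & 0xFFFFFFFF;
        $h ^= $h >> 11;
        $h = ( $h + ( $h << 15 ) ) & 0xFFFFFFFF;
        return $h % $buckets;
    }

    # The anagram list that piled into one bucket above now scatters,
    # and the scattering differs from seed to seed, so an attacker who
    # cannot learn the seed cannot precompute a reliable collision set.
    my @keys = qw(abc acb bac bca cab cba);
    for my $seed ( 12345, 67890 ) {    # pretend each run picked a different seed
        print "seed $seed: ",
            join( " ", map { "$_=" . seeded_hash( $_, $seed ) } @keys ), "\n";
    }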
