Re: Re: Re: Slowness when inserting into pre-extended array

by tilly (Archbishop)
on Jul 19, 2003 at 20:41 UTC ( [id://275932] )


in reply to Re: Re: Slowness when inserting into pre-extended array
in thread Slowness when inserting into pre-extended array

I strongly suspect that you have fallen into the common trap of taking a fact that you know and attempting to apply it semi-randomly where it doesn't apply. This trap is particularly bad because you have enough half-digested knowledge to sound authoritative, which can help give rise to myths.

Therefore let me engage in some rumor control.

First, let me demonstrate that your theory cannot apply in this case. The worst-case hashing behaviour that you describe shows up very rarely unless someone is deliberately doing something to cause it - else hashes would not be useful. Furthermore, the original code creates many small hashes, and unless the nature of the data changes sharply, there is no reason to believe that hashes built from lines past line 20,000 are any different from hashes built from lines before then. Therefore this cannot be the explanation.

Secondly, allow me to explain why we don't want this meme to spread. It is bad because it makes people scared to use large hashes. That fear is misplaced - hashes normally work just fine until you run out of memory. That is not guaranteed behaviour - there are cases where a given hashing algorithm won't work - but unless you fear someone deliberately attacking you, there is no need to worry about using them.

Thirdly, allow me to clear up some fundamental misunderstandings that you seem to have. Hashes assign keys pseudo-randomly to buckets. Perl dynamically reallocates buckets so that the number of keys per bucket stays in a fixed range. Therefore, as long as the hashing algorithm works (Perl's generally will unless the keys are maliciously chosen), the odds of a collision stay the same no matter how many keys there are. There are always a few collisions, and searching through the buckets that have them is part of the average amortized cost of a hash lookup. Whether the hash is large or small plays no significant role.
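
You can watch that reallocation happen. Here is a minimal sketch, assuming a perl recent enough to ship Hash::Util's bucket_ratio() (on older perls a non-empty hash evaluated in scalar context reports the same used/allocated figure). It prints the bucket usage as the hash grows; the allocation keeps doubling so keys-per-bucket stays bounded:

    use strict;
    use warnings;
    use Hash::Util qw(bucket_ratio);    # ships with modern perls

    my %h;
    for my $n (1 .. 100_000) {
        $h{"key$n"} = $n;
        # bucket_ratio() reports "used buckets / allocated buckets".
        print "$n keys: buckets ", bucket_ratio(%h), "\n"
            if $n == 10 or $n == 100 or $n == 1_000 or $n == 10_000 or $n == 100_000;
    }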

But if someone is malicious, then the keys can all be made to land in the same bucket, and Perl's linked-list implementation for buckets (alternatives exist, but they have worse average performance, so they are not used) then gives quadratic performance. However, quadratic performance on only a small amount of data still runs just fine. To produce a noticeable performance problem you have to be malicious and you have to throw in lots of data.

Which is why, when you heard about this attack, you heard about lots of data: not because lots of data is normally bad for hashes, but because the attack only works if, among other things, you throw lots of data in.
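
Here is a sketch of that difference in growth rates. It fakes the degenerate single-bucket case with a flat list that has to be scanned on every insert - exactly the linked-list walk described above - and times it against a real hash (the helper name insert_degenerate is mine, just for illustration):

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    # Simulate "every key hashes to the same bucket": a single flat list
    # that must be scanned on every insert, like one long collision chain.
    sub insert_degenerate {
        my ($bucket, $key, $value) = @_;
        for my $pair (@$bucket) {                     # linear scan per insert
            if ($pair->[0] eq $key) { $pair->[1] = $value; return }
        }
        push @$bucket, [ $key, $value ];
    }

    for my $n (1_000, 2_000, 4_000, 8_000) {
        my (%hash, @bucket);

        my $t0 = time;
        $hash{"k$_"} = $_ for 1 .. $n;                # normal hash: roughly O(n) total
        my $hash_time = time - $t0;

        $t0 = time;
        insert_degenerate(\@bucket, "k$_", $_) for 1 .. $n;   # roughly O(n^2) total
        my $chain_time = time - $t0;

        printf "%6d keys: real hash %.4fs, single chain %.4fs\n",
            $n, $hash_time, $chain_time;
    }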

In summary, this performance attack is real. Go ahead and warn people about it. But it isn't something that you should look for right away when you run into performance problems.


Replies are listed 'Best First'.
Re: Re: Re: Re: Slowness when inserting into pre-extended array
by liz (Monsignor) on Jul 19, 2003 at 21:44 UTC
    If there was a trap, it was the trap of not reading the question carefully and zeroing in on the limit mentioned when I shouldn't have. I don't want to give rise to any myths, which is why I'm glad you're correcting me.

    As for sounding authoritative: I'm mainly older than many of my fellow Perl monks, but definitely not more experienced in the use of Perl. So:

    Hashes are OK for any number of keys until you run out of memory
    Even if that weren't the case, is there a better way to do it (without resorting to something like Judy)?

    However, I think you're wrong with respect to point 3. You're saying that the number of keys per bucket stays in a fixed range. But if Perl is always able to keep the number of keys per bucket roughly the same, how can there be a "worst case" scenario?

    Liz

      The aspect of "authoritative" that I was talking about was that you gave enough technical information, of a kind that is hard for others to judge, that it looked as though you must know a lot about what you were talking about. It is independent of age. Of course I am doing the same - throwing out lots of technical information - but hopefully I actually understand it and will be able to answer any follow-up questions until you are comfortable that you understand it as well.

      As for what is better than hashing, define "better". For key/value pairs in a flat memory model, hashing gives the best average performance of any simple data structure that I know of. That is why people use it. If an obviously superior alternative existed, then everyone would use that and nobody would have heard of hashing. But hashing has bad worst-case performance, doesn't keep things in order (often a requirement), and blows your cache badly. If you care about worst-case performance, about having things come out in order, or about memory cache effects, then alternatives can beat hashing. (My understanding from their docs is that Judy arrays can beat hashes because they do a lot of work to pay attention to memory cache effects. Similarly, for large data structures backed on disk, a BTREE beats hashing because you hit the disk far less.)
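
      For a concrete example of the on-disk case, here is a sketch using DB_File's BTREE mode (it assumes DB_File and an underlying Berkeley DB library are installed; the file name data.btree is just made up for the example). Lookups cost a handful of page accesses, and the keys come back in sorted order, which a plain hash will not give you:

          use strict;
          use warnings;
          use DB_File;
          use Fcntl;

          # Tie a hash to an on-disk BTREE instead of keeping it in memory.
          tie my %db, 'DB_File', 'data.btree', O_RDWR | O_CREAT, 0666, $DB_BTREE
              or die "Cannot tie data.btree: $!";

          $db{ sprintf "key%05d", $_ } = $_ for 1 .. 1_000;

          # Iteration comes back in key order - a property hashes don't have.
          my $shown = 0;
          for my $key (keys %db) {
              print "$key => $db{$key}\n";
              last if ++$shown >= 5;
          }

          untie %db;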

      And finally, I assure you that I am not wrong about point 3. A hashing algorithm that is working well appears to throw keys into buckets randomly. The resulting distribution of keys per bucket is a Poisson distribution, with lambda determined by the ratio of keys to buckets. If you add too many keys, Perl increases the number of buckets so that lambda stays in a fixed range.
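
      If you want to convince yourself of that, here is a sketch that throws keys into buckets with plain rand() and compares the observed fraction of empty buckets with the Poisson prediction exp(-lambda). The bucket and key counts are made up for the illustration; it models the idea rather than peeking inside perl's real hash:

          use strict;
          use warnings;

          my $buckets = 8_192;
          my $keys    = 8_192;               # lambda = keys/buckets = 1
          my $lambda  = $keys / $buckets;

          # Assign each key to a bucket uniformly at random.
          my @count = (0) x $buckets;
          $count[ int rand $buckets ]++ for 1 .. $keys;

          my $empty = grep { $_ == 0 } @count;
          my ($max) = sort { $b <=> $a } @count;

          # A Poisson(lambda) model predicts a fraction exp(-lambda) of
          # empty buckets, and the largest bucket stays small.
          printf "empty buckets: observed %.3f, Poisson predicts %.3f\n",
              $empty / $buckets, exp(-$lambda);
          printf "largest bucket holds %d keys\n", $max;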

      The fly in the ointment is the phrase "appears to throw keys into buckets randomly". As we all know, computers don't do "random" very well, and besides, when we see the same key later we had better land in the same bucket. There is therefore no way to guarantee that you won't get lots of keys in one bucket and the rest of the buckets empty. The odds against this happening by accident are astronomical, but it is not impossible.

      And that is the worst case scenario - that the hashing algorithm throws all keys into one bucket. Which happens (almost) never by accident, but which a malicious attacker can try to make happen intentionally. And if someone does that, then it does Perl absolutely no good to provide the hash with a million buckets to use - everything is going into just one and the well-designed hash implementation degrades into scanning a huge linked list.

      Hence the defence of randomizing the hashing algorithm itself so that nobody can predict what hashing algorithm they are dealing with, and therefore what set of keys will give it problems...
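
      Later perls (5.18 onwards, if I remember the release right) adopted exactly that defence: each process picks its own hash seed. A quick sketch to see the effect - run it twice and the key order will normally differ between runs:

          use strict;
          use warnings;

          # Run this twice.  With a per-process random hash seed the two
          # runs will usually print the keys in different orders.  Setting
          # PERL_HASH_SEED in the environment pins the seed (handy when
          # you need repeatable iteration order for debugging).
          my %h = map { $_ => 1 } 'a' .. 'j';
          print join(',', keys %h), "\n";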

      Hashes are OK for any number of keys until you run out of memory
      That's not actually universally true: large numbers of hash keys trigger pathological behaviour on many versions of glibc. The bug has been fixed, but a lot of those systems are still out there because people are understandably loath to update such a fundamental part of their system.
