Re: Re: Re: Slowness when inserting into pre-extended array
by tilly (Archbishop) on Jul 19, 2003 at 20:41 UTC ( [id://275932] )
I strongly suspect that you have fallen into the common trap of taking a fact that you know and attempting to apply it semi-randomly where it doesn't apply. This trap is particularly bad because you have enough half-digested knowledge to sound authoritative, which can help give rise to myths. Therefore let me engage in some rumor control.

First, let me demonstrate that your theory cannot apply in this case. The worst-case hashing behaviour that you describe shows up very rarely unless someone is deliberately trying to cause it - otherwise hashes would not be useful. Furthermore, the original code creates many small hashes, and unless the nature of the data changes sharply, there is absolutely no reason to believe that hashes created from lines past line 20,000 are any different from hashes created from lines before then. Therefore this cannot apply.

Secondly, allow me to explain why we don't want your meme spread. It is bad because it makes people scared to use large hashes. That fear is misplaced - hashes normally work just fine until you run out of memory. That is not guaranteed behaviour - there are cases where a given hashing algorithm won't work - but unless you fear someone being malicious to you, there is no need to worry about using them. (The first sketch at the end of this note makes that point concrete.)

Thirdly, allow me to explain some fundamental misunderstandings that you seem to have. Hashes assign keys pseudo-randomly to buckets. Perl dynamically reallocates buckets so that the number of keys per bucket stays in a fixed range; the second sketch below shows that growth in action. Therefore, if the hashing algorithm works (Perl's generally will unless the keys are maliciously chosen), the odds of a collision stay the same no matter how many keys there are. There are always a few collisions, and searching through the buckets that hold them is part of the average amortized cost of a hash lookup. Large or small plays no significant role. But if someone is malicious, then the keys all go into the same bucket, and with Perl's linked-list implementation of buckets (alternatives exist, but they have worse average performance, so they are not used) that causes quadratic performance. However, quadratic performance on a small amount of data still performs just fine. To produce a visible performance problem you must both be malicious and throw in lots of data. That is why, when you heard about this attack, you heard about lots of data - not because lots of data is normally bad for hashes, but because the attack only works if, among other things, you feed it lots of data.

In summary, this performance attack is real. Go ahead and warn people about it. But it isn't something that you should look for right away when you run into performance problems.
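To make the "hashes normally work just fine" point concrete, here is a minimal timing sketch. It is my own illustration rather than anything measured in this thread: the sizes are arbitrary, and it uses the core Time::HiRes module for wall-clock timing.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::HiRes qw(time);

    # Time bulk inserts at several hash sizes.  If ordinary keys
    # degraded in large hashes, the per-insert cost would climb
    # with size; in practice it stays roughly flat.
    for my $size (10_000, 100_000, 1_000_000) {
        my %h;
        my $start = time();
        $h{$_} = 1 for 1 .. $size;
        my $elapsed = time() - $start;
        printf "%9d keys: %.2f microseconds per insert\n",
            $size, 1_000_000 * $elapsed / $size;
    }

A quadratic bucket pile-up would instead make each row roughly an order of magnitude worse per insert than the row before it, since the sizes grow by a factor of ten.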
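And to watch the dynamic bucket reallocation directly, here is a second sketch. Again this is illustrative, and note the hedge in the comments: the scalar-context bucket report is version-dependent behaviour.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Watch Perl grow the bucket array as keys are added, which is
    # what keeps the number of keys per bucket in a fixed range.
    # Evaluating a hash in scalar context reported "used/allocated"
    # buckets in Perls of this era; Perl 5.26 changed it to return
    # the key count, so on a modern Perl use
    # Hash::Util::bucket_ratio(%h) for the same report.
    my %h;
    for my $n (1 .. 100_000) {
        $h{"key$n"} = 1;
        # Report at 1, 10, 100, 1000, ... keys.
        print "$n keys -> buckets used/allocated: ", scalar(%h), "\n"
            if $n =~ /\A10*\z/;
    }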