PerlMonks

Re: Slowness when inserting into pre-extended array

by Taulmarill (Deacon)
on Jul 19, 2003 at 19:33 UTC [id://275922]


in reply to Slowness when inserting into pre-extended array

don't know if it helps (I'm kind of a newbie too), but what do you think about

while (<DATA>) { $line = $_; ...do some stuff... }
instead of storing all data in @FILE.
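
For illustration, here is a minimal sketch of the two approaches (the file name and the processing step are placeholders): slurping keeps every line in memory for the whole run, while the while-loop only ever holds the current line.

#!/usr/bin/perl
use strict;
use warnings;

# Slurping: the whole file is held in @FILE for the entire run.
open my $fh, '<', 'data.txt' or die "open: $!";
my @FILE = <$fh>;
for my $line (@FILE) {
    # ...do some stuff...
}
close $fh;

# Streaming: only the current line is in memory at any time.
open $fh, '<', 'data.txt' or die "open: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    # ...do some stuff...
}
close $fh;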

Replies are listed 'Best First'.
Re: Re: Slowness when inserting into pre-extended array
by liz (Monsignor) on Jul 19, 2003 at 20:02 UTC
    This would solve some of the bloat, but probably not too much of the speed decrease that you've seen.

    What I think you're running into is a basic shortcoming of hashes in general, and of Perl's in particular. Each hash key is internally "hashed" into a 4-byte integer value, and these hashed keys are then put in a sorted list, so a binary search can quickly check whether the key already exists or not.

    Ideally, every different key should have a different hash value. Reality, however, shows that with huge numbers of keys, the chance of two different keys getting the same hash value is actually quite large. This situation is handled by keeping a list of all the keys that share that hash value. In some "worst case" scenarios, you might actually wind up with a single hash value for all of the keys.

    The search in the linked list is linear and therefore needs more and more time for each key added.

    Now, the distribution of the hash keys in your dataset may have some worst cases in it.

    So how can this be fixed? Good question. You could try changing the keys by adding some fixed string to each of them. This should change the distribution of the hash values, hopefully reducing the worst case.
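
    As an illustration of that suggestion (not a fix guaranteed to help in your case), here is a small sketch that stores the same keys plain and with a fixed prefix, then prints each hash in scalar context. On the 5.8-era Perls discussed in this thread that shows used/allocated buckets, as the perldata quote further down explains (newer Perls simply return the key count). The key names and the "X" prefix are made up for the example.

        #!/usr/bin/perl
        use strict;
        use warnings;

        my (%plain, %prefixed);
        for my $i (1 .. 10_000) {
            my $key = "record_$i";
            $plain{$key}       = 1;
            $prefixed{"X$key"} = 1;   # same key with a fixed string added
        }

        # Prefixing changes which bucket each key lands in, so the two
        # used/allocated reports will generally differ.
        print "plain:    ", scalar(%plain),    "\n";
        print "prefixed: ", scalar(%prefixed), "\n";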

    If you're not stuck with the current Perl version, you might want to try out 5.8.1-RC1, as this has a new feature that will randomize the distribution of the hash keys. This in itself will not reduce the worst case scenario, but will make it a moving target, so you won't have that bad performance all of the time.

    This randomization was introduced because the kind of delay you experienced can actually be used to mount a Denial of Service attack against CGI scripts.

    Hope this helps.

    Liz

      I strongly suspect that you have fallen into the common trap of taking a fact that you know and attempting to apply it semi-randomly where it doesn't apply. This trap is particularly bad because you have enough half-digested knowledge to sound authoritative, which can help give rise to myths.

      Therefore let me engage in some rumor control.

      First let me demonstrate that your theory cannot apply in this case. The worst-case hashing behaviour that you describe shows up very rarely when someone isn't doing something deliberately to cause it - else hashes would not be useful. Furthermore, the original code creates many small hashes, and unless the nature of the data changes sharply, there is absolutely no reason to believe that hashes created from lines past line 20,000 are any different from hashes created from lines before then. Therefore this cannot apply.

      Secondly, allow me to explain why we don't want this meme spread. It is bad because it makes people scared to use large hashes. That fear is inappropriate - hashes normally work just fine until you run out of memory. That is not guaranteed behaviour - there are cases where a given hashing algorithm won't work - but unless you fear someone being malicious to you, there is no need to worry about using them.

      Thirdly, allow me to explain some fundamental misunderstandings that you seem to have. Hashes assign keys pseudo-randomly to buckets. Perl dynamically reallocates buckets so that the number of keys per bucket stays within a fixed range. Therefore, if the hashing algorithm works (Perl's generally will unless the keys are maliciously chosen), the odds of a collision stay the same no matter how many keys there are. There are always a few collisions; searching through the buckets that have them is part of the average amortized cost of a hash lookup. Whether the hash is large or small plays no significant role.
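
      A quick way to watch that bucket-splitting in action is a sketch like the following (the key names and counts are arbitrary; the used/allocated figures from a hash in scalar context apply to the 5.8-era Perls under discussion):

          #!/usr/bin/perl
          use strict;
          use warnings;

          # As keys are added, Perl grows the bucket array so that the number
          # of keys per bucket stays bounded.
          my %h;
          for my $i (1 .. 1_000_000) {
              $h{"key$i"} = $i;
              print "$i keys -> buckets (used/allocated): ", scalar(%h), "\n"
                  if $i == 1_000 || $i == 10_000 || $i == 100_000 || $i == 1_000_000;
          }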

      But if someone is malicious, then the keys are all going to go into the same bucket. And with Perl's linked-list implementation for buckets (alternatives to that exist - they have worse average performance though, so they are not used), that causes a quadratic performance problem. However, quadratic performance with only a small amount of data performs just fine. To run into a performance problem you have to both be malicious and throw lots of data in.
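
      To make that concrete, here is an illustration only - it does not touch Perl's internals and it does not construct real colliding keys. It simulates one overloaded bucket with a plain array that is scanned linearly before every insert, which is why the total cost grows quadratically, and compares that against a normal hash:

          #!/usr/bin/perl
          use strict;
          use warnings;
          use Time::HiRes qw(time);

          for my $n (2_000, 4_000, 8_000) {
              my @bucket;   # stand-in for a single degenerate bucket
              my %hash;     # normal hash for comparison

              my $t0 = time;
              for my $i (1 .. $n) {
                  my $key = "k$i";
                  # linear membership scan before every insert -> O(n**2) total
                  push @bucket, $key unless grep { $_ eq $key } @bucket;
              }
              my $t1 = time;

              $hash{"k$_"} = 1 for 1 .. $n;
              my $t2 = time;

              printf "%6d keys: degenerate bucket %.2fs, real hash %.3fs\n",
                  $n, $t1 - $t0, $t2 - $t1;
          }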

      Which is why, when you heard about this attack, you heard about lots of data: not because lots of data is normally bad for hashes, but because this attack will only work if, among other things, you throw lots of data into it.

      In summary, this performance attack is real. Go ahead and warn people about it. But it isn't something that you should look for right away when you run into performance problems.

        If there was a trap, it was the trap of not reading the question correctly and zeroing in on the limit mentioned when I shouldn't have. I don't want to raise any myths, which is why I'm glad you're correcting me.

        And as for sounding authoritative: I'm mainly older than many of my fellow Perl monks, but definitely not more experienced in the use of Perl. So:

        Hashes are OK for any number of keys until you run out of memory. Even if that weren't the case, is there a better way to do it (without resorting to something like Judy)?

        However, I think you're wrong with respect to point 3. On the one hand you're saying that the number of keys per bucket stays within the same range; then how can there be a "worst case" scenario? If Perl were always able to keep the number of keys per bucket roughly the same, how could a worst case arise?

        Liz

      This doesn't seem to have anything at all to do with hashes, as the only difference between the slow and fast versions of the program is that the slow version pushes onto an array while the fast version doesn't. Given Perl's behaviour with array extension, I'd pretty much guarantee that the problem is the array push, with a lesser possibility being process-memory freelist fragmentation from all the hash allocations, but I doubt that one.
        Unless Perl's implementation of push has changed since I last looked at it, I strongly doubt that push itself would cause performance problems. The power-of-two allocation mechanism that it uses means that, until you run out of memory, the amortized reallocation cost of running out of space is constant per element added.

        Furthermore, when I wrote a test script (Perl 5.8.0 running on Linux) that did what he did, except that I made each line the same, I saw absolutely no performance degradation until my machine began paging data to disk. Even then it was a barely measurable slowdown until I got bored with watching it run. This strongly suggests that we are not looking at a poor internal Perl algorithm. (My past experience tells me that Linux pages very efficiently until it runs out of RAM and falls over. The Windows NT line - XP included - starts showing dramatic slowdowns well before you would think it should be in trouble, but it is very hard to get it to fall over.)
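
        Here is a rough reconstruction of that kind of test script; the field names, tab-separated format, and counts are made up for illustration, and you can raise the row count to push it toward the paging regime:

            #!/usr/bin/perl
            use strict;
            use warnings;
            use Time::HiRes qw(time);

            # Build a hash from the same line over and over, push a reference
            # onto an array, and report how long each batch of pushes takes.
            my $line = join "\t", map { "field$_" } 1 .. 10;
            my @rows;
            my $t0 = time;
            for my $i (1 .. 500_000) {
                my @f   = split /\t/, $line;
                my %rec = ( a => $f[0], b => $f[1], c => $f[2] );
                push @rows, \%rec;
                if ( $i % 50_000 == 0 ) {
                    printf "%8d rows: %.2fs for this batch\n", $i, time - $t0;
                    $t0 = time;
                }
            }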

      If you're concerned that this might be the case, try printing scalar(%hash). From perldata(1):
      If you evaluate a hash in scalar context, it returns false if the hash is empty. If there are any key/value pairs, it returns true; more precisely, the value returned is a string consisting of the number of used buckets and the number of allocated buckets, separated by a slash. This is pretty much useful only to find out whether Perl's internal hashing algorithm is performing poorly on your data set. For example, you stick 10,000 things in a hash, but evaluating %HASH in scalar context reveals "1/16", which means only one out of sixteen buckets has been touched, and presumably contains all 10,000 of your items. This isn't supposed to happen.
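
      A minimal example of that check (the key names are arbitrary; on Perls newer than the ones discussed here, a hash in scalar context simply returns the number of keys instead of the bucket figures):

          #!/usr/bin/perl
          use strict;
          use warnings;

          my %hash;
          $hash{"item$_"} = 1 for 1 .. 10_000;

          # Healthy output looks like "used/allocated" with used well above 1;
          # something like "1/16" would mean the keys are all clustering.
          print scalar(%hash), "\n";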
