in reply to Does "preallocating hash improve performance"? Or "using a hash slice"?

If the hash is big enough it does make a difference:

    C:\test>p1
    $t = time; my %h; keys %h = 10e6; $h{$_}=1 for 1 .. 10e6; printf "%.6f\n", time() -$t;
    15.792654
    $t = time; my %i; $i{$_}=1 for 1 .. 10e6; printf "%.6f\n", time() -$t;
    18.772236

That's about an 18% saving. And if you use hashes with hundreds of millions of keys, as I routinely do, the savings get bigger.

Perl's hashes start off with room for 8 keys and then double, and double again, as they grow. Each time the hash doubles, new memory is allocated, the keys must be rehashed, and the values copied over. Preallocating a 10-million-key hash avoids that process 21 times.
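That doubling count is easy to verify with a little arithmetic (assuming the 8-bucket starting size described above):

```perl
use strict;
use warnings;

# Perl hashes start with 8 buckets and double as needed;
# count how many doublings a 10-million-key hash goes through.
my ( $buckets, $doublings ) = ( 8, 0 );
while ( $buckets < 10_000_000 ) {
    $buckets *= 2;
    ++$doublings;
}
print "$doublings doublings to reach $buckets buckets\n";    # 21 doublings
```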


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Does "preallocating hash improve performance"? Or "using a hash slice"?
by Eily (Monsignor) on Feb 20, 2017 at 17:20 UTC

    While this is of course true, you missed vr's point about hash slices. With @h{ @keys } = @values; perl could use the size of the keys list to preallocate the hash. I guess this would turn out to be a waste of space if most of the keys are duplicates, which would explain why setting all the keys "at once" still seems to require several reallocations.
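    For reference, the slice assignment under discussion looks like this (a minimal sketch with made-up keys and values):

```perl
use strict;
use warnings;

my @keys   = qw( apple banana cherry );
my @values = ( 1, 2, 3 );

my %h;
@h{@keys} = @values;    # hash slice: assigns all pairs in one statement

print "$_ => $h{$_}\n" for sort keys %h;
```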

      you missed vr's point about hash slices.

      Not so much "missed it" as chose not to address it.

      With @h{ @keys } = @values; perl could use the size of the keys list to preallocate the hash

      The problem with that is that, in order to gain the 18% saving from the preallocation of my 10,000,000-key hash, the hash slice would have to:

      1. Allocate stack space to accommodate 20,000,000 items on the stack.
      2. Turn the 10,000,000 element array, @keys, into a 10,000,000 item list on the stack.
      3. Turn the 10,000,000 element array, @values, into a 10,000,000 item list, also on the stack.
      4. Process those two lists in parallel adding each key/value pair individually; doubling (and redoubling), hashing (and rehashing) as required.
      5. Free the space required to accommodate the 20,000,000 items on the stack back to the memory pool.

      (It's a lot more complicated than that, but this is sufficiently accurate to make my point.)

      So, now the process not only has two copies of all the keys and values (in the arrays as well as the hash), it also has allocated enough space to store them all again on the stack.

      So that means the peak memory usage required to do the slice assignment is at least 4 times more than is eventually required to construct the hash itself; and the hash was not preallocated either!

      Slice assignment can be more efficient for relatively small hashes provided it is necessary to the rest of the code to have the keys and values stored in arrays as well. But if you are loading them into arrays, simply so that you can use list assignment to construct a hash from them, it is a disaster in terms of both memory usage and processing time.

      And for anything larger than pretty modest-sized hashes (a few 10s or low 100s of key/value pairs), the cost (memory and time) of building the two lists on the stack far outweighs any gain from moving the loop into C. Especially as the potential benefit of having the hash pre-sized isn't realised. (It probably could be, but it isn't.)
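      To make the contrast concrete, here is a sketch of the two approaches (the keys, values and input format are invented for illustration; a real program would read from a file):

```perl
use strict;
use warnings;

my @records = map { "key$_\tvalue$_" } 1 .. 5;    # stand-in for file input

# Approach 1: build the hash directly, one pair at a time.
# Peak memory: just the hash (plus one record at a time).
my %direct;
for my $line (@records) {
    my ( $k, $v ) = split /\t/, $line;
    $direct{$k} = $v;
}

# Approach 2: gather keys and values into arrays, then slice-assign.
# Peak memory: two arrays + two stack lists + the hash.
my ( @keys, @values );
for my $line (@records) {
    my ( $k, $v ) = split /\t/, $line;
    push @keys,   $k;
    push @values, $v;
}
my %sliced;
@sliced{@keys} = @values;

print scalar( keys %direct ), " == ", scalar( keys %sliced ), "\n";    # 5 == 5
```

      Both end with the same hash; the difference is in how much transient storage the second one needs along the way.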



        I hope perl is not so inefficient that it copies all the content of a list when it needs to process it, when simply pointing to the already allocated elements would be enough. Even more so with arrays. I'd also expect some COW mechanism to remove a lot of the allocating and copying. Keys might be more trouble though, as all data stored in them has to be stringified first (and I wouldn't be surprised to learn that hash keys always hold a copy of the initial string, since as far as I can tell they are not standard scalar values). I agree that the memory usage and number of copies are certainly higher when you go all the way to slicing, but I don't expect "at least 4 times more" memory. Actually, I remember finding out in a post a while ago that having the keys and values in two arrays takes more space than just having them in a hash (which kind of makes sense: order is extra information, although technically I think it's because keys can afford to just be strings, without all the extras of ordinary scalar values).
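        One part of this is easy to demonstrate: a hash key does hold its own copy of the string, independent of the scalar used to create it:

```perl
use strict;
use warnings;

my $str = "abc";
my %h   = ( $str => 1 );

$str .= "def";    # modifying the original scalar...

print join( ",", keys %h ), "\n";    # ...leaves the stored key untouched: abc
```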

        Not so much "missed it" as chose not to address it.
        Well, that's too bad; you did have an interesting answer: that if you need to gather your data in some form before populating your hash, you shouldn't expect to obtain the best result memory-wise and time-wise (because of the copies). So you probably don't need to worry about preallocation if you don't need to worry about slicing.