Pathologically Eclectic Rubbish Lister PerlMonks

### When does Perl double the number of buckets in hash?

 on Nov 30, 2011 at 22:13 UTC Need Help??
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
I've been trying to figure out when Perl decides to double the total number of buckets for hash. According to the article "How Hashes Really Work" by Abhijit Menon-Sen, this happens when "the number of keys exceeds the number of buckets". I wrote a simple script to test that assumption.
```\$old_total_bucket = 8;
\$size = 0;
for (\$i = 0; \$i < 65537 ; \$i++ ) {
\$size++;
\$words{"\$i"} = 1;
\$c = %words;
(\$fill_bucket,\$total_bucket) = split(/\//,\$c);
if(\$total_bucket != \$old_total_bucket) {
\$old_total_bucket = \$total_bucket;
print "For the \$size keys, hash parameters are: \$c \n";
}
}
I print out the number of keys for all cases when the total number of buckets doubles. I use version 5.10.0. Initial part of the output is:

For the 9 keys, hash parameters are: 8/16
For the 16 keys, hash parameters are: 15/32
For the 33 keys, hash parameters are: 29/64
For the 64 keys, hash parameters are: 53/128

It seems sometimes Perl doubles the number of buckets when the number of keys is greater than the number of buckets. Sometimes the doubling occurs when the number of keys is equal to the number of buckets (say, for 16 and 64 keys). And there are cases when the doubling occurs when the number of keys is greater than the number of buckets plus one (for 32770 = 2^15 + 2 keys). I tried looking into hv.c file (for v5.12.2) and found the following lines:
```    } else if (xhv->xhv_keys > (IV)xhv->xhv_max) {
hsplit(hv);
So this seems to indicate that doubling occurs when "the number of keys exceeds the number of buckets". However, the output of my script shows that could be some other criterion involved. So when does Perl double the number of buckets in hash? Thanks a lot for any suggestions!

Replies are listed 'Best First'.
Re: When does Perl double the number of buckets in hash?
by BrowserUk (Pope) on Nov 30, 2011 at 22:37 UTC

Remember that keys != buckets. Sometimes, and more frequently than you might imagine especially with small hashes, two or more keys will hash to the same bucket.

That said, from a cursory inspection, it seems to be when the number of buckets in use exceeds 2/3rds of the buckets available:

```++\$h{\$_ } and print scalar keys %h, ' : ', scalar %h for 'aaaa' .. 'zz
+zz';;

1  :  1/8
2  :  2/8
3  :  2/8           ## hash colision
4  :  2/8           ## hash colision
5  :  3/8
6  :  4/8
7  :  5/8           ## 5/8ths
8  :  6/16
9  :  6/16
10  :  7/16
11  :  7/16        ## hash colision
12  :  8/16
13  :  9/16
14  :  9/16        ## hash colision
15  :  9/16        ## hash colision
16  :  10/16       ## 5/8ths
17  :  15/32
18  :  15/32       ## hash colision
19  :  16/32
20  :  16/32       ## hash colision
21  :  16/32       ## hash colision
22  :  17/32
23  :  17/32       ## hash colision
24  :  18/32
25  :  19/32
26  :  19/32       ## hash colision
27  :  20/32
28  :  20/32       ## hash colision
29  :  20/32       ## hash colision
30  :  20/32       ## hash colision ## 5/8ths
31  :  21/32
32  :  25/64
...
60  :  40/64
61  :  40/64       ## hash colision
62  :  40/64       ## hash colision 5/8ths
63  :  41/64       ##  2/3rds
64  :  51/128
...
128  :  85/128     ## 2/3rds
129  :  106/256
130  :  107/256

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

As shown by the code the OP posted, it actually happens when the number of elements (including the newly inserted element) is equal to the number buckets. What the OP missed is that it only happens if there's a collision.

```1  :  1/8
2  :  2/8
3  :  2/8           ## hash colision
4  :  2/8           ## hash colision
5  :  3/8
6  :  4/8
7  :  5/8
8  :  6/16         ## 8 == 8 => split
9  :  6/16
10  :  7/16
11  :  7/16        ## hash colision
12  :  8/16
13  :  9/16
14  :  9/16        ## hash colision
15  :  9/16        ## hash colision
16  :  10/16       ## 16 == 16 => split
17  :  15/32
18  :  15/32       ## hash colision
19  :  16/32
20  :  16/32       ## hash colision
21  :  16/32       ## hash colision
22  :  17/32
23  :  17/32       ## hash colision
24  :  18/32
25  :  19/32
26  :  19/32       ## hash colision
27  :  20/32
28  :  20/32       ## hash colision
29  :  20/32       ## hash colision
30  :  20/32       ## hash colision
31  :  21/32
32  :  25/64       ## 32 == 32 => split
...
60  :  40/64
61  :  40/64       ## hash colision
62  :  40/64       ## hash colision 5/8ths
63  :  41/64
64  :  51/128      ## 64 == 64 => split
...
128  :  85/128
129  :  106/256    ## 128 == 128 => split
130  :  107/256

There's a second condition that causes a split: A degenerate hash is detected. A degenerate hash is one that has a bucket with so many element as to make it slow to find keys in that bucket. I didn't try to determine the exact condition for when this occurs.

Indeed. Thank you for the correction. I guess I was concentrating on the wrong column.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Re: When does Perl double the number of buckets in hash?
by ikegami (Pope) on Dec 01, 2011 at 00:49 UTC

There's no reason to split after inserting into an empty bucket. A split only happens if 1) there's a collision (counter is true) and 2) the number of elements (including the newly inserted element) is greater than the number of buckets (xhv->xhv_keys > (IV)xhv->xhv_max> is true).

Re: When does Perl double the number of buckets in hash?
by Anonymous Monk on Dec 02, 2011 at 06:18 UTC
for the following code: %hash=(1,2,3,4,5,6,7,8,9,10,11,12,15,16,22,34,88,99); \$val=%hash; print \$val; the output is :7/8 can some-one Explain???
the output is :7/8 can some-one Explain???

No! It is easy to see that keys 11 & 15 each hash to the same bucket as one of the previous keys, as the buckets used number doesn't increase when they are added:

```\$hash{ \$_->[0] } = \$_->[1] and print "@\$_\t", scalar keys %hash, scala
+r %hash
for [1,2],[3,4],[5,6],[7,8],[9,10],[11,12],[15,16],[22,34],[88,99]
+;;
1 2      1 1/8
3 4      2 2/8
5 6      3 3/8
7 8      4 4/8
9 10     5 5/8
11 12    6 5/8 *
15 16    7 5/8 *
22 34    8 6/8
88 99    9 7/8 ???

But why the size of the hash was not doubled when the key count equalled the number of buckets is not clear. It seems to suggest that this is also not the complete story here?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

In your example, doubling does not occur because there is no collision. As I understand, the Perl code in hv.c first checks if there is a collision. If there is, the code compares the total number of keys (including the one we just added) with the number of buckets. If the former is greater or equal to the latter, then the number of buckets is doubled. Otherwise, Perl checks the total number of keys in that particular bucket. If there are more than HV_MAX_LENGTH_BEFORE_SPLIT (set to 14) keys in that bucket, the number of buckets is also doubled.

I still don't understand one thing, though. In hv.c code we have the comparison (xhv->xhv_keys > (IV)xhv->xhv_max). This seems to suggest that doubling occurs only when the number of keys (including the new one) is more than the number of buckets. But as I shown in the original post, the doubling can occur even when the number of keys equals to the number of buckets.

As you can see, the used number of buckets went up (6⇒7), so a collision did not occur, so there was no reason for a split.

Create A New User
Node Status?
node history
Node Type: perlquestion [id://940953]
Approved by philipbailey
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (11)
As of 2017-10-23 14:01 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
My fridge is mostly full of:

Results (279 votes). Check out past polls.

Notices?