Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

When does Perl double the number of buckets in hash?

by Anonymous Monk
on Nov 30, 2011 at 22:13 UTC ( #940953=perlquestion: print w/ replies, xml ) Need Help??
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
I've been trying to figure out when Perl decides to double the total number of buckets for hash. According to the article "How Hashes Really Work" by Abhijit Menon-Sen, this happens when "the number of keys exceeds the number of buckets". I wrote a simple script to test that assumption.
$old_total_bucket = 8; $size = 0; for ($i = 0; $i < 65537 ; $i++ ) { $size++; $words{"$i"} = 1; $c = %words; ($fill_bucket,$total_bucket) = split(/\//,$c); if($total_bucket != $old_total_bucket) { $old_total_bucket = $total_bucket; print "For the $size keys, hash parameters are: $c \n"; } }
I print out the number of keys for all cases when the total number of buckets doubles. I use version 5.10.0. Initial part of the output is:

For the 9 keys, hash parameters are: 8/16
For the 16 keys, hash parameters are: 15/32
For the 33 keys, hash parameters are: 29/64
For the 64 keys, hash parameters are: 53/128

It seems sometimes Perl doubles the number of buckets when the number of keys is greater than the number of buckets. Sometimes the doubling occurs when the number of keys is equal to the number of buckets (say, for 16 and 64 keys). And there are cases when the doubling occurs when the number of keys is greater than the number of buckets plus one (for 32770 = 2^15 + 2 keys). I tried looking into hv.c file (for v5.12.2) and found the following lines:
} else if (xhv->xhv_keys > (IV)xhv->xhv_max) { hsplit(hv);
So this seems to indicate that doubling occurs when "the number of keys exceeds the number of buckets". However, the output of my script shows that could be some other criterion involved. So when does Perl double the number of buckets in hash? Thanks a lot for any suggestions!

Comment on When does Perl double the number of buckets in hash?
Select or Download Code
Re: When does Perl double the number of buckets in hash?
by BrowserUk (Pope) on Nov 30, 2011 at 22:37 UTC

    Remember that keys != buckets. Sometimes, and more frequently than you might imagine especially with small hashes, two or more keys will hash to the same bucket.

    That said, from a cursory inspection, it seems to be when the number of buckets in use exceeds 2/3rds of the buckets available:

    ++$h{$_ } and print scalar keys %h, ' : ', scalar %h for 'aaaa' .. 'zz +zz';; 1 : 1/8 2 : 2/8 3 : 2/8 ## hash colision 4 : 2/8 ## hash colision 5 : 3/8 6 : 4/8 7 : 5/8 ## 5/8ths 8 : 6/16 9 : 6/16 10 : 7/16 11 : 7/16 ## hash colision 12 : 8/16 13 : 9/16 14 : 9/16 ## hash colision 15 : 9/16 ## hash colision 16 : 10/16 ## 5/8ths 17 : 15/32 18 : 15/32 ## hash colision 19 : 16/32 20 : 16/32 ## hash colision 21 : 16/32 ## hash colision 22 : 17/32 23 : 17/32 ## hash colision 24 : 18/32 25 : 19/32 26 : 19/32 ## hash colision 27 : 20/32 28 : 20/32 ## hash colision 29 : 20/32 ## hash colision 30 : 20/32 ## hash colision ## 5/8ths 31 : 21/32 32 : 25/64 ... 60 : 40/64 61 : 40/64 ## hash colision 62 : 40/64 ## hash colision 5/8ths 63 : 41/64 ## 2/3rds 64 : 51/128 ... 128 : 85/128 ## 2/3rds 129 : 106/256 130 : 107/256

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      As shown by the code the OP posted, it actually happens when the number of elements (including the newly inserted element) is equal to the number buckets. What the OP missed is that it only happens if there's a collision.

      Using your numbers:

      1 : 1/8 2 : 2/8 3 : 2/8 ## hash colision 4 : 2/8 ## hash colision 5 : 3/8 6 : 4/8 7 : 5/8 8 : 6/16 ## 8 == 8 => split 9 : 6/16 10 : 7/16 11 : 7/16 ## hash colision 12 : 8/16 13 : 9/16 14 : 9/16 ## hash colision 15 : 9/16 ## hash colision 16 : 10/16 ## 16 == 16 => split 17 : 15/32 18 : 15/32 ## hash colision 19 : 16/32 20 : 16/32 ## hash colision 21 : 16/32 ## hash colision 22 : 17/32 23 : 17/32 ## hash colision 24 : 18/32 25 : 19/32 26 : 19/32 ## hash colision 27 : 20/32 28 : 20/32 ## hash colision 29 : 20/32 ## hash colision 30 : 20/32 ## hash colision 31 : 21/32 32 : 25/64 ## 32 == 32 => split ... 60 : 40/64 61 : 40/64 ## hash colision 62 : 40/64 ## hash colision 5/8ths 63 : 41/64 64 : 51/128 ## 64 == 64 => split ... 128 : 85/128 129 : 106/256 ## 128 == 128 => split 130 : 107/256

      There's a second condition that causes a split: A degenerate hash is detected. A degenerate hash is one that has a bucket with so many element as to make it slow to find keys in that bucket. I didn't try to determine the exact condition for when this occurs.

        Indeed. Thank you for the correction. I guess I was concentrating on the wrong column.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

Re: When does Perl double the number of buckets in hash?
by ikegami (Pope) on Dec 01, 2011 at 00:49 UTC

    There's no reason to split after inserting into an empty bucket. A split only happens if 1) there's a collision (counter is true) and 2) the number of elements (including the newly inserted element) is greater than the number of buckets (xhv->xhv_keys > (IV)xhv->xhv_max> is true).

Re: When does Perl double the number of buckets in hash?
by Anonymous Monk on Dec 02, 2011 at 06:18 UTC
    for the following code: %hash=(1,2,3,4,5,6,7,8,9,10,11,12,15,16,22,34,88,99); $val=%hash; print $val; the output is :7/8 can some-one Explain???
      the output is :7/8 can some-one Explain???

      No! It is easy to see that keys 11 & 15 each hash to the same bucket as one of the previous keys, as the buckets used number doesn't increase when they are added:

      $hash{ $_->[0] } = $_->[1] and print "@$_\t", scalar keys %hash, scala +r %hash for [1,2],[3,4],[5,6],[7,8],[9,10],[11,12],[15,16],[22,34],[88,99] +;; 1 2 1 1/8 3 4 2 2/8 5 6 3 3/8 7 8 4 4/8 9 10 5 5/8 11 12 6 5/8 * 15 16 7 5/8 * 22 34 8 6/8 88 99 9 7/8 ???

      But why the size of the hash was not doubled when the key count equalled the number of buckets is not clear. It seems to suggest that this is also not the complete story here?


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

        In your example, doubling does not occur because there is no collision. As I understand, the Perl code in hv.c first checks if there is a collision. If there is, the code compares the total number of keys (including the one we just added) with the number of buckets. If the former is greater or equal to the latter, then the number of buckets is doubled. Otherwise, Perl checks the total number of keys in that particular bucket. If there are more than HV_MAX_LENGTH_BEFORE_SPLIT (set to 14) keys in that bucket, the number of buckets is also doubled.

        I still don't understand one thing, though. In hv.c code we have the comparison (xhv->xhv_keys > (IV)xhv->xhv_max). This seems to suggest that doubling occurs only when the number of keys (including the new one) is more than the number of buckets. But as I shown in the original post, the doubling can occur even when the number of keys equals to the number of buckets.

        As you can see, the used number of buckets went up (6⇒7), so a collision did not occur, so there was no reason for a split.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://940953]
Approved by philipbailey
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (11)
As of 2014-12-21 16:52 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (106 votes), past polls