Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

how are the number of buckets in a perl hash chosen

by david2008 (Scribe)
on Jun 09, 2014 at 07:37 UTC ( #1089235=perlquestion: print w/ replies, xml ) Need Help??
david2008 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I take now a refresher on algorithms with the coursera course https://class.coursera.org/algo-005 given by Tim Roughgarden.
There is a whole week devoted to hash tables and bloom filters.
He mentioned there that the numbers of buckets in a hash should be a prime number, which is not close to a power of 2 or 10.
I use hashes in a perl on a daily basis and wanted to see how the theory is translated to the real world :-).
Firs i created a small hash %c= (a=>3,4=>5)
and saw that the number of buckets are 8. Then i wrote a small code:
use strict; use warnings; my %h; foreach my $i (0..10000) { $h{$i} = $i+10; } my $j = %h; print "$j";
and got the result 7391/16384.
Apparently the number of buckets are not prime and are divided by two.
I know from my experience that the perl hashes work amazingly fast and efficient. So how can this be explained?
Thanks,
David

Comment on how are the number of buckets in a perl hash chosen
Download Code
Re: how are the number of buckets in a perl hash chosen
by Anonymous Monk on Jun 09, 2014 at 07:52 UTC
      If i did understand exactly, from your example, that the power of 2 should be a prime.
      I tried it on different examples and saw that this is not correct every time.
      perl -E "%c = (3=>5);keys(%c) = 33;say scalar(%c)" 1/64
      Maybe i did not understand it 100%.
      I looked at perldata and keys and saw indeed that the number of buckets is a power of 2 but it was not explained why.

        If i did understand exactly, from your example, that the power of 2 should be a prime.

        I don't know, wolfram says "prime factorization"...

        I looked at perldata and keys and saw indeed that the number of buckets is a power of 2 but it was not explained why.

        Oh, why? because :) its coded that way :D

        newsize &= ~(newsize & (1 + ~newsize)); /* get proper power of 2 */ .. Newxz(a, PERL_HV_ARRAY_ALLOC_BYTES(newsize), char);

        Programmers just love to do things in powers of two

        perl hash presize why power of 2 How Hashes Really Work - Perl.com

        In principle, a hashing function returns an array index directly; in practice, it is common to use its (arbitrary) return value modulo the number of buckets as the actual index. (Using a prime number of buckets that is not too close to a power of two tends to produce a sufficiently uniform key distribution.)
        A Hash Function for Hash Table Lookup
        The best hash table sizes are powers of 2. There is no need to do mod a prime (mod is sooo slow!).
        Hash table
        In the case that the array size is a power of two, the remainder operation is reduced to masking, which improves speed, but can increase problems with a poor hash function.

        So there you have it, power of two, its for the speed

Re: how are the number of buckets in a perl hash chosen
by oiskuu (Friar) on Jun 09, 2014 at 08:42 UTC

    Choosing a good hash function is a bit of a black magic. Modulo prime might yield a better bucket distribution in simple cases with simple hash functions, e.g. when the keys fall into regular label+number pattern.

    In practice, the modulo operation (by a variable) is often insufferably slow compared to arithmetic shift or masking. Therefore, power-of-two bucket sizes are preferred, together with a robust hash function. Strong (and randomized) hash function is a necessity nowadays. Does the course mention security aspects of generic hash implementations, such as resistance to DoS attacks?

      It did speak generally about security without going into the gory details. We learned about universal hashing and why it is important.
      It was not mentioned that the modulo operation is slow. The fact that the perl hash function manages to spread the entries evenly into buckets, despite the fact that the number of buckets is even is surprising.
      I guess that i will have to read the source code to understand exactly how this happens.
      This is also what Tim Roughgarden suggests. He says that if we want to understand a data structure well, we should either read a text book about the issue, or an open source implementation.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1089235]
Front-paged by boftx
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (7)
As of 2014-12-21 06:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (104 votes), past polls