Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Perl Internals: Hashes

by Kozz (Friar)
on Apr 14, 2004 at 20:40 UTC ( #345200=perlquestion: print w/ replies, xml ) Need Help??
Kozz has asked for the wisdom of the Perl Monks concerning the following question:

Wise monks:

Though I've been using Perl for several years, I've always taken for granted the ease of using hashes in all their incarnations. Now I've returned to school (for a CS degree) and while taking many data structures classes, have been wondering more about what are the internals of Perl's hashes.

I've looked online to find more info on how perl hashing works, and though some articles explain the different hash functions used through different perl versions, it's not clear:

  1. What is the initial size of a hash in memory when created? (e.g. declared "my %hash;", but no data in it yet)
  2. What method is used to "grow" the hash? Doubling & re-hashing? Linear hashing? Something else?

I had written a small Perl script to perform a small task for an algorithms class, and the TA asked me about its true space efficiency versus another method which didn't use hashes, and I realized I couldn't tell him with any confidence how much memory my hash took up.

Any monks familiar with internals that would like to explain it? I suppose I could wade through the source, but... eek! I'm not a C programmer, either. Thanks!

Addendum / Bonus Question: Do Perl hashes differ in implementation versus "associative arrays" that appear in other languages? Do most other languages implement associative arrays internally in hash tables?

Comment on Perl Internals: Hashes
Re: Perl Internals: Hashes
by Flame (Deacon) on Apr 14, 2004 at 21:03 UTC
    Well, I can't say as I know anything about the internal implementation, but the size consumed by the structure can usually be checked through Devel::Size.




    My code doesn't have bugs, it just develops random features.

    Flame

      Devel::Size also has a good description of how perl hashes work.

        Heh, I admit, I didn't actually look at the docs, I just knew it had that function through reading other nodes. Heh.




        My code doesn't have bugs, it just develops random features.

        Flame

Re: Perl Internals: Hashes
by BrowserUk (Pope) on Apr 14, 2004 at 21:06 UTC

    This picture, which is part of Perlguts Illustrated will give you some clues.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail

      Gisle's Perlguts Illustrated really is fantastic. It ought to come with the standard distribution.

Re: Perl Internals: Hashes
by demerphq (Chancellor) on Apr 14, 2004 at 21:14 UTC

    Ok, I havent looked at the internals extremely closely but i can give you the following broad run down. Some guru no doubt will come along and paint a better picture later.

    The hash implementation uses chained address hashing with a floating number of buckets that is always a power of 2. Keys are calculated and then the appropriate bits are masked off depending on the number of buckets. When the ratio of items in the hash (its more complex than buckets used, or number of items in a bucket) to buckets used is exceeds some limit (on store) the size of the bucket array is doubled and all the keys are remasked and reassigned. So often the number of buckets can be dramatically larger than the number of keys stored. (You can see this by stringifying a hash.) OTOH, when you use many hashes with long keys that are identical perl actually saves memory by only storing the keys once. All hashes used in perl share their keys, which is one of the reasons hash keys arent actually SV's.

    C:\>perl -le "for (0..9) { print ''.%_; for (($_*10)..(($_+1)*10)) { $ +_{$_}++ } }" 0 10/16 16/32 23/32 34/64 38/64 42/64 57/128 62/128 67/128

    Which shows that the array starts with 16 elements, and then doubles.

    I think the question of memory optimisation versus speed optimization versus speed of development optimiztion is pretty hard to call given what you've said. If you were doing a proof of concept for an algorithm then it shouldnt matter how efficient the underlying tools are i would have thought.

    Regarding your bonus question, almost certainly yes they are different to most other languages. Hashes form a central part of perl-think. A big part of how perl works is implemented via hashes. Most associative arrays ive seen have been tree or list structures. I dont think many other languages implement associative arrays internally as hash tables, but im talking out of my ass when im saying it. :-)


    ---
    demerphq

      First they ignore you, then they laugh at you, then they fight you, then you win.
      -- Gandhi


      Pardon my pedantics, but
      Which shows that the array starts with 16 elements, and then doubles.
      You've shown that with 10 elements, it's at 16 buckets. Trying this

      c:\>perl -le "for (0..10){print qq($_: ) . %_;$_{$_}++}" 0: 0 1: 1/8 2: 2/8 3: 3/8 4: 3/8 5: 3/8 6: 4/8 7: 5/8 8: 7/16 9: 8/16 10: 9/16
      shows that it starts at 8 buckets, and shifts to 16 buckets at 8 entries (though only 7 buckets have entries).

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

        Corrections are always welcome. :-)


        ---
        demerphq

          First they ignore you, then they laugh at you, then they fight you, then you win.
          -- Gandhi


Re: Perl Internals: Hashes
by kvale (Monsignor) on Apr 14, 2004 at 21:20 UTC
    First the easy question: hashes are a kind of associative array.

    The hash code can be found in hv.c in the perl distribution. The routine

    HV * Perl_newHV(pTHX) { register HV *hv; register XPVHV* xhv; hv = (HV*)NEWSV(502,0); sv_upgrade((SV *)hv, SVt_PVHV); xhv = (XPVHV*)SvANY(hv); SvPOK_off(hv); SvNOK_off(hv); #ifndef NODEFAULT_SHAREKEYS HvSHAREKEYS_on(hv); /* key-sharing on by default */ #endif xhv->xhv_max = 7; /* HvMAX(hv) = 7 (start with 8 buckets +) */ xhv->xhv_fill = 0; /* HvFILL(hv) = 0 */ xhv->xhv_pmroot = 0; /* HvPMROOT(hv) = 0 */ (void)hv_iterinit(hv); /* so each() will start off right */ return hv; }
    creates a new hash. We can see that the basic memory needed is 502 bytes and intially, 8 buckets are created.

    For growing a hash as elements are added, the hashing algorithm is a one-at-a-time method. To grow the hash, I think the relevant routine is

    STATIC void S_more_he(pTHX) { register HE* he; register HE* heend; XPV *ptr; New(54, ptr, 1008/sizeof(XPV), XPV); ptr->xpv_pv = (char*)PL_he_arenaroot; PL_he_arenaroot = ptr; he = (HE*)ptr; heend = &he[1008 / sizeof(HE) - 1]; PL_he_root = ++he; while (he < heend) { HeNEXT(he) = (HE*)(he + 1); he++; } HeNEXT(he) = 0; }
    which looks like a linear algorithm to me.

    -Mark

      ... hv = (HV*)NEWSV(502,0); ...

      ... creates a new hash. We can see that the basic memory needed is 502 bytes

      That's not correct: the first parameter to NEWSV is an identifier, used only for malloc debugging. It is the second parameter that provides the size, and a value of zero means "give me a new empty SV - I'll fill it in myself".

      It is the following call to sv_upgrade((SV *)hv, SVt_PVHV) that actually turns this into something capable of storing a hash. In particular it allocates the struct xpvhv, defined in hv.h:

      struct xpvhv { char * xhv_array; /* pointer to malloced string */ STRLEN xhv_fill; /* how full xhv_array currently is */ STRLEN xhv_max; /* subscript of last element of xhv_array */ IV xhv_keys; /* how many elements in the array */ NV xnv_nv; /* numeric value, if any */ #define xhv_placeholders xnv_nv MAGIC* xmg_magic; /* magic for scalar array */ HV* xmg_stash; /* class package */ I32 xhv_riter; /* current root of iterator */ HE *xhv_eiter; /* current entry of iterator */ PMOP *xhv_pmroot; /* list of pm's for this package */ char *xhv_name; /* name, if a symbol table */ };

      Also the number of buckets does increase in powers of 2, as shown by the example code in demerphq's response. S_more_he() is not relevant to this - it is (I think) fetching a new structure from a malloc arena. The routine you want is S_hsplit() which quite early on does a nice clear:

      register I32 newsize = oldsize * 2;

      Hugo

Re: Perl Internals: Hashes
by hardburn (Abbot) on Apr 14, 2004 at 21:24 UTC

    Do Perl hashes differ in implementation versus "associative arrays" that appear in other languages?

    Maybe. Looking up key/value pairs doesn't necessarily have to be done by a hash, but it's certainly the first data structure I'd reach for when I needed that sort of functionality (even before I knew a lot of Perl).

    There are a lot of details about hash implementations that will make tradeoffs for various system resources. I doubt any language implements them exactly the same way perl does (and I wouldn't be surprised if many details under Ponie will be different from the traditional perl implementation). In fact, the exact implementation details of hashes has changed many times just in Perl5.

    ----
    : () { :|:& };:

    Note: All code is untested, unless otherwise stated

Re: Perl Internals: Hashes
by ambrus (Abbot) on Apr 14, 2004 at 21:25 UTC

    This is not a full answer, just some points.

    As for the size of a hash, when a new hash with one element is created, it has 8 hash buckets initially. I don't know how it is grown, but you can check it by experimenting: evaluate a hash in scalar context, and you'll get a string containing eg "3/8" if it has 8 hash buckets of which 3 is not empty. As for the actual implementation, I can't say anything, but you can try to read perl's source, esp. hv.c.

    For the bonus question, the C++ standard library implements associative arrays (they call it map) as tree structures, specifically red-black trees. Other implementations then gcc's library can use other kind of trees, but they can't use a hash table as the keys can be anything, and all the functions get about the keys is a comparision function. Ruby uses a hash table. Some lisps have balanced trees too. Gcc has functions for balanced trees too.

    Question: does awk use hash-tables or trees or it depends on version.

    Only slightly related to your question is that the linux kernel itself uses red-black trees for some memory-management functions; it might use hashes too somewhere. Reiserfs has a balanced-tree structure (where the key itself is the hash-key of the filename).

Re: Perl Internals: Hashes
by Kozz (Friar) on Apr 14, 2004 at 21:48 UTC
    Thanks to all for your insight and posts of example code. Very helpful! I know more than I did a few hours ago about Perl. ;)
Re: Perl Internals: Hashes
by flyingmoose (Priest) on Apr 14, 2004 at 22:33 UTC
    A few comments:

    A) Interesting thread and question B) Use the source, Luuuuuuuuuuuuke! C) Isn't using Perl in datastructures class kinda like cheating? :) Datastructures isn't appreciated as an artform unless you are doing it in raw C. My class used a bit of java here and there, and honestly, I got robbed out of my tuition for that one. Fortunately, I'm a lot better at that kind of thing now.

      Regarding C) No. Definately not (IMO anyway). Did the fact that I implemented 2-3 trees (aka red-black trees) in perl rob me of understanding 2-3 trees? Did it with huffman compression? Did it with LZW compression? Treaps? No no, perl frees you from all the BS junk and allows you to narrow down quickly on the algorithm at hand. It may not be as efficient as the same thing coded in C, but its the same algotihm and thus will perform just the same way.


      ---
      demerphq

        First they ignore you, then they laugh at you, then they fight you, then you win.
        -- Gandhi


        Definitely yes -- if you are avoiding the intracies of pointer manipulation. True that trees as such can be aided by higher level languages, but there is nuance and skill learned in finer manipulation (though it be drudgery) that will allow you to perform far greater things in other languages when Perl is not always available.

        Don't do your data structures work in Perl if you want to really get something from the class. If you are just trying to avoid drudgery, well yeah, Perl helps. But do it ground up at least once, and you'll come out a changed developer.

      C? Bah!, that's making it too easy. Write algorithms in MIX, the way God and Knuth intended it!

      More seriously, perl is an excellent language for learning algorithms. One can get to the crux of the algorithm without fussing with memory management and other low-level irritations. The resulting simpler, cleaner code allows one to see what is going on. And the elimination of a separate compile step allows one to play with code in a more exploratory, interactive fashion. The book Mastering Algorithms with Perl is a fine example of how fun and easy learning algorithms can be.

      I also think that having a hash as a fundamental part of the language makes a big difference in the ease of designing and implementing algorithms. Set operations become simple, as do tree and graph algorithms. There was an interesting study by Lutz Prechelt comparing programming efficiency across a wide range of languages. In it there was a text-processing task that was a natural for hashes. Despite both languages having hashes available through libraries, the C++ and Java folks tended to instead write huge amounts of code implementing n-ary trees, etc. instead of taking the easy way. Unless speed or memory limitations prevent it, it is generally a good idea to use the high-level tools you you have at hand.

      -Mark

        C? Bah!, that's making it too easy. Write algorithms in MIX, the way God and Knuth intended it!

        Good ole GNU to the rescue, here is a MIX SDK. Now all we need is someone to write Inline::MIX ;-P

        -stvn
Re: Perl Internals: Hashes
by bl0rf (Pilgrim) on Apr 15, 2004 at 02:03 UTC
Re: Perl Internals: Hashes
by Aragorn (Curate) on Apr 15, 2004 at 06:43 UTC

      I completely agree with aragorn, Simon's document taught me to be a better perl programmer too, knowing what is going on under-the-hood has helped in making some over-the-hood descisions.

      And if you are interested in seeing the datastructures Simon describes in real-life. I would recommend exploring the B modules, which Simon touches on at the end of his document. I asked the monostarty about the B modules a little while back, and got some good answers and links here.

      -stvn
Re: Perl Internals: Hashes
by gnork (Scribe) on Apr 15, 2004 at 08:20 UTC
    I wondered for a long time why
    sub something { my %hash = @_ print $hash{1} . "\n"; } %test; $test{1} = 1; something(%test);
    works like a charme. Is it some internal magic or simply that hashes and named arrays are implemented not that different?

    cat /dev/world | perl -e "(/(^.*? \?) 42\!/) && (print $1))"
    errors->(c)

      The parameter part of a subroutine call is normally a list. If a hash or an array gets put in one they are listified. In the case of a hash this means a list of key/value pairs. Likewise, if an array contains an even number of elements you can "pour" it into a hash and the elements will be paired off into key/value slots.


      ---
      demerphq

        First they ignore you, then they laugh at you, then they fight you, then you win.
        -- Gandhi


      To expound on demerphq's response with an example:

      Given a hash, like so:

      my %test; $test{a} = 1; $test{b} = 2;
      When you call a subroutine with an array or hash, it gets flattened into a list. For a hash, that means giving all of its keys and corresponding values. So this:
      something(%test);
      is the same as this:
      something(a, 1, b, 2);
      (Note that when hashes are flattened, the keys may come out in any order, so it could have also been something(b, 2, a, 1);.)

      The other side of this situation is that you can fill a hash directly with a list, instead of doing each key one-at-a-time. When you assign to a hash with a list, each odd item becomes a key, and the corresponding even item is its value. So we could have filled the hash like this:

      %test = (a, 1, b, 2);
      and it would be exactly the same. Note that this is also the same as the more familiar:
      %test = (a => 1, b => 2);
      as the => is just a type of comma that auto-quotes its left side operand. Now you can see what happens:
      something(%test); # is like something(a, 1, b, 2);
      And inside the sub:
      my %hash = @_; # is like %hash = (a, 1, b, 2);
      And so it turns out that you're just doing a basic copy. The hash inside the sub will be a copy of the one you passed in.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://345200]
Approved by diotalevi
Front-paged by Thelonius
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (6)
As of 2014-12-18 01:50 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (41 votes), past polls