Re: Perl Internals: Hashes
by BrowserUk (Patriarch) on Apr 14, 2004 at 21:06 UTC
|
This picture, which is part of Perlguts Illustrated will give you some clues.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
| [reply] [Watch: Dir/Any] [d/l] |
|
| [reply] [Watch: Dir/Any] |
Re: Perl Internals: Hashes
by demerphq (Chancellor) on Apr 14, 2004 at 21:14 UTC
|
Ok, I havent looked at the internals extremely closely but i can give you the following broad run down. Some guru no doubt will come along and paint a better picture later.
The hash implementation uses chained address hashing with a floating number of buckets that is always a power of 2. Keys are calculated and then the appropriate bits are masked off depending on the number of buckets. When the ratio of items in the hash (its more complex than buckets used, or number of items in a bucket) to buckets used is exceeds some limit (on store) the size of the bucket array is doubled and all the keys are remasked and reassigned. So often the number of buckets can be dramatically larger than the number of keys stored. (You can see this by stringifying a hash.) OTOH, when you use many hashes with long keys that are identical perl actually saves memory by only storing the keys once. All hashes used in perl share their keys, which is one of the reasons hash keys arent actually SV's.
C:\>perl -le "for (0..9) { print ''.%_; for (($_*10)..(($_+1)*10)) { $
+_{$_}++ } }"
0
10/16
16/32
23/32
34/64
38/64
42/64
57/128
62/128
67/128
Which shows that the array starts with 16 elements, and then doubles.
I think the question of memory optimisation versus speed optimization versus speed of development optimiztion is pretty hard to call given what you've said. If you were doing a proof of concept for an algorithm then it shouldnt matter how efficient the underlying tools are i would have thought.
Regarding your bonus question, almost certainly yes they are different to most other languages. Hashes form a central part of perl-think. A big part of how perl works is implemented via hashes. Most associative arrays ive seen have been tree or list structures. I dont think many other languages implement associative arrays internally as hash tables, but im talking out of my ass when im saying it. :-)
---
demerphq
First they ignore you, then they laugh at you, then they fight you, then you win.
-- Gandhi
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
| [reply] [Watch: Dir/Any] [d/l] |
|
| [reply] [Watch: Dir/Any] [d/l] |
Re: Perl Internals: Hashes
by kvale (Monsignor) on Apr 14, 2004 at 21:20 UTC
|
HV *
Perl_newHV(pTHX)
{
register HV *hv;
register XPVHV* xhv;
hv = (HV*)NEWSV(502,0);
sv_upgrade((SV *)hv, SVt_PVHV);
xhv = (XPVHV*)SvANY(hv);
SvPOK_off(hv);
SvNOK_off(hv);
#ifndef NODEFAULT_SHAREKEYS
HvSHAREKEYS_on(hv); /* key-sharing on by default */
#endif
xhv->xhv_max = 7; /* HvMAX(hv) = 7 (start with 8 buckets
+) */
xhv->xhv_fill = 0; /* HvFILL(hv) = 0 */
xhv->xhv_pmroot = 0; /* HvPMROOT(hv) = 0 */
(void)hv_iterinit(hv); /* so each() will start off right */
return hv;
}
creates a new hash. We can see that the basic memory needed is 502 bytes and intially, 8 buckets are created.
For growing a hash as elements are added, the hashing algorithm is a one-at-a-time method. To grow the hash, I think the relevant routine is
STATIC void
S_more_he(pTHX)
{
register HE* he;
register HE* heend;
XPV *ptr;
New(54, ptr, 1008/sizeof(XPV), XPV);
ptr->xpv_pv = (char*)PL_he_arenaroot;
PL_he_arenaroot = ptr;
he = (HE*)ptr;
heend = &he[1008 / sizeof(HE) - 1];
PL_he_root = ++he;
while (he < heend) {
HeNEXT(he) = (HE*)(he + 1);
he++;
}
HeNEXT(he) = 0;
}
which looks like a linear algorithm to me.
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
... hv = (HV*)NEWSV(502,0); ...
... creates a new hash. We can see that the basic memory needed is 502 bytes
That's not correct: the first parameter to NEWSV is an identifier, used only for malloc debugging. It is the second parameter that provides the size, and a value of zero means "give me a new empty SV - I'll fill it in myself".
It is the following call to sv_upgrade((SV *)hv, SVt_PVHV) that actually turns this into something capable of storing a hash. In particular it allocates the struct xpvhv, defined in hv.h:
struct xpvhv {
char * xhv_array; /* pointer to malloced string */
STRLEN xhv_fill; /* how full xhv_array currently is */
STRLEN xhv_max; /* subscript of last element of xhv_array */
IV xhv_keys; /* how many elements in the array */
NV xnv_nv; /* numeric value, if any */
#define xhv_placeholders xnv_nv
MAGIC* xmg_magic; /* magic for scalar array */
HV* xmg_stash; /* class package */
I32 xhv_riter; /* current root of iterator */
HE *xhv_eiter; /* current entry of iterator */
PMOP *xhv_pmroot; /* list of pm's for this package */
char *xhv_name; /* name, if a symbol table */
};
Also the number of buckets does increase in powers of 2, as shown by the example code in demerphq's response. S_more_he() is not relevant to this - it is (I think) fetching a new structure from a malloc arena. The routine you want is S_hsplit() which quite early on does a nice clear: register I32 newsize = oldsize * 2;
Hugo | [reply] [Watch: Dir/Any] [d/l] [select] |
Re: Perl Internals: Hashes
by Flame (Deacon) on Apr 14, 2004 at 21:03 UTC
|
Well, I can't say as I know anything about the internal implementation, but the size consumed by the structure can usually be checked through Devel::Size.
My code doesn't have bugs, it just develops random features.
Flame
| [reply] [Watch: Dir/Any] |
|
Devel::Size also has a good description of how perl hashes work.
| [reply] [Watch: Dir/Any] |
|
Heh, I admit, I didn't actually look at the docs, I just knew it had that function through reading other nodes. Heh.
My code doesn't have bugs, it just develops random features.
Flame
| [reply] [Watch: Dir/Any] |
Re: Perl Internals: Hashes
by ambrus (Abbot) on Apr 14, 2004 at 21:25 UTC
|
This is not a full answer, just some points.
As for the size of a hash, when a new hash with one
element is created, it has 8 hash buckets initially.
I don't know how it is grown, but you can check it
by experimenting: evaluate a hash in scalar
context, and you'll get a string containing eg "3/8" if it has
8 hash buckets of which 3 is not empty.
As for the actual implementation, I can't say anything,
but you can try to read perl's source, esp. hv.c.
For the bonus question, the C++ standard library
implements associative arrays (they call it map)
as tree structures, specifically red-black trees.
Other implementations then gcc's library can use other kind
of trees, but they can't use a hash table as
the keys can be anything, and all the functions
get about the keys is a comparision function.
Ruby uses a hash table.
Some lisps have balanced trees too.
Gcc has functions for balanced trees too.
Question: does awk use hash-tables or trees or it depends on version.
Only slightly related to your question is that the linux kernel itself uses red-black trees for some
memory-management functions; it might use hashes too somewhere.
Reiserfs has a balanced-tree structure (where the key itself
is the hash-key of the filename).
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re: Perl Internals: Hashes
by hardburn (Abbot) on Apr 14, 2004 at 21:24 UTC
|
Do Perl hashes differ in implementation versus "associative arrays" that appear in other languages?
Maybe. Looking up key/value pairs doesn't necessarily have to be done by a hash, but it's certainly the first data structure I'd reach for when I needed that sort of functionality (even before I knew a lot of Perl).
There are a lot of details about hash implementations that will make tradeoffs for various system resources. I doubt any language implements them exactly the same way perl does (and I wouldn't be surprised if many details under Ponie will be different from the traditional perl implementation). In fact, the exact implementation details of hashes has changed many times just in Perl5.
----
: () { :|:& };:
Note: All code is untested, unless otherwise stated
| [reply] [Watch: Dir/Any] [d/l] |
Re: Perl Internals: Hashes
by bl0rf (Pilgrim) on Apr 15, 2004 at 02:03 UTC
|
| [reply] [Watch: Dir/Any] |
Re: Perl Internals: Hashes
by Aragorn (Curate) on Apr 15, 2004 at 06:43 UTC
|
| [reply] [Watch: Dir/Any] |
|
I completely agree with aragorn, Simon's document taught me to be a better perl programmer too, knowing what is going on under-the-hood has helped in making some over-the-hood descisions.
And if you are interested in seeing the datastructures Simon describes in real-life. I would recommend exploring the B modules, which Simon touches on at the end of his document. I asked the monostarty about the B modules a little while back, and got some good answers and links here.
| [reply] [Watch: Dir/Any] |
Re: Perl Internals: Hashes
by Kozz (Friar) on Apr 14, 2004 at 21:48 UTC
|
Thanks to all for your insight and posts of example code. Very helpful! I know more than I did a few hours ago about Perl. ;) | [reply] [Watch: Dir/Any] |
Re: Perl Internals: Hashes
by flyingmoose (Priest) on Apr 14, 2004 at 22:33 UTC
|
A few comments:
A) Interesting thread and question
B) Use the source, Luuuuuuuuuuuuke!
C) Isn't using Perl in datastructures class kinda like cheating? :) Datastructures isn't appreciated as an artform unless you are doing it in raw C. My class used a bit of java here and there, and honestly, I got robbed out of my tuition for that one. Fortunately, I'm a lot better at that kind of thing now.
| [reply] [Watch: Dir/Any] |
|
Regarding C) No. Definately not (IMO anyway). Did the fact that I implemented 2-3 trees (aka red-black trees) in perl rob me of understanding 2-3 trees? Did it with huffman compression? Did it with LZW compression? Treaps? No no, perl frees you from all the BS junk and allows you to narrow down quickly on the algorithm at hand. It may not be as efficient as the same thing coded in C, but its the same algotihm and thus will perform just the same way.
---
demerphq
First they ignore you, then they laugh at you, then they fight you, then you win.
-- Gandhi
| [reply] [Watch: Dir/Any] [d/l] |
|
Definitely yes -- if you are avoiding the intracies of pointer manipulation. True that trees as such can be aided by higher level languages, but there is nuance and skill learned in finer manipulation (though it be drudgery) that will allow you to perform far greater things in other languages when Perl is not always available.
Don't do your data structures work in Perl if you want to really get something from the class. If you are just trying to avoid drudgery, well yeah, Perl helps. But do it ground up at least once, and you'll come out a changed developer.
| [reply] [Watch: Dir/Any] |
|
|
|
|
C? Bah!, that's making it too easy. Write algorithms in MIX, the way God and Knuth intended it!
More seriously, perl is an excellent language for learning algorithms. One can get to the crux of the algorithm without fussing with memory management and other low-level irritations. The resulting simpler, cleaner code allows one to see what is going on. And the elimination of a separate compile step allows one to play with code in a more exploratory, interactive fashion. The book
Mastering Algorithms with Perl is a fine example of how fun and easy learning algorithms can be.
I also think that having a hash as a fundamental part of the language makes a big difference in the ease of designing and implementing algorithms. Set operations become simple, as do tree and graph algorithms. There was an interesting study by Lutz Prechelt comparing programming efficiency across a wide range of languages. In it there was a text-processing task that was a natural for hashes. Despite both languages having hashes available through libraries, the C++ and Java folks tended to instead write huge amounts of code implementing n-ary trees, etc. instead of taking the easy way. Unless speed or memory limitations prevent it, it is generally a good idea to use the high-level tools you you have at hand.
| [reply] [Watch: Dir/Any] |
|
| [reply] [Watch: Dir/Any] |
Re: Perl Internals: Hashes
by gnork (Scribe) on Apr 15, 2004 at 08:20 UTC
|
I wondered for a long time why
sub something {
my %hash = @_
print $hash{1} . "\n";
}
%test;
$test{1} = 1;
something(%test);
works like a charme. Is it some internal magic or simply that hashes and named arrays are implemented not that different?
cat /dev/world | perl -e "(/(^.*? \?) 42\!/) && (print $1))"
errors->(c)
| [reply] [Watch: Dir/Any] [d/l] |
|
The parameter part of a subroutine call is normally a list. If a hash or an array gets put in one they are listified. In the case of a hash this means a list of key/value pairs. Likewise, if an array contains an even number of elements you can "pour" it into a hash and the elements will be paired off into key/value slots.
---
demerphq
First they ignore you, then they laugh at you, then they fight you, then you win.
-- Gandhi
| [reply] [Watch: Dir/Any] [d/l] |
|
my %test;
$test{a} = 1;
$test{b} = 2;
When you call a subroutine with an array or hash, it gets flattened into a list. For a hash, that means giving all of its keys and corresponding values. So this:
something(%test);
is the same as this:
something(a, 1, b, 2);
(Note that when hashes are flattened, the keys may come out in any order, so it could have also been something(b, 2, a, 1);.)
The other side of this situation is that you can fill a hash directly with a list, instead of doing each key one-at-a-time. When you assign to a hash with a list, each odd item becomes a key, and the corresponding even item is its value. So we could have filled the hash like this:
%test = (a, 1, b, 2);
and it would be exactly the same. Note that this is also the same as the more familiar:
%test = (a => 1, b => 2);
as the => is just a type of comma that auto-quotes its left side operand.
Now you can see what happens:
something(%test); # is like something(a, 1, b, 2);
And inside the sub:
my %hash = @_; # is like %hash = (a, 1, b, 2);
And so it turns out that you're just doing a basic copy. The hash inside the sub will be a copy of the one you passed in.
| [reply] [Watch: Dir/Any] [d/l] [select] |