|Problems? Is your data what you think it is?|
Hi folks. Recent meditations on sorted hash implementations reminded me of a module that I hacked together about a year ago and never did much with. I dug it out and did a little bit of extra stuff on it to make it complete, wrote some tests for it (not enough as yet alas) and now i'm presenting here. Partially becuase I think some people might find it interesting, partially because I want some critique.
I actually debated posting it to SOPW, but as I think the module illustrates a number of interesting concepts, and is also quite large having it in meditations means it can be altered and changes posted to it without having to repost the entire thing. (Also I have lost some of the links that I should include so until I find them I want the node editable.)/me can hear people saying
Interesting concepts!? Thats a mess, not an interesting concept.
Well, *ahem*, yes. :-) I wrote this module as a learning excerise. I wanted to understand how the "Random Treap" algorithm worked. You see I have this weakness for self organizing data structures, especially tree based ones. Binary tress, 2-3 trees, huffman trees, B+Trees, etc all fascinate me. And when I discover a new one and can find the time I like to implement them to really grok how they work. In fact when I learn a new language I tend to make doing a 2-3 tree implementation one of the first pieces of code that I write. Its a tricky but cool algorithm that I find requires you to become aquainted with most aspects of the core of a language in a single project. And so when I found out about treaps1 as soon as I had time I downloaded a good paper on them1 and started coding. At the same time I was playing with other ideas, such as custom imports and namespace games, as well as being extremely unhappy with the interface of the well known Heap:: modules of "Mastering Algorithms in Perl". So naturally I rolled them all into one.
Whats a treap?
Before we get to details of the code, lets briefly disucss what a Treap is. To grok Treaps you will need to think a moment about a tree organized in heaporder, and a tree organized in inorder.
A tree in heap order is one where the node invariant is that the parents value is in some relation to both children. For instance if the ordering is from lowest to highest then the relation is such that the parent is less than either of its children, regardless of which side. This means there is no direct relationship between the two children, except that they are both greater than the parent.
A tree that is organized in inorder maintains a different relationship between the parent and each child, and therefore an implied direct relationship between each child as well. In the case where inorder is defined from lowest to highest, the left child shall be less than the parent and the parent will be less than the right child. This of course implies that the left child is less than the right.
Now, a Treap is a data structure where each node has two values, and where the resulting tree is maintained InOrder by one value and in HeapOrder by the other. For convenience we will use letters for the InOrder relationship and numbers for the HeapOrder relationship. The insert procedure is actually quite simple. We insert the node as a leaf just as we would if it were a binary tree obeying only the InOrder rule. Once inserted as a leaf we rotate the node up the tree until heap order is restored. If the node as inserted is already in heap order we do nothing extra. Similarly if we want to delete a node, we simply give it a temporary infinite heap weight, and rotate it down the tree in such a way that the heap order of the tree is maintained until its a leaf node and then delete it. (The normal binary tree trick of exchanging with the predecessor or successor if its an internal node can also be used so long as the new node is then rotated into heap order afterwards.)I mentioned Node rotation which is a common concept which could be discussed in considerable depth (and indeed is on many sites1) but should just be touched on for completeness. We can always reorganize a binary tree structure that is InOrder by making the parent the child of one of its children, and moving the other child up to take its place in the tree. This "rotation" maintains the ordering of the tree, and can usually be performed in a minimal number of steps. I illustrate the two rotations here:
Threaded Trees and Parent Pointers
There are a number of interesting extensions to the standard binary tree strucutres that can be employed quite usefully to make implementing tree based Tie::Hash interfaces more efficient and easier to code. The two variants that I played with are Threading, And Left-Right Parent pointers. Threading involves adding a flag or two to the data structure and then reusing null child pointers. Lets take right threading as an example. Right threading means we add a flag to the node indicating if the right child pointer in fact points to a child or points to the successor to the node. When an insert occurs it always happens at a leaf, so when the new pointer is being assigned the old value is passed down to the new node. This means iterating the tree inorder can be done iteratively by simply following the right child. Left threading works similarly, except the predecessor is stored in unused left references instead of the successor. Originally my Treap implmentation was fully threaded. I then converted it to use Left-Right Parent pointers.
Left-Right Parent pointers are a different mechanism. In this case two additional references are stored per node, the so called Left and Right parent pointers. To understand this properly let us define an extra term, "the left|right spine" of a node as being the node itself and all nodes that can be reached by traversing only left|right from it respectively. A node is the left parent of its right childs left spine, and the right parent of its lefts childs right spine. Thus in the earlier InOrder example B is the right parent of A and the left parent of C, A's left parent is null, and C's right parent is null. This relationship may sound a bit odd, but it does allow the same features as threading, plus allows some interesting possibilities in searching (finger searches). This is the model I ended up using.
So whats the point
Well it turns out that Treaps have some interesting possibilites. For any given set of key-value pairs there is only one Treap that can store them. More interestingly by setting the value to a random value on store means that the search time for keys becomes near optimal, in effect the tree stays nearly balanced, making for efficient lookups. Maintaining this balance is on average cheaper in terms of rotations than many other self balancing trees. This can be a useful property when the tree structure is hybrid and rotations can incur expensive additional cost due to outside factors. A class called Random::Treap is included in the module that provides a sorted tied hash implementation that uses this technique.
The Heap API sucks...
The heap API chews big time IMO. I've never like using it, I dont like to have to subclass its nodes to store what I need. It's just too clumsy and counter intuitive and frankly just too much work for my taste. So I wanted something different. Something where I could put a little elbow grease in just the one time, and then have an easy time of it ever after. I believe thats called Lazyness. :-) So what I did was predetermine the nodes structure that would be used for all implementations of the tree. Instead of subclassing the nodes you subclass the base class. There are three methods that need to be overriden to change the ordering of the tree (there probably should only be two). They are
To change the behaviour of the algorithm you subclass it and override those methods as appropriate and you are done. In hindsight requiring only two methods, heap_cmp and key_cmp would have been better and I'll probably change it at some point. I dont remember the reason I went this way now, probably that it made the code a little easier to deal with.
Custom Import and Name Space Games
Package namespace games get involved because I think that Algorithm::Treap is the correct name for the module in terms of the CPAN namespace heirarchy, but I dont like to have to write Algorithm more than once in a program. Id much rather be able to talk about Treap's and not Algorithm::Treap's. So the package looks like this:
This means that when you use Algorithm::Treap you are actually using Treap, but perl knows to look in a different place. This can be a handy trick for dealing with annoyingly long package names while still having a deep file heirarchy and not a flat one.
Now we get to the custom import bit. One of the things I wanted was an easier way to do the subclassing. I also was playing around with losing the method call to do the ordering of the class (in what I recall was an attempt at additional efficiency. I now see this argument as probably misguided, and probably a form of premature optimization. But the technique is interesting and useful so Ive left it in for now. Plus it works. And you know what they say... :-)
The idea is that import() acts as a source code filter and class factory. When you use the module with no options it simply loads as would any module. However if you give it arguments you are actually asking it to generate a subclass for you that doesnt use method calls for determining the order functions. Instead it uses the code you provide to modify the class as a template. (It actually does override the methods too, but they are only for utility purposes.) Thus any subclasses generated this way no longer have the ability to have their ordering behaviour overriden, as the relevent method calls have been replaced by hardcoded evaluations. While this is maybe not the best use of this technique, it was an interesting experience putting it together.
So how do we use it?
The easiest way is to use the tied hash interace, which can be easily obtained by using the class method Tied_Hash, which returns a reference to a tied hash of the appropriate class. The base class keeps the keys of the hash InOrder and the values of the hash in HeapOrder, and expects the values to be numeric.
A more interesting example is the Random::Treap class which keeps the keys inorder, and provides for random values on store for the heaporder. The values can then be anything.
And then theres the class factory interface:
Other uses are available through sub classing directly etc.
Drawing trees and List Ordering
Im a visual person and when I code something like this I like to be able to see whats going on. So there are a number of methods that are available for getting a look see as to what happening under the hood. The first is the dump_tree() method which will in void context print out the treap in ascii form, or in non-void return a string representation of it. For instance the $default hash can be viewed by
Review the code and youll find a few other ways to look at things too.
You can iterate through the nodes efficiently by getting a node from the treap then calling succ or pred on the node returned as required. The nodes have the methods key, weight, and value, which can be used. It is not correct to change key or weight directly, but value can modified as you desire. This means that you can have multiple iterators on the same data structure simultaneously without them interfering with each other. The treap object however, in order to meet the requirementes of the Tie::Hash interface only allows one iterator.
I really aught to take the time to fully document this module and upload it to CPAN. For now, I hope the monastery finds it interesting... And can offer some constructive criticism for its improvement.
Code follows in a reply, as this node itself has grown quite large. :-)
 I'll follow these up with links as I can find them again.
<Elian> And I do take a kind of perverse pleasure in having an OO assembly language...