http://www.perlmonks.org?node_id=289584

Hi folks. Recent meditations on sorted hash implementations reminded me of a module that I hacked together about a year ago and never did much with. I dug it out and did a little bit of extra stuff on it to make it complete, wrote some tests for it (not enough as yet alas) and now i'm presenting here. Partially becuase I think some people might find it interesting, partially because I want some critique.

I actually debated posting it to SOPW, but as I think the module illustrates a number of interesting concepts, and is also quite large having it in meditations means it can be altered and changes posted to it without having to repost the entire thing. (Also I have lost some of the links that I should include so until I find them I want the node editable.)

/me can hear people saying

Interesting concepts!? Thats a mess, not an interesting concept.

Well, *ahem*, yes. :-) I wrote this module as a learning excerise. I wanted to understand how the "Random Treap" algorithm worked. You see I have this weakness for self organizing data structures, especially tree based ones. Binary tress, 2-3 trees, huffman trees, B+Trees, etc all fascinate me. And when I discover a new one and can find the time I like to implement them to really grok how they work. In fact when I learn a new language I tend to make doing a 2-3 tree implementation one of the first pieces of code that I write. Its a tricky but cool algorithm that I find requires you to become aquainted with most aspects of the core of a language in a single project. And so when I found out about treaps1 as soon as I had time I downloaded a good paper on them1 and started coding. At the same time I was playing with other ideas, such as custom imports and namespace games, as well as being extremely unhappy with the interface of the well known Heap:: modules of "Mastering Algorithms in Perl". So naturally I rolled them all into one.

Whats a treap?

Before we get to details of the code, lets briefly disucss what a Treap is. To grok Treaps you will need to think a moment about a tree organized in heaporder, and a tree organized in inorder.

HeapOrderInOrder
1
23
B
AC

A tree in heap order is one where the node invariant is that the parents value is in some relation to both children. For instance if the ordering is from lowest to highest then the relation is such that the parent is less than either of its children, regardless of which side. This means there is no direct relationship between the two children, except that they are both greater than the parent.

A tree that is organized in inorder maintains a different relationship between the parent and each child, and therefore an implied direct relationship between each child as well. In the case where inorder is defined from lowest to highest, the left child shall be less than the parent and the parent will be less than the right child. This of course implies that the left child is less than the right.

Now, a Treap is a data structure where each node has two values, and where the resulting tree is maintained InOrder by one value and in HeapOrder by the other. For convenience we will use letters for the InOrder relationship and numbers for the HeapOrder relationship. The insert procedure is actually quite simple. We insert the node as a leaf just as we would if it were a binary tree obeying only the InOrder rule. Once inserted as a leaf we rotate the node up the tree until heap order is restored. If the node as inserted is already in heap order we do nothing extra. Similarly if we want to delete a node, we simply give it a temporary infinite heap weight, and rotate it down the tree in such a way that the heap order of the tree is maintained until its a leaf node and then delete it. (The normal binary tree trick of exchanging with the predecessor or successor if its an internal node can also be used so long as the new node is then rotated into heap order afterwards.)

I mentioned Node rotation which is a common concept which could be discussed in considerable depth (and indeed is on many sites1) but should just be touched on for completeness. We can always reorganize a binary tree structure that is InOrder by making the parent the child of one of its children, and moving the other child up to take its place in the tree. This "rotation" maintains the ordering of the tree, and can usually be performed in a minimal number of steps. I illustrate the two rotations here:

Left RotationRight Rotation
B
AD
  CE
D
BE
AC  
D
BE
AC  
B
AD
  CE

Threaded Trees and Parent Pointers

There are a number of interesting extensions to the standard binary tree strucutres that can be employed quite usefully to make implementing tree based Tie::Hash interfaces more efficient and easier to code. The two variants that I played with are Threading, And Left-Right Parent pointers. Threading involves adding a flag or two to the data structure and then reusing null child pointers. Lets take right threading as an example. Right threading means we add a flag to the node indicating if the right child pointer in fact points to a child or points to the successor to the node. When an insert occurs it always happens at a leaf, so when the new pointer is being assigned the old value is passed down to the new node. This means iterating the tree inorder can be done iteratively by simply following the right child. Left threading works similarly, except the predecessor is stored in unused left references instead of the successor. Originally my Treap implmentation was fully threaded. I then converted it to use Left-Right Parent pointers.

Left-Right Parent pointers are a different mechanism. In this case two additional references are stored per node, the so called Left and Right parent pointers. To understand this properly let us define an extra term, "the left|right spine" of a node as being the node itself and all nodes that can be reached by traversing only left|right from it respectively. A node is the left parent of its right childs left spine, and the right parent of its lefts childs right spine. Thus in the earlier InOrder example B is the right parent of A and the left parent of C, A's left parent is null, and C's right parent is null. This relationship may sound a bit odd, but it does allow the same features as threading, plus allows some interesting possibilities in searching (finger searches). This is the model I ended up using.

So whats the point

Well it turns out that Treaps have some interesting possibilites. For any given set of key-value pairs there is only one Treap that can store them. More interestingly by setting the value to a random value on store means that the search time for keys becomes near optimal, in effect the tree stays nearly balanced, making for efficient lookups. Maintaining this balance is on average cheaper in terms of rotations than many other self balancing trees. This can be a useful property when the tree structure is hybrid and rotations can incur expensive additional cost due to outside factors. A class called Random::Treap is included in the module that provides a sorted tied hash implementation that uses this technique.

The Heap API sucks...

The heap API chews big time IMO. I've never like using it, I dont like to have to subclass its nodes to store what I need. It's just too clumsy and counter intuitive and frankly just too much work for my taste. So I wanted something different. Something where I could put a little elbow grease in just the one time, and then have an easy time of it ever after. I believe thats called Lazyness. :-) So what I did was predetermine the nodes structure that would be used for all implementations of the tree. Instead of subclassing the nodes you subclass the base class. There are three methods that need to be overriden to change the ordering of the tree (there probably should only be two). They are

sub heap_lt { $_[1] < $_[2] }; sub key_lt { $_[1] lt $_[2] }; sub key_gt { $_[1] gt $_[2] };

To change the behaviour of the algorithm you subclass it and override those methods as appropriate and you are done. In hindsight requiring only two methods, heap_cmp and key_cmp would have been better and I'll probably change it at some point. I dont remember the reason I went this way now, probably that it made the code a little easier to deal with.

Custom Import and Name Space Games

Package namespace games get involved because I think that Algorithm::Treap is the correct name for the module in terms of the CPAN namespace heirarchy, but I dont like to have to write Algorithm more than once in a program. Id much rather be able to talk about Treap's and not Algorithm::Treap's. So the package looks like this:

package Algorithm::Treap; sub import { goto &Treap::import } 1; package Treap; #...

This means that when you use Algorithm::Treap you are actually using Treap, but perl knows to look in a different place. This can be a handy trick for dealing with annoyingly long package names while still having a deep file heirarchy and not a flat one.

Now we get to the custom import bit. One of the things I wanted was an easier way to do the subclassing. I also was playing around with losing the method call to do the ordering of the class (in what I recall was an attempt at additional efficiency. I now see this argument as probably misguided, and probably a form of premature optimization. But the technique is interesting and useful so Ive left it in for now. Plus it works. And you know what they say... :-)

The idea is that import() acts as a source code filter and class factory. When you use the module with no options it simply loads as would any module. However if you give it arguments you are actually asking it to generate a subclass for you that doesnt use method calls for determining the order functions. Instead it uses the code you provide to modify the class as a template. (It actually does override the methods too, but they are only for utility purposes.) Thus any subclasses generated this way no longer have the ability to have their ordering behaviour overriden, as the relevent method calls have been replaced by hardcoded evaluations. While this is maybe not the best use of this technique, it was an interesting experience putting it together.

So how do we use it?

The easiest way is to use the tied hash interace, which can be easily obtained by using the class method Tied_Hash, which returns a reference to a tied hash of the appropriate class. The base class keeps the keys of the hash InOrder and the values of the hash in HeapOrder, and expects the values to be numeric.

use Algorithm::Treap; my $default=Treap->Tied_Hash(A => 121, B => 674, C => 970, D => 82, E +=> 658, F => 957); $default->{D}=600; $default->{A}=500;

A more interesting example is the Random::Treap class which keeps the keys inorder, and provides for random values on store for the heaporder. The values can then be anything.

use Algorithm::Treap; my $rand=Random::Treap->Tied_Hash(); $rand->{ja}="just another\n"; $rand->{ph}="perl hacker\n";

And then theres the class factory interface:

use Algorithm::Treap IntKey => #DEBUG=>1, key_lt => '$1 < $2', key_gt => '$1 > $2', heap_lt => '$1 < $2', ; my $inttreap=Treap::IntKey->Tied_Hash(); $inttreap->{1}=10; $inttreap->{2}=5; $inttreap->{10}=5; print join(", ",keys %$inttreap),"\n";

Other uses are available through sub classing directly etc.

Drawing trees and List Ordering

Im a visual person and when I code something like this I like to be able to see whats going on. So there are a number of methods that are available for getting a look see as to what happening under the hood. The first is the dump_tree() method which will in void context print out the treap in ascii form, or in non-void return a string representation of it. For instance the $default hash can be viewed by

tied(%$default)->dump_tree;

Which outputs

A=500- +--------------------+ -D=600- +-------------+------+ B=674- E=658- +------+ +------+ C=970 F=957

Review the code and youll find a few other ways to look at things too.

Iteration

You can iterate through the nodes efficiently by getting a node from the treap then calling succ or pred on the node returned as required. The nodes have the methods key, weight, and value, which can be used. It is not correct to change key or weight directly, but value can modified as you desire. This means that you can have multiple iterators on the same data structure simultaneously without them interfering with each other. The treap object however, in order to meet the requirementes of the Tie::Hash interface only allows one iterator.

Conclusion

I really aught to take the time to fully document this module and upload it to CPAN. For now, I hope the monastery finds it interesting... And can offer some constructive criticism for its improvement.

Code follows in a reply, as this node itself has grown quite large. :-)

[1] I'll follow these up with links as I can find them again.


---
demerphq

<Elian> And I do take a kind of perverse pleasure in having an OO assembly language...

Replies are listed 'Best First'.
Algorithm::Treap (code)
by demerphq (Chancellor) on Sep 07, 2003 at 15:21 UTC

    This is the code for Algorithm::Treap. If you run it as perl treap.pm then it will self test/demonstrate. Also included is a test snippet that I have been using.

    This should be installed as ./Algorithm/Treap.pm somewhere reachable by @INC. (I set PERL5LIB to a development module tree for stuff I work on, and this is where this module lives. Warning: Large.

    Test script for the module. This should be runnable if the Algorithm treap is correctly located in @INC.

    :-)


    ---
    demerphq

    <Elian> And I do take a kind of perverse pleasure in having an OO assembly language...
      Uh oh (what's going on?):
      perl treap.pm
      Subroutine import redefined at Treap.pm line 67.
      Subroutine heap_lt redefined at Treap.pm line 127.
      Subroutine key_lt redefined at Treap.pm line 128.
      Subroutine key_gt redefined at Treap.pm line 129.
      Subroutine new redefined at Treap.pm line 131.
      Subroutine find_path_to_node redefined at Treap.pm line 184.
      Subroutine find_node redefined at Treap.pm line 209.
      Subroutine _shift_up redefined at Treap.pm line 241.
      Subroutine _shift_down redefined at Treap.pm line 316.
      Subroutine Store redefined at Treap.pm line 410.
      Subroutine Delete redefined at Treap.pm line 484.
      Subroutine Exists redefined at Treap.pm line 517.
      Subroutine Clear redefined at Treap.pm line 526.
      Subroutine Firstkey redefined at Treap.pm line 551.
      Subroutine Nextkey redefined at Treap.pm line 565.
      Subroutine Fetch redefined at Treap.pm line 579.
      Subroutine FetchUser redefined at Treap.pm line 586.
      Subroutine DESTROY redefined at Treap.pm line 598.
      Subroutine count redefined at Treap.pm line 611.
      Subroutine left redefined at Treap.pm line 614.
      Subroutine right redefined at Treap.pm line 617.
      Subroutine root redefined at Treap.pm line 620.
      Subroutine _sub_as_list redefined at Treap.pm line 622.
      Subroutine breadth_first redefined at Treap.pm line 631.
      Subroutine in_order redefined at Treap.pm line 676.
      Subroutine rev_order redefined at Treap.pm line 690.
      Subroutine heap_order redefined at Treap.pm line 705.
      Subroutine _heap_order_recurse redefined at Treap.pm line 715.
      Subroutine top redefined at Treap.pm line 764.
      Subroutine extract_top redefined at Treap.pm line 774.
      Subroutine print_order redefined at Treap.pm line 788.
      Subroutine __dump redefined at Treap.pm line 810.
      Subroutine dump_vert redefined at Treap.pm line 817.
      Subroutine __center redefined at Treap.pm line 821.
      Subroutine dump_tree redefined at Treap.pm line 840.
      Subroutine Tied_Hash redefined at Treap.pm line 913.
      Subroutine FETCH redefined at treap.pm line 924.
      Subroutine STORE redefined at treap.pm line 929.
      function 'new' already defined in package Treap::_Node at treap.pm line 950
      BEGIN failed--compilation aborted at treap.pm line 958.
      

        Uh oh (what's going on?):

        I dont know. This doesnt make sense to me. Obviously it doesnt happen here, and at least one other person has had the code pass all tests. Can you double check that you copied the code correctly?


        ---
        demerphq

        <Elian> And I do take a kind of perverse pleasure in having an OO assembly language...
Re: Algorithm::Treap
by BrowserUk (Patriarch) on Sep 08, 2003 at 01:16 UTC

    One question and one suggestion.

    The question: Why would I use a Treap and what benefits (and costs) would result as compared to using a normal hash for that (those) same applications?

    The Suggestion: If you get around to modifying the interface to take a single custom comparator rather than the two as now (which I think makes a lot of sense), then you might also consider making the standard alpha comparator (cmp) a settable option in place of the default numeric comparator. That is to say, have a new()/tie() time option that can be set to indicate that the caller wants an alpha-sort order used rather than a numeric and then use cmp rather than <=> internally to the module rather than requiring the user to supply a comparator function that does this.

    The main reason is that callbacks can be expensive. The best demonstration of this is the GRT. Even though performing a numeric sort using a GRT requires stringifying the numeric (part of the) values to be sorted, into numerically orderable strings (ie. fixed width with leading zeros), this pre-sort overhead is more than compensated for by performance gained by avoiding the need to callback into user code for the comparisons during the sort. I think that the overhead of an if or trinary conditional in the module code to determine which operator to use when inserting and/or balancing (is that the right term with a Treap?) would be easily offset by the benefit of not invoking a callback for the common cases of a 'trivial comparator', ie. numeric or alpha.

    As an aside. I've long wished that perl had a numeric sort option built in (would one extra built-in called sortn be any great burden?), so that many cases where sorting numerically currently requires the overhead of a trivial callback or user block, or resorting to the GRT (or tyes recently posted variation on the theme) would become unnecessary.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

      As an aside. I've long wished that perl had a numeric sort option built in (would one extra built-in called sortn be any great burden?), so that many cases where sorting numerically currently requires the overhead of a trivial callback or user block ...
      Not true. Perl does optimize some trivial sort code blocks, so if it sees something that looks like
      @sorted = sort { $a <=> $b } @list;
      it will turn it into a native numerical sort. Benchmarks will show that there is no significant speed difference between native sorting as string, and sorting numerically with this block: there simply is no callback overhead.

      I don't know the exact list, but it looks like at least normal and reversed numerical and string comparisons are in the list of blocks that get optimized away.

      Here are some benchmarks. See how close the results are, numerical sorting is typically even faster than string sortings due to a faster comparison operator. reversed_string and native are even as good as equally fast!

      For comparison I've also added a "messed up" callback block which doesn't use one of the idiomatic callback blocks, and doesn't get optimized away. See how efficiency falls back to one fifth of the optimized numerical sort, from 375 iterations / second, to a mere 75, and a third of the built-in native sort. The latter is the result you were expecting for the general case.

      #!/usr/bin/perl -w use strict; use Benchmark; my(@sorted, @unsorted); push @unsorted, rand 10000 for 1 .. 1000; timethese(-3, { native => sub { @sorted = sort @unsorted }, string => sub { @sorted = sort { $a cmp $b } @unsorted }, reversed_string => sub { @sorted = sort { $b cmp $a } @unsorted }, numerical => sub { @sorted = sort { $a <=> $b } @unsorted }, reversed_numerical => sub { @sorted = sort { $b <=> $a } @unsorted + }, messed_up=> sub { @sorted = sort { my $cmp = $a <=> $b; $cmp } @un +sorted }, });
      Results with a Perl 5.6.1:
      Benchmark: running messed_up, native, numerical, reversed_numerical, r +eversed_string, string, each for at least 3 CPU seconds... messed_up: 3 wallclock secs ( 3.29 usr + 0.00 sys = 3.29 CPU) @ 75 +.99/s (n=250) native: 3 wallclock secs ( 3.13 usr + 0.00 sys = 3.13 CPU) @ 24 +0.89/s (n=754) numerical: 3 wallclock secs ( 3.18 usr + 0.00 sys = 3.18 CPU) @ 37 +5.16/s (n=1193) reversed_numerical: 2 wallclock secs ( 3.13 usr + 0.00 sys = 3.13 C +PU) @ 379.87/s (n=1189) reversed_string: 4 wallclock secs ( 3.62 usr + 0.00 sys = 3.62 CPU) + @ 239.78/s (n=868) string: 3 wallclock secs ( 3.24 usr + 0.00 sys = 3.24 CPU) @ 22 +6.54/s (n=734)
        $ perl -MO=Deparse -e 'my @a = 0 .. 10; sort @a;' my(@a) = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10); sort @a; $ perl -MO=Deparse -e 'my @a = 0 .. 10; sort { $a cmp $b } @a;' my(@a) = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10); sort @a;

        So sort { $a cmp $b } @array is optimized away.

        ----
        I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
        -- Schemer

        Note: All code is untested, unless otherwise stated

Re: Algorithm::Treap
by deliria (Chaplain) on Sep 08, 2003 at 21:40 UTC
    While these might not be the links your originally had, they might be usefull all the same:

    Section 2 of this paper deals with treaps and was written by Aragon and Seidel.

    This one gives a slight graphical overview.

    Hth,
    Deliria