http://www.perlmonks.org?node_id=153259

A crash course on trees

There are many theoretical things to say about trees. You are not going to hear them from me though. I will spend some words to explain the basics, from a pure utilitarian point of view. For a more complete view on trees, see Sean Burke's article included in HTML::Tree, available on CPAN.

Tree types

When we talk about trees, we often mix two concepts both defined by the same word. A search tree, or binary tree is a structure having two links per node, and it is used to sort items, store them and retrieve them efficiently. Then we have B-Trees, having more than two links per node, commonly used to store information (usually indexes) bigger than the available memory. They use an algorithm to retrieve items from a disk and put them into the main memory on demand. Both binary trees and B-trees share the concept of a calculated insertion. A new node is inserted in the appropriate part of the tree depending on its value. These trees are indexing black boxes that accept a new value and store it following some pre-determined criteria and retrieve it quickly when required.
But we also have trees that don't insert nodes by their internal engine rules, and allow you to create complex structures to suit your needs. Trees of this sort don't insert nodes in the most appropriate spot, don't keep your nodes sorted, actually they don't know anything about sorting nodes. However, they keep your nodes ordered. What's the difference between sorted and ordered? Sorted means that, given a rule, each node will end up in a determined logical position, every time we insert it into the tree. For example, given the rule "alphabetical sort", a node with key "c" will be placed between two nodes with keys "b" and "d". The physical position may change, but the logical result will always be the same. Ordered, on the other hand, could be sorted, but not necessarily. We insert the nodes in the order we want, and the tree structure will keep that order, rather than trying to organize our nodes according to a given rule.
To understand the implication of what I am saying, let's compare the features offered by some different data structures.

Data structures comparison

+-------+--------+----------+ | order | insert | retrieve | +------------+-------+--------+----------+ |array | yes | slow | fast | |hash | no | fast | fast | |linked list | yes | fast | slow | |binary tree | yes | fast | fast | |DAG tree | yes | fast | slow | +------------+-------+--------+----------+
Using a sorted array, we can retrieve elements quite fast, by means of a binary search. However, inserting a new element in a large sorted array is an expensive operation.
Using a hash, both inserting and retrieving are quite fast. But the elements can't be kept ordered.
A linked list keeps a given order and inserts faster than arrays, but searching is sequential, and can be as slow as looking to all the elements.
A binary tree keeps the given order, allows fast insertions and fast retrieving.
A DAG 1 tree, finally, keeps the given order, allows for fast insertion, even though it doesn't have any mechanism for fast retrieval. Why? What's the difference? The fast retrieval in binary trees is due to the insertion mechanism, which inserts the nodes according to sorting rules. Therefore, an inverse rule allows the quick retrieval. DAG trees, instead, give you freedom of insertion and you pay such freedom with less speed in searching.

1 DAG stands for Directed Acyclic Graph, but I stated before that I wasn't going to tell anything theoretical here. So I am afraid that you should resort to the recommended reading.

Tree nodes

A tree is a funny thing, bearing semantics from comparison to different pieces of reality. The name "tree" calls for a botanical symbology, which breaks as soon as we realize that Information Technology trees have their roots at the top. However, following the same idea, we can talk about branches and leaves to describe the forking of sub elements in the tree.
Often, tree elements are described as having a parent (the node above them) and children (the nodes below them). In Tree::DAG_Node, the parent is called mother, and the nodes below are called appropriately daughters.
Whatever we call them, tree elements have a link to two or more nodes below them and a link to another node above them. Such link could be unidirectional (from parent to child only) or bidirectional (so the child can point to its parent.)
The lack of bi-directional link may be seen as a terrible design blunder, since we can go from the top node down to the lowest one, but we can't go back. Everything will become clear when we remember that trees are recursive structures and we access them exploiting this quality of theirs.

Traversing a tree

The most common way of accessing a tree is by recursively accessing the subnodes of the starting node. For efficiency sake, there are methods to avoid recursion, by means of more fortified data structures, but the principle remains.
When we access all the nodes of a tree, we say that we traverse a tree. Depending on what we do while traversing, our walk can be called in-order, pre-order or post-order traversal. The understanding of these concepts is crucial to be able to do anything meningful to a tree.
In-order traversal is when we access the nodes in sort order, and this usually applies to binary trees and B-trees.
The pre-order traversal means accessing a node, doing something and then accessing the nodes below.
Post-order is when we access a node, recurse to the subnodes and then do something.
company / | \ sales R&D pers / \ / \ net store | | research development
pre-order (action: print the node name) company, sales, net, store, R&D, research, development, pers. Useful to print the company organigram, from top to bottom. Had we udes post-order, we would have printed it upside-down.
post-order (action: sum the number of employees) net, store, sales, research, development, R&D, pers, company. Since the action is post-order, we can sum the employees in each node to the ones in the parent. Had we done this in pre-order, we would have given to "company" the employees from "sales" before receiving the quantity from "net" and "store" (i.e. when it was still zero).

What is a Tree::DAG_Node?

Tree::DAG_Node is a class to build generic trees. This is opposed to some specialized trees, such as HTML::trees or XML::Trees. Strictly speaking, as the name suggests, DAG_Node is not a tree, but a node. However, since trees are recursive structures, each node contains a subtree. The DAG_Node class has all the methods to deal with individual nodes or with the whole tree. As my first programming book said, and the module author wants to stress, a program is made of algorithms plus data structures (N. Wirth, 1978). When these two elements are combined together in an object, the result is a powerful engine.
Coming back to our DAG_Node, it is a quite rich module, having a large collection of methods to manipulate both nodes and trees. There are no specialized methods related to the node data, such as the ones of HTML::Element. This lack of specialization doesn't prevent DAG_Node from being a powerful tool even without any addition.

Using Tree::DAG_Node

Using the base class directly

Let's see some basic usage of the class.
#!/usr/bin/perl -w use strict; use Tree::DAG_Node; my $pm = Tree::DAG_Node->new; $pm->name('PerlMonks');
Here we create a node, the root of our tree. The costructor doesn't need any parameter. You can pass some options through a hash reference although it doesn't currently do anything. It is there for the descending classes benefit. After we create a node, we can assign a name. No need to do it, although for this example we are using names as the only visible attribute of each node.
my $tutorials = Tree::DAG_Node->new; $tutorials->name('tutorials'); $tutorials->new_daughter->name('basics'); $tutorials->new_daughter->name('syntax'); $tutorials->new_daughter->name('references'); $tutorials->new_daughter->name('data');
Create a new node, with the same procedure, and then construct some sub nodes (daughters). The new_daughter method is a constructor that creates an object of the same class as the caller and adds it to the list of its subnodes. Since the result of new_daughter() is an object, we can apply the name() method to it. After the above statements, $tutorials is a new tree.
$pm->add_daughter($tutorials);
Now $tutorials becomes a daughter of $pm, retaining all its daughters, which become $pm's descendants.
my $reviews = Tree::DAG_Node->new; $reviews->name('reviews'); $reviews->new_daughter->name('books'); $reviews->new_daughter->name('modules'); my $SOPW = Tree::DAG_Node->new; $SOPW->name('SOPW'); $pm->add_daughter($reviews, $SOPW);
Here we create two more nodes, $reviews with two daughters, and $SOPW, childless. Using add_daughter() we can add them to the root at once. This method accepts either a single node or a list of nodes.
print map "$_\n", @{$pm->draw_ascii_tree};
Prints a ascii art representation of the tree, using the names as identifiers. It might become quite large on complex trees.
$pm->new_daughter->name('Meditations');
One more daughter, using the new_daughter() constructor.
print $pm->dump_names;
This method prints a vertical representation of the tree, improving readability for trees with long lists of daughters. Check the resulting output. The whole representation of PerlMonks structure wouldn't fit in this page.
| <PerlMonks> /---------------------------+-----------\ | | | <tutorials> <reviews> <SOPW> /--------+----------+---------\ /--------\ | | | | | | <basics> <syntax> <references> <data> <books> <modules> 'PerlMonks' 'tutorials' 'basics' 'syntax' 'references' 'data' 'reviews' 'books' 'modules' 'SOPW' 'Meditations'

Traversing a tree

The main method to traverse a tree is walk_down(\%options). This method touches all the nodes in the tree, starting at the calling node, and calls a sub for each one. The sub is passed as a value in the \%options parameter. Depending on the key name, it will be called before ('callback' = pre-order) or after ('callbackback' = post-order) accessing the sub-nodes. You should provide at least one of them. Anything else you define in the %options hash is passed to the sub as second parameter. Especially useful is the "_depth" parameter, which states the starting depth for the tree and it is incremented at each level.
$pm->walk_down({ callback => sub { my $node = shift; print " " x $_[0]->{_depth}; print "(*) " if $node->name eq $_[0]->{treename}; print $node->name, "\n" }, _depth => 0, treename => 'PerlMonks' });
In this call, we define an anonymous sub to print each node, with special treatment for the root. The sub is called for each node, with the node as first parameter and an anonymous hash containing '_depth' and 'treename' as second parameter.
(*) PerlMonks tutorials basics syntax references data reviews books modules SOPW Meditations
Changing the name of the sub key to 'callbackback', the result is a post-order traversal. For a more useful example of post-order, see Inheriting the base class below.
basics syntax references data tutorials books modules reviews SOPW Meditations (*) PerlMonks
One important caveat. The callback sub will be called as long as it returns a true value. A false value will stop the recursion. Useful when we are searching for something, and we want to avoid unnecessary walking down after the result has been achieved. Remember that a print statement returns always true. If you want to stop a printing walk down, you must return "0", an empty string or undef.
But we are not limited to using the walk_down() method. Writing our own traverse routine is quite easy.
sub traverse { my $node = shift; my $depth = scalar $node->ancestors || 0; # a pre-order traversal. First we do something ... print ".." x $depth, $node->name," ", $node->address, "\n"; # ... and then we recurse the subodes traverse($_) for $node->daughters; } PerlMonks 0 ..tutorials 0:0 ....basics 0:0:0 ....syntax 0:0:1 ....references 0:0:2 ....data 0:0:3 ..reviews 0:1 ....books 0:1:0 ....modules 0:1:1 ..SOPW 0:2 ..Meditations 0:3
See Search for nodes below for an explanation of the address() method.
If we want a post-order traversal in the above routine, then it's enough to move the print statement after the recursive call to traverse.

Node attributes

What we have seen so far is not extremely useful. Playing with the node name might be thrilling as long as we limit our programming efforts to drawing ascii trees and counting nodes. But in the real world we need to assign more content to our nodes. To achieve this goal, we could either inherit the base class, or use the attributes() method, which gives us access to a node field, where we can store anything we need. By default, the internal attributes field is set to an empty anonymous hash, but we can set it up to pretty much anything we like, as long as it's a reference.
$pm->attributes (['The', ['best', 'Perl'],['site']]); $tutorials->attributes ({ useful => 'yes', available => ['day','night'] }); $SOPW->attributes (\&check_if_strict); $reviews->attributes(Tree::DAG_Node->new); $pm->walk_down({callback=>sub{ print $_[0]->name," ", ref $_[0]->attributes,"\n"; }});
Of course, the only way to access these attributes is through the attributes() method itself. If we want to use an Object Oriented interface, we need to inherit from Tree::DAG_Node, or to store an object into attributes().

Walking around the tree

Traversing a tree is one of the most common tasks, but not the most flexible one, when we need to access specific nodes relative to others.
Tree::DAG_Node offers a large variety of node walking methods, to go from one node to its neighbours.
| <root> /---------------------------\ | | <a> <b> /---------------\ /---+---+---+---+---\ | | | | | | | | <1> <2> <p> <q> <r> <s> <t> <u> /-------\ /-------\ | | | | <w> <x> <y> <z> /---\ /---\ /---+---\ | | | | | | | <i> <j> <k> <l> <5> <6> <7>
Considering the above tree, starting with $root. Its daughters are accessed by calling $root->daughters. They are returned as a list of nodes.
my @daughters = $root->daughters; # <a> and <b> my @b_daughthers = $daughters[1]->daughters; # <b>'s daughters my $third = $b_daughthers[2]; # <r> my $ls = $third->left_sister; # <q> my $rs = $third->right_sister; # <s> my @left = $third->left_sisters; # <p> and <q> my @right = $third->right_sisters; # <s>, <t> <u> my $mama = $third->mother; # <b> my @ancestors = $third->ancestors; # <b> <root>
Down among the daughters, we can identify them by their position relative to a given node. Every node can access its parent by calling $node->mother. The complete list of the node ancestors is at our disposal by using the appropriate method. If we access the node named "7", for example, its ancestors will be the nodes named 'z', '2','a','root'.
If we want to do something to all the nodes in a subtree, without bothering about using walk_down(), we can use the descendants() method, which returns a list of all the nodes below the calling one. If we want to include the caller, then self_and_descendants() will give us the entire subtree.
Let's say that $node1 represents the node named '1'.
my @descnames = map {$_->name} $node1->descendants; # @descnames = qw(w i j x k l);
Check the reference documentation for more methods to move from node to node.

Inheriting the base class

To inherit from Tree::DAG_Node, we must remember that the class constructor accepts an optional parameter, an anonymous hash. This parameter is not used by the base class constructor, but it's there for the convenience of the descending classes. It means that if we pass parameters to the new() constructor through this anonymous hash, we can use the other constructors without modification. The new_daughter() and new_daughter_left constructors will call new(), passing the anonymous hash that they receive as an argument. The bottom line is, if we can live with the idea of passing parameters through a hash reference, we only need to modify one constructor, otherwise we must rewrite all three of them. In our example, we agree to the convention and we override the main constructor only.
#!/usr/bin/perl -w use strict; package CompanyTree; use Tree::DAG_Node; our @ISA=qw(Tree::DAG_Node); sub new { my $class = shift; my $options = shift; my $self = bless $class->SUPER::new(); $self->attributes($options); return $self; }
This new() constructor simply passes the $options hash reference to the internal attributes(). Actually, nothing prevents us from using a different reference, say an object instead of a hash, but it will do for now. It's worth noting that the author of the base class marks this construction as a bad thing. While I agree in principle, my Impatience took me to embracing this shortcut.
Next, we add three more methods, two being a simple interface to the object's internal data, while the third one adds some general functionality to the base class.
sub employees { my $node = shift; my $val = shift; $node->attributes->{employees} = $val if $val; return $node->attributes->{employees}; } sub budget { my $node = shift; my $val = shift; $node->attributes->{budget} = $val if $val; return $node->attributes->{budget}; } sub by_name { my ($self, $name) = @_; my @found =(); my $retvalue = wantarray ? 1 : 0; $self->walk_down({callback=>sub{ if ($_[0]->name eq $name) { push @found, $_[0]; return $retvalue; } 1}}); return wantarray? @found : @found ? $found[0] : undef; }
The by_name() method walks down the tree starting at the calling node, and returns a list of all the nodes matching the given name. In scalar context, only the first matching node is returned. If the name was not found, returns undef.

Finally, we prepare three mode methods that work specifically with this particular class, using the tree for calculations and showing results.
sub clear_totals { $_[0]->walk_down({ callback => sub { my $node = shift; if ($node->daughters) { $node->budget(0); $node->employees(0); } }}) } sub sum_up { $_[0]->walk_down({ callbackback=> sub { my $node = shift; return 1 unless $node->mother; $node->mother->attributes->{employees} += $node->attributes->{employees}; $node->mother->attributes->{budget} += $node->attributes->{budget}; }}); } sub print_wealth { $_[0]->walk_down({callback=> sub { my $node = shift; printf "%s%.7s\templ: %2d budget: %8d\n", " " x $_[0]->{_depth}, $node->name, $node->employees, $node->budget }, _depth => 0 }); }
clear_totals() sets to 0 the values for employees and budget in each node that has subnodes, thus preparing the way for the next method. sum_up() gets the total values for employees and budget from each subnode, recursively, using a post-order traversal. The total value pops up to the root by virtue of this engine. Eventually, print_wealth() shows the amount of employees and budget for each node and for the whole company.
Now, let's see an application of this new package.
package main; my $company = CompanyTree->new({employees=>0, budget=>0}); $company->name('company'); $company->new_daughter( {employees=>0,budget=>0})->name('sales'); $company->by_name('sales')->new_daughter( {employees=>6,budget=>25_000})->name('net'); $company->by_name('sales')->new_daughter( {employees=>8,budget=>65_000})->name('str'); $company->new_daughter( {employees=>4,budget=>10_000})->name('pers'); $company->new_daughter({employees=>0,budget=>0})->name('R&D'); $company->by_name('R&D')->new_daughter( {employees=>10,budget=>100_000})->name('res'); $company->by_name('R&D')->new_daughter( {employees=>15,budget=>90_000})->name('dev'); $company->clear_totals; $company->sum_up; $company->print_wealth; print map "$_\n", @{$company->draw_ascii_tree};
The new company is created using the derived class. Then, the new_daughter() constructor is used, with an anonymous hash as a parameter, that gets passed to the main constructor. Soon we have three departments, two of which have a few sub departments. We initialize to 0 the number of employees in the non-terminal nodes, to prepare for the summing up. The sum_up() method performs a post-order traversal, to collect the totals. Finally, a pre-order traversal with print_wealth() gives us a nice printout of the company strength.
company empl: 43 budget: 290000 sales empl: 14 budget: 90000 net empl: 6 budget: 25000 str empl: 8 budget: 65000 pers empl: 4 budget: 10000 R&D empl: 25 budget: 190000 res empl: 10 budget: 100000 dev empl: 15 budget: 90000
| <company> /--------+---------\ | | | <sales> <pers> <R&D> /-----\ /-----\ | | | | <net> <str> <res> <dev>

Searching for nodes

In one of the previous examples, we met the address() method. This function returns a string describing the position of a given node within the tree. "0" is the root. "0:0" is the first daughter, "0:1" the second and so on. This address is not a static attribute, but it is calculated on demand. One interesting feature of this method is that if we pass a valid address, it will return the corresponding node.
Thus, in the latest example, 'company' has address '0', 'sales' has '0:0', 'R&D' has '0:2' and 'dev' has '0:2:1'. The address for a given node is always the parent's address plus the position of that node within the parent's daughters' list. If we can figure out the address of a node, we can get the relative node without traversing the tree.
my $node = $root->address('0:2:1');
However, calling nodes by their address is not always convenient, since humans handle names more easily than numbers, and since the address can change if we insert nodes before the one we are looking for.
my $node = $root->address('0:2:1'); $node->mother->new_daughter_left; # now $node's address is '0:2:2'
In the previous section we have seen a sub to retrieve a node by name. It works fine as long as our name is unique. If not, we should build a more selective method that will find exactly what we need. Let's assume that all our nodes have attributes() set as a hash ref, and each one has a key holding a unique identifier. To search for this unique ID is not so hard.
sub by_attribute { my ($self, $key, $id) = @_; my $found = undef; $self->walk_down({callback=>sub{ if (ref $_[0]->attributes eq "HASH" && exists $_[0]->attributes->{$key} && $_[0]->attributes->{$key} eq $id) { $found = $_[0]; return 0; } 1}}); return $found; }
my $node = $root->by_attribute( 'ID', 'nutcracker');
This instruction will return the node with attributes containing { ID => 'nutcracker'}, or undef if not found. For more complex conditions, when your nodes contain objects, remember to check the type of attributes (using ref), to avoid run-time exceptions.

Modifying the tree

One of the advantages of trees over less flexible data structures is their ability of moving subtrees around with minumum effort. To show how Tree::DAG_Node achieves this goal, we'll see some of the moving routines available. Let's build a sample tree, first.
#!/usr/bin/perl -w use strict; use Tree::DAG_Node; my $root = Tree::DAG_Node->new; $root->name('root'); $root->new_daughter->name($_) for ('1'..'3'); my @names = qw(abc def ghi); my $count =0; for my $n ($root->daughters) { for (split //, $names[$count++]) { $n->new_daughter->name($_) } } print map "$_\n", @{$root->draw_ascii_tree};
| <root> /-----------+-----------\ | | | <1> <2> <3> /---+---\ /---+---\ /---+---\ | | | | | | | | | <a> <b> <c> <d> <e> <f> <g> <h> <i>
Our first task is to remove the node named '2' and assign its daughters to the root. We identify the node by its address, and then use the method replace_with_daughters(). The effect of this method is to unlink the node from its mother and to add all its daughter in its place, moving to the right all existing right nodes.
my $node = $root->address('0:1'); $node->replace_with_daughters; print map "$_\n", @{$root->draw_ascii_tree};
The result is that node '2' has disappeared, and in its place we have three new daughters for 'root'.
| <root> /-------+---+---+-------\ | | | | | <1> <d> <e> <f> <3> /---+---\ /---+---\ | | | | | | <a> <b> <c> <g> <h> <i>
The next task is to move the subtree starting at node '3' under node named 'e'.
$node = $root->address('0:4'); my $dest = $root->address('0:2'); $dest->add_daughter($node); print map "$_\n", @{$root->draw_ascii_tree};
Notice that the address of the node to move, that was '0:2' in the original tree, is now '0:4', due to the insertion of new nodes on its left. Calling add_daughter() with $node as an argument causes the link $node->mother to be cut, and a new one to be created in its stead. If we just want to get rid of that subtree, the method $node->unlink_from_mother will do. We should just take care of the memory still occupied by this subtree, and call $node->delete_tree.
| <root> /-------+-------+-------\ | | | | <1> <d> <e> <f> /---+---\ | | | | <3> <a> <b> <c> /---+---\ | | | <g> <h> <i>

Conclusion

So many things, so little space (and time!)
I could go on for several pages, without exhausting the subject1, which is indeed rich and fascinating.
The purpose of this tutorial, however, is only to give you a general view of the module. Once you are confident with its main issues, you will find the reference manual easier to read.
One last recommendation, before I leave you. Trees are memory consuming, and unlike other data structures, they are not cleaned up by Perl's garbage collector (GC). This is due to the many references held in each node, which prevent the GC from sweeping away your tree once it goes out of scope. Therefore, if you don't need the tree any longer, call the delete_tree() method, to let the GC do its job.

1 But I might have exhausted the reader!

General information - Recommended reading

  1. The module documentation
    In the module itself. Mandatory, I would say. There are many juicy methods waiting for you to find them. This tutorial was only a glimpse. I skipped many details that you would like to explore. Pay particular attention to the "See also" section, where you'll find valuable recommendations.
  2. Mastering Algorithms with Perl
    by Jon Orwant, Jarkko Hietanieni, John MacDonald (www.oreilly.com) For the theory behind the babbling in this article, MAwP is a must.
  3. Object Oriented Perl
    by Damian Conway (www.manning.com/conway). Is the most authoritative work on Perl object oriented methodology. Tree theory and implementation is discussed passim with interesting examples.


update
Amended the post-order traversal, that I reported from the wrong example. Thanks ChemBoy.
 _  _ _  _  
(_|| | |(_|><
 _|   

Replies are listed 'Best First'.
Re: Introduction to Tree::DAG_Node
by ChemBoy (Priest) on Mar 21, 2002 at 19:51 UTC

    Excellent node! But a minor quibble: my memories from my intro data structures course are not the strongest, but they suggest that your definition of a post-order traversal is slightly flawed: the Left-Right order of the leaf nodes should be preserved (that is, net should still precede store, and research precede development).

    Not a major issue, of course, for an org chart--but it could be unfortunate if that happened to an expression tree. :-) (thanks to maverick for the link, and for confirming that I'm not totally nuts.)



    If God had meant us to fly, he would *never* have given us the railroads.
        --Michael Flanders

      You are right, of course. Thanks for pointing it out.
      I absentmindedly copied the sequence from the result of an early script where the nodes were in that order. Using the same structure that we have in the company example, the post-order traversal is correct. Just change "callback" into "callbackback" and the printout will be:
      net empl: 6 budget: 25000 str empl: 8 budget: 65000 sales empl: 14 budget: 90000 pers empl: 4 budget: 10000 res empl: 10 budget: 100000 dev empl: 15 budget: 90000 R&D empl: 25 budget: 190000 company empl: 43 budget: 290000 | <company> /--------+---------\ | | | <sales> <pers> <R&D> /-----\ /-----\ | | | | <net> <str> <res> <dev>
       _  _ _  _  
      (_|| | |(_|><
       _|