Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Trees in SQL, adjacency lists and sorting.

by BUU (Prior)
on Jan 05, 2006 at 20:59 UTC ( [id://521340]=perlquestion: print w/replies, xml ) Need Help??

BUU has asked for the wisdom of the Perl Monks concerning the following question:

Recently I found myself wishing to store a tree in a database. A secondary goal was to do so very, very fast. (Why? Because). After much pondering and head scratching I came up with what is basically an adjacency list. Here is an example:

Give a tree that looks like:
A / \ B C / \ \ D E F
I store that in a table that looks like this:
node_id | ancestor | distance ----------------------------- B | A | 1 C | A | 1 D | B | 1 D | A | 2 E | B | 1 E | A | 2 F | C | 1 F | A | 2
Note that nodes D, E and F have multiple rows in the table.

The advantage of this lay out is that it is very, very fast to select from. I can get any number of children with a single query. For example, all of the children of B would be: SELECT * FROM table WHERE ancestor = 'B' which returns "D" and "E". The downside is that it takes more room on disk. However, some basic testing with SQLite showed me that I could fit some 5 million of these rows in a table as small as 70 megs, so I'm not overly worried about that.

As I've described it, my table works great. Selecting is very fast, I can find any level I want easily, it's fast to add new ones and so on. My single problem is that I figure out how to order my nodes once I've selected them. Give the above tree/table, my ideal order looks like: A, B, D, E, C, F.

So I get to my question. How can I sort my nodes so that they come out in the order I want? I'd definitely prefer something that would be pure sql, mostly for speed purposes. I'm also open to suggestions of what I could add to my table to enable easier/faster ordering, or perhaps a better way to store this.

I've considered several other methods of storing trees in my database. The first method I considered was simply having each node store a parent node. This is space efficient but very slow, a simple 10 node tree could take up to 10 queries to return all the nodes. The second thought was using Celko's left/right number tree, which is fast to select from and fairly easy to utilize, but when you add a node it requires updating every other node in the tree, I thought this would be inefficient for large trees.

Update: You can insert at any point in the tree at any time or move any node to any other spot.

Replies are listed 'Best First'.
Re: Trees in SQL, adjacency lists and sorting.
by Fletch (Bishop) on Jan 05, 2006 at 21:05 UTC
Re: Trees in SQL, adjacency lists and sorting.
by dragonchild (Archbishop) on Jan 06, 2006 at 02:42 UTC
    This is, yet again, another node displaying a misunderstanding of relational theory. These things are sets, which are inherently unordered. Ordering is provided as an additional feature, like triggers. It's not part of the relational theory and, in fact, violates relational theory. That said ...
    1. I'm assuming you're going to have a row that says "A | NULL | 0". You need to represent the root somewhere. Implicit representation is not going to cut it.
    2. This is going to be an absolute nightmare to update. For example, if you remove a node and promote one of the children (a common operation in red-black trees, for example), You're going to have to recalculate in your application every single depth. Furthermore, if your application code is wrong in any way, the database isn't going to help you catch your error, particularly in the depth area.
    3. Trees are ordered with respect to siblings. In your case, you're assuming that B comes before C and D comes before E. How are you planning on capturing that in your schema? You're storing depth (which is easily derived and doesn't really help improving select speed), but you're not storing sibling sort order (which isn't derivable because you cannot depend on the order the RDBMS will store the rows).
    4. Assuming you have all of that, let's see what your sort would look like in Perl:
      my @nodes = map { $_->[0] } sort { # What goes here?? } map { [ $_->{node_id}, $_->{depth}, $_->{sibling_order} ] } $sth->fetchall_hashref();

      If you wanted a level-order traversal, you'd use:

      $a->{depth} <=> $b->{depth} || $a->{sibling_order} <=> $b->{sibling_order}
      Except, that gives you A-B-C-D-E-F-G, which isn't what you want.

      Take a look at Tree, particularly the traversal code I wrote in order to provide traversals as an iterator into the tree versus an array (which is what you're trying to do in an ORDER BY clause). The code is non-trivial, particularly in that it requires maintaining a stack to keep track of what comes next.

    In short, you cannot do it with ANSI SQL-1992, SQL-1999, or even SQL-2003. You would have to use a vendor-specific extension that allows user-defined comparators for sorting. I recommend PostgreSQL as MySQL, Oracle, Sybase, and SQL*Server all don't provide this. SQLite definitely doesn't provide this.


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      So what's the Right Way to store a tree? You seem to have fairly strong opinions on the matter and I'm always willing to learn. As for the every single depth bit, if I delete a node I can update all its children in a single query.
        Data structures exist to organize data in ways that are useful for what we're going to do with it. For example, if all you're doing is storing a HoHoHoHoH that you want to walk but never search, then a database is an extremely poor solution and something like DBM::Deep is preferable. If you have a bunch of items that are on the same level and you want to do random-access searches based on arbitrary criteria, then a relational database is much better than DBM::Deep.

        How does this relate to trees? You have a perfectly fine way to store a tree in a database . . . depending on what you're trying to do with it. You're using a variant of the self-referential method which stores every single node-to-node relationship, not just the immediate ones. It looks like it would be extremely fast at finding all descendents and all ancestors of a given node, doing each one in one query. However, it will be no better than the standard self-referential method in finding all siblings of a given node.

        It sounds like you really want to use the nested-sets method. It's where you don't store a reference to your parent node, but instead store a left and right bound which indicates where in the nested sets you stand. To find all your ancestors, you find all nodes where "my_id BETWEEN left AND right". To find all your descendents, you find all nodes where "id BETWEEN my_left AND my_right". It's no better than the standard self-referential in finding all siblings.

        Both methods (yours and self-referential), it sounds, have near-equal benefits when it comes to speeding up searches. Yours is slightly better when doing inserts and updates at the cost of significantly more disk, but worse when doing deletes. That may be an acceptable tradeoff depending on your intended usage. For example, if your spec says you will never delete a node except in rare situations, this is probably a good direction to look at.

        As for the every single depth bit, if I delete a node I can update all its children in a single query.

        I don't think so.

        Personally, I'm not sold on the depth thing. I think that storing every node-to-node relationship is definitely a solid strategy. I don't know what having the depth gets you. Personally, I'd use this as an adjunct table to the standard self-referential. Yes, you'd have duplicated information between the main table and the relationships table (the direct parent-child relationship would be duplicated), but if that's the optimization I need, then that's fine because the performance cost of the loss of normalization is made up by the speed gains I get with the caching.


        My criteria for good software:
        1. Does it work?
        2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Trees in SQL, adjacency lists and sorting.
by suaveant (Parson) on Jan 05, 2006 at 21:21 UTC
    One way I have done it in the past is using coded strings.... used it for threaded discussions... so you have say a 3 character key... top level is
    AAA AAB ... 000
    children of AAA are
    AAAAAA AAAAAB ... AAA000
    etc... it limits how wide you can go and how deep, but it works well for querying...

                    - Ant
                    - Some of my best work - (1 2 3)

      It's an interesting thought and I tried doing something similar to that (but with slashes) and I couldn't make it work. Do you have any specific suggestions on how I should implement it given my current structure?
        Well.. obviously you have a large unique field for the string....

        I did the following... I had a unique varchar(255) that held the tree string and used 2 digit base 36 ids... obviously you could extend the character set, but I figured 1296 children per leaf was enough. If you only need binary then one char is fine, 0-1 or something... you could probably encode something, but that may make queries difficult... obviously with 2 char codes you can only get 127 levels in a tree with a 255 char field.

        I used a couple of self written functions to turn the codes to and from numbers:

        sub de36 { my($a,$b) = (uc substr($_[0],-2,1),uc substr($_[0],-1,1)); $CHARS{$a}*36+$CHARS{$b}; } sub en36 { $CHARS[$_[0]/36].$CHARS[$_[0]%36]; }
        To insert I would would first grab the max id within the group I wanted, something like
        SELECT MAX(thread) FROM messages WHERE thread LIKE "${parent}__"
        matching those threads that were in the same parent....

        you can get all immediate children with stuff like

        SELECT * FROM messages WHERE thread LIKE "${parent}__"
        and you can get a history with something like
        SELECT * FROM messages WHERE thread = LEFT("$thread",LENGTH($thread))
        sorting is, in this case, super easy.

        I am using MySQL syntax but it should be clear enough...

        You'll have to work out how to make it binary yourself, if you need it... not terribly difficult, select and compare, then add accordingly... table locking could probably help, too. :)

                        - Ant
                        - Some of my best work - (1 2 3)

      This style of notation can also be used with polyheirarchies (items w/ more than one parent), if you leave a couple of reserved characters for the notation. Here's an example that my teacher gave us for a class on thesaurus design:

      So, '10th grade optics' would be 'A1.1.1B1.1'. We use a delimitor between items, so we aren't restricted to subsets that each have a unique character in them, allowing for wider branching. (depth is still a problem when dealing with fixed line lengths)

      The problem is, it's fine for humans to understand, but a little messy to sort on in SQL. You can get around this by adding another delimiter between facets, and ending every notation w/ a null heirarchy: ';A1.1.1.;B1.1.'. We can then get all items relating to 10th grade physics w/ LIKE '%;A1.1.%;B1.1.%', without fear of getting 'A1.10AB1.1', and still getting items that may have even more facets, eg a movie for 10th grade physics ';A1.1;B1.1;C1'

      The only requirement is that you keep each of the facets in order. (what order doesn't matter, so long as you're consistent)

      If you'd like a more complex example:

Re: Trees in SQL, adjacency lists and sorting.
by demerphq (Chancellor) on Jan 06, 2006 at 09:03 UTC

    Seems to me that this strategy is only really suitable for small trees. For instance to hold a list of N nodes in your tree you need (N+1)*(N/2) records. This means that for even a moderately sized tree you will be dealing with a lot of records. And the sparser the tree the less efficient it will be.

    Also, iirc celko algorithm doesn't necessarily require updating all the nodes. And the updates required are simple and efficient. IMO, given that the latter would only hold N records for N nodes Id bet updating it would be more efficient that working with your structure due to scaling.

    Anyway, you haven't outlined what operations you need to perform on your trees and how often they occur. Without that info its hard to advise you beyond that your proposed algorithm looks unsatisfactory due to scaling.

    Also have you considered how much work needs to be done if you reparent a node? I should think such an operation would be woefully painful given your current structure.

    update:Corrections and I removed some needless comments. Shouldnt post before you drink coffee.

    ---
    $world=~s/war/peace/g

Re: Trees in SQL, adjacency lists and sorting.
by blahblahblah (Priest) on Jan 06, 2006 at 01:56 UTC
    Here's a link to another node about storing trees in SQL: Tree Structure and Db. It has lots of replies which link to articles about trees and sql. (Sorry I can't describe it more specifically -- I bookmarked it a while ago and still haven't gone back to read it.)
Re: Trees in SQL, adjacency lists and sorting.
by renodino (Curate) on Jan 06, 2006 at 03:50 UTC
      I cannot recommend Joe Celkos books highly enough. This book should give you what you need. I used the nested set method for making an FTP server with DB backend and it worked a treat.
Re: Trees in SQL, adjacency lists and sorting.
by traveler (Parson) on Jan 05, 2006 at 21:31 UTC
    "My single problem is that I figure out how to order my nodes once I've selected them. Give the above tree/table, my ideal order looks like: A, B, D, E, C, F"

    So if you do  SELECT * FROM table WHERE ancestor = 'A' you want the ordered list A, B, D, E, C, F. That sounds like an preorder traversal of the tree. Doing that in pure SQL is a bit byond me, but you mnight google for "sql preorder tree" to find something.

    HTH

Re: Trees in SQL, adjacency lists and sorting.
by DungeonKeeper (Novice) on Jan 06, 2006 at 10:33 UTC
    The standard approach, despite the temptation to just use one table with an auto-foreign key, is:
    CREATE TABLE node (node_name VARCHAR(32) NOT NULL) CREATE TABLE node_node ( parent_fk VARCHAR(32) NOT NULL, child_fk VARC +HAR(32) NOT NULL )
    The next bit is DBMS specific but you also need to define the primary keys: (node.node_name) and (node_node.parent_fk, node_node.child_fk) and the foreign key constraints: node.node_name -> node_node.parent_fk and node.node_name -> node_node.child_fk.

    The advantage of this over the single table solution is that it is normalised and can be extended into a multi-type, multi-tree model without having to keep adding foreign keys to the master table.

      The problem with this is that you can't do a single select query to get an entire branch of the tree in the correct order. You have to write something like a recursive function and you end up with one query per level of the tree.

      Someone above mentioned "nested set" - this is what I use and it's great for quick lookups, at the cost of expensive (relatively) inserts. I've never had a database big enough where inserts have been a problem.

        Thats not really true. Assuming your nodes can be ordered, and assuming you store a root id, you grab everything from the tree on or later to the node you are interested in. Then filter out the noncompliant nodes when you read them. One query sufficies then. Admittedly at the cost of some client side processing, but if that isnt a problem then it should be fine.

        For the record this exactly what is doe in Recently Added Threads. We grab all the relevent nodes in two queries normally. (Roots then children as our roots arent the same type as the children).

        ---
        $world=~s/war/peace/g

        You certainly can't do that with the single table solution either. Such a branch is iterative and therefore cannot be expressed as an RSE whichever RDBMS solution you choose.

        Everything but the troll

Re: Trees in SQL, adjacency lists and sorting.
by talexb (Chancellor) on Jan 06, 2006 at 13:54 UTC

    Interesting node -- but I see a problem brewing with the rows that have 'distance to the next node' > 1. That's data that can be inferred, and if it gets entered wrong, or if someone monkeys with the database, the whole thing comes crashing down.

    However, I think I understand what you're doing -- you're trying to pre-cook as much as possible for performance. Thus I'd probably have two tables: the first would have just nearest neighbours, and the second would have all of the inferred data. Then you could use a UNION to get the same performance as your original design.

    Just some idle thoughts over my morning coffee.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: Trees in SQL, adjacency lists and sorting.
by Anonymous Monk on Jan 06, 2006 at 22:46 UTC
    If you want ordered traversal, I'd add an 'order' column with sibling order, analogous to the depth column. Add a row ('a','a',0,0). select node_id from tree where ancestor='a' order by depth,order;

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://521340]
Approved by Limbic~Region
Front-paged by broquaint
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others meditating upon the Monastery: (4)
As of 2024-04-24 12:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found