Trees in SQL, adjacency lists and sorting.

BUU has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Trees in SQL, adjacency lists and sorting. by Fletch (Bishop) on Jan 05, 2006 at 21:05 UTC
"A Look at SQL Trees" might be of interest; and there's more in the next month's article. Not that it's really Perl related, though . . . :)	[reply]
Re: Trees in SQL, adjacency lists and sorting. by dragonchild (Archbishop) on Jan 06, 2006 at 02:42 UTC
This is, yet again, another node displaying a misunderstanding of relational theory. These things are sets, which are inherently unordered. Ordering is provided as an additional feature, like triggers. It's not part of the relational theory and, in fact, violates relational theory. That said ... I'm assuming you're going to have a row that says "A \| NULL \| 0". You need to represent the root somewhere. Implicit representation is not going to cut it. This is going to be an absolute nightmare to update. For example, if you remove a node and promote one of the children (a common operation in red-black trees, for example), You're going to have to recalculate in your application every single depth. Furthermore, if your application code is wrong in any way, the database isn't going to help you catch your error, particularly in the depth area. Trees are ordered with respect to siblings. In your case, you're assuming that B comes before C and D comes before E. How are you planning on capturing that in your schema? You're storing depth (which is easily derived and doesn't really help improving select speed), but you're not storing sibling sort order (which isn't derivable because you cannot depend on the order the RDBMS will store the rows). Assuming you have all of that, let's see what your sort would look like in Perl: `my @nodes = map { $_->[0] } sort { # What goes here?? } map { [ $_->{node_id}, $_->{depth}, $_->{sibling_order} ] } $sth->fetchall_hashref();` [download] If you wanted a level-order traversal, you'd use: `$a->{depth} <=> $b->{depth} \|\| $a->{sibling_order} <=> $b->{sibling_order}` [download] Except, that gives you A-B-C-D-E-F-G, which isn't what you want. Take a look at Tree, particularly the traversal code I wrote in order to provide traversals as an iterator into the tree versus an array (which is what you're trying to do in an ORDER BY clause). The code is non-trivial, particularly in that it requires maintaining a stack to keep track of what comes next. In short, you cannot do it with ANSI SQL-1992, SQL-1999, or even SQL-2003. You would have to use a vendor-specific extension that allows user-defined comparators for sorting. I recommend PostgreSQL as MySQL, Oracle, Sybase, and SQLServer all don't provide this. SQLite definitely* doesn't provide this. My criteria for good software: Does it work? Can someone else come in, make a change, and be reasonably certain no bugs were introduced?	[reply] [d/l] [select]
Re^2: Trees in SQL, adjacency lists and sorting. by BUU (Prior) on Jan 06, 2006 at 04:17 UTC
So what's the Right Way to store a tree? You seem to have fairly strong opinions on the matter and I'm always willing to learn. As for the every single depth bit, if I delete a node I can update all its children in a single query.	[reply]
Re^3: Trees in SQL, adjacency lists and sorting. by dragonchild (Archbishop) on Jan 06, 2006 at 14:18 UTC
Data structures exist to organize data in ways that are useful for what we're going to do with it. For example, if all you're doing is storing a HoHoHoHoH that you want to walk but never search, then a database is an extremely poor solution and something like DBM::Deep is preferable. If you have a bunch of items that are on the same level and you want to do random-access searches based on arbitrary criteria, then a relational database is much better than DBM::Deep. How does this relate to trees? You have a perfectly fine way to store a tree in a database . . . depending on what you're trying to do with it. You're using a variant of the self-referential method which stores every single node-to-node relationship, not just the immediate ones. It looks like it would be extremely fast at finding all descendents and all ancestors of a given node, doing each one in one query. However, it will be no better than the standard self-referential method in finding all siblings of a given node. It sounds like you really want to use the nested-sets method. It's where you don't store a reference to your parent node, but instead store a left and right bound which indicates where in the nested sets you stand. To find all your ancestors, you find all nodes where "my_id BETWEEN left AND right". To find all your descendents, you find all nodes where "id BETWEEN my_left AND my_right". It's no better than the standard self-referential in finding all siblings. Both methods (yours and self-referential), it sounds, have near-equal benefits when it comes to speeding up searches. Yours is slightly better when doing inserts and updates at the cost of significantly more disk, but worse when doing deletes. That may be an acceptable tradeoff depending on your intended usage. For example, if your spec says you will never delete a node except in rare situations, this is probably a good direction to look at. As for the every single depth bit, if I delete a node I can update all its children in a single query. I don't think so. Read more... (1150 Bytes) Personally, I'm not sold on the depth thing. I think that storing every node-to-node relationship is definitely a solid strategy. I don't know what having the depth gets you. Personally, I'd use this as an adjunct table to the standard self-referential. Yes, you'd have duplicated information between the main table and the relationships table (the direct parent-child relationship would be duplicated), but if that's the optimization I need, then that's fine because the performance cost of the loss of normalization is made up by the speed gains I get with the caching. My criteria for good software: Does it work? Can someone else come in, make a change, and be reasonably certain no bugs were introduced?	[reply] [d/l] [select]
Re: Trees in SQL, adjacency lists and sorting. by suaveant (Parson) on Jan 05, 2006 at 21:21 UTC
One way I have done it in the past is using coded strings.... used it for threaded discussions... so you have say a 3 character key... top level is `AAA AAB ... 000` [download] children of AAA are `AAAAAA AAAAAB ... AAA000` [download] etc... it limits how wide you can go and how deep, but it works well for querying... - Ant - Some of my best work - (1 2 3)	[reply] [d/l] [select]
Re^2: Trees in SQL, adjacency lists and sorting. by BUU (Prior) on Jan 05, 2006 at 21:28 UTC
It's an interesting thought and I tried doing something similar to that (but with slashes) and I couldn't make it work. Do you have any specific suggestions on how I should implement it given my current structure?	[reply]
Re^3: Trees in SQL, adjacency lists and sorting. by suaveant (Parson) on Jan 05, 2006 at 21:50 UTC
Well.. obviously you have a large unique field for the string.... I did the following... I had a unique varchar(255) that held the tree string and used 2 digit base 36 ids... obviously you could extend the character set, but I figured 1296 children per leaf was enough. If you only need binary then one char is fine, 0-1 or something... you could probably encode something, but that may make queries difficult... obviously with 2 char codes you can only get 127 levels in a tree with a 255 char field. I used a couple of self written functions to turn the codes to and from numbers: `sub de36 { my($a,$b) = (uc substr($_[0],-2,1),uc substr($_[0],-1,1)); $CHARS{$a}36+$CHARS{$b}; } sub en36 { $CHARS[$_[0]/36].$CHARS[$_[0]%36]; }` [download] To insert I would would first grab the max id within the group I wanted, something like `SELECT MAX(thread) FROM messages WHERE thread LIKE "${parent}__"` [download] matching those threads that were in the same parent.... you can get all immediate children with stuff like `SELECT FROM messages WHERE thread LIKE "${parent}__"` [download] and you can get a history with something like `SELECT * FROM messages WHERE thread = LEFT("$thread",LENGTH($thread))` [download] sorting is, in this case, super easy. I am using MySQL syntax but it should be clear enough... You'll have to work out how to make it binary yourself, if you need it... not terribly difficult, select and compare, then add accordingly... table locking could probably help, too. :) - Ant - Some of my best work - (1 2 3)	[reply] [d/l] [select]
Re^2: Trees in SQL, adjacency lists and sorting. by jhourcle (Prior) on Jan 06, 2006 at 16:00 UTC
This style of notation can also be used with polyheirarchies (items w/ more than one parent), if you leave a couple of reserved characters for the notation. Here's an example that my teacher gave us for a class on thesaurus design: Read more... (331 Bytes) So, '10th grade optics' would be 'A1.1.1B1.1'. We use a delimitor between items, so we aren't restricted to subsets that each have a unique character in them, allowing for wider branching. (depth is still a problem when dealing with fixed line lengths) The problem is, it's fine for humans to understand, but a little messy to sort on in SQL. You can get around this by adding another delimiter between facets, and ending every notation w/ a null heirarchy: ';A1.1.1.;B1.1.'. We can then get all items relating to 10th grade physics w/ LIKE '%;A1.1.%;B1.1.%', without fear of getting 'A1.10AB1.1', and still getting items that may have even more facets, eg a movie for 10th grade physics ';A1.1;B1.1;C1' The only requirement is that you keep each of the facets in order. (what order doesn't matter, so long as you're consistent) If you'd like a more complex example: Read more... (815 Bytes)	[reply] [d/l] [select]
Re: Trees in SQL, adjacency lists and sorting. by demerphq (Chancellor) on Jan 06, 2006 at 09:03 UTC
Seems to me that this strategy is only really suitable for small trees. For instance to hold a list of N nodes in your tree you need (N+1)(N/2) records. This means that for even a moderately sized tree you will be dealing with a lot of records. And the sparser the tree the less efficient it will be. Also, iirc celko algorithm doesn't necessarily* require updating all the nodes. And the updates required are simple and efficient. IMO, given that the latter would only hold N records for N nodes Id bet updating it would be more efficient that working with your structure due to scaling. Anyway, you haven't outlined what operations you need to perform on your trees and how often they occur. Without that info its hard to advise you beyond that your proposed algorithm looks unsatisfactory due to scaling. Also have you considered how much work needs to be done if you reparent a node? I should think such an operation would be woefully painful given your current structure. update:Corrections and I removed some needless comments. Shouldnt post before you drink coffee. --- $world=~s/war/peace/g	[reply]
Re: Trees in SQL, adjacency lists and sorting. by blahblahblah (Priest) on Jan 06, 2006 at 01:56 UTC
Here's a link to another node about storing trees in SQL: Tree Structure and Db. It has lots of replies which link to articles about trees and sql. (Sorry I can't describe it more specifically -- I bookmarked it a while ago and still haven't gone back to read it.)	[reply]
Re: Trees in SQL, adjacency lists and sorting. by renodino (Curate) on Jan 06, 2006 at 03:50 UTC
You might want to peruse this book.	[reply]
Re^2: Trees in SQL, adjacency lists and sorting. by simon.proctor (Vicar) on Jan 06, 2006 at 09:05 UTC
I cannot recommend Joe Celkos books highly enough. This book should give you what you need. I used the nested set method for making an FTP server with DB backend and it worked a treat.	[reply]
Re: Trees in SQL, adjacency lists and sorting. by traveler (Parson) on Jan 05, 2006 at 21:31 UTC
"My single problem is that I figure out how to order my nodes once I've selected them. Give the above tree/table, my ideal order looks like: A, B, D, E, C, F" So if you do `SELECT * FROM table WHERE ancestor = 'A'` you want the ordered list A, B, D, E, C, F. That sounds like an preorder traversal of the tree. Doing that in pure SQL is a bit byond me, but you mnight google for "sql preorder tree" to find something. HTH	[reply] [d/l]
Re: Trees in SQL, adjacency lists and sorting. by DungeonKeeper (Novice) on Jan 06, 2006 at 10:33 UTC
The standard approach, despite the temptation to just use one table with an auto-foreign key, is: `CREATE TABLE node (node_name VARCHAR(32) NOT NULL) CREATE TABLE node_node ( parent_fk VARCHAR(32) NOT NULL, child_fk VARC +HAR(32) NOT NULL )` [download] The next bit is DBMS specific but you also need to define the primary keys: (node.node_name) and (node_node.parent_fk, node_node.child_fk) and the foreign key constraints: node.node_name -> node_node.parent_fk and node.node_name -> node_node.child_fk. The advantage of this over the single table solution is that it is normalised and can be extended into a multi-type, multi-tree model without having to keep adding foreign keys to the master table.	[reply] [d/l]
Re^2: Trees in SQL, adjacency lists and sorting. by saberworks (Curate) on Jan 06, 2006 at 19:25 UTC
The problem with this is that you can't do a single select query to get an entire branch of the tree in the correct order. You have to write something like a recursive function and you end up with one query per level of the tree. Someone above mentioned "nested set" - this is what I use and it's great for quick lookups, at the cost of expensive (relatively) inserts. I've never had a database big enough where inserts have been a problem.	[reply]
Re^3: Trees in SQL, adjacency lists and sorting. by demerphq (Chancellor) on Jan 10, 2006 at 16:27 UTC
Thats not really true. Assuming your nodes can be ordered, and assuming you store a root id, you grab everything from the tree on or later to the node you are interested in. Then filter out the noncompliant nodes when you read them. One query sufficies then. Admittedly at the cost of some client side processing, but if that isnt a problem then it should be fine. For the record this exactly what is doe in Recently Added Threads. We grab all the relevent nodes in two queries normally. (Roots then children as our roots arent the same type as the children). --- $world=~s/war/peace/g	[reply]
Re^3: Trees in SQL, adjacency lists and sorting. by DungeonKeeper (Novice) on Jan 09, 2006 at 10:01 UTC
You certainly can't do that with the single table solution either. Such a branch is iterative and therefore cannot be expressed as an RSE whichever RDBMS solution you choose. Everything but the troll	[reply]
Re^4: Trees in SQL, adjacency lists and sorting. by saberworks (Curate) on Jan 09, 2006 at 23:00 UTC
Re^5: Trees in SQL, adjacency lists and sorting. by DungeonKeeper (Novice) on Jan 10, 2006 at 09:35 UTC
Some notes below your chosen depth have not been shown here
Re: Trees in SQL, adjacency lists and sorting. by talexb (Chancellor) on Jan 06, 2006 at 13:54 UTC
Interesting node -- but I see a problem brewing with the rows that have 'distance to the next node' > 1. That's data that can be inferred, and if it gets entered wrong, or if someone monkeys with the database, the whole thing comes crashing down. However, I think I understand what you're doing -- you're trying to pre-cook as much as possible for performance. Thus I'd probably have two tables: the first would have just nearest neighbours, and the second would have all of the inferred data. Then you could use a `UNION` to get the same performance as your original design. Just some idle thoughts over my morning coffee. Alex / talexb / Toronto "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds	[reply] [d/l]
Re: Trees in SQL, adjacency lists and sorting. by Anonymous Monk on Jan 06, 2006 at 22:46 UTC
If you want ordered traversal, I'd add an 'order' column with sibling order, analogous to the depth column. Add a row ('a','a',0,0). select node_id from tree where ancestor='a' order by depth,order;	[reply]


"be consistent"
	PerlMonks