Storing a graph in a database

In fact, I do have a bit of experience putting graphs in databases. (I'm currently working on such a project, in fact.)

I don't like the idea of a column having a list of neighbors, though, because it requires you to effectively "jump through hoops" to actually use the data. As you mention, you'll be using split to pull it apart. Similarly, you don't want to have a set of columns to hold your edges, because it limits you to a certain number of edges, and also makes your queries much uglier. In either case, if you actually wanted to write an SQL query to find connect nodes through edges, you'll be fighting the database every step of the way.

There's a much better way: A common idiom in databases is to use a separate table to represent a many-to-many relationship¹. Sure, it may take a bit more space in your database, but it's very flexible, and you can do many more queries using it.

For example: Suppose you want to store a directed graph in your database². You'd use a table to hold your node information, like so³:

create table nodes (
    node_id int primary key,
    name varchar(32)
    -- other node info goes here
)
[download]

Next, you create a "many-to-many" table to describe your edges:

create table edges (
    from_node_id int not null references nodes(node_id),
    to_node_id int not null   references nodes(node_id),
    description varchar(32)
    -- other edge info goes here
)
[download]

Now you can create a graph that's easy to query using normal SQL, so you don't have to slurp all the data to your local box and then reprocess it. Quickie example:

-- Allowed states for "Daily Work Task"
insert graph_nodes values (1, 'Wait/Idle')
insert graph_nodes values (2, 'Ready')
insert graph_nodes values (3, 'Running')
insert graph_nodes values (4, 'Completed')
insert graph_nodes values (5, 'Fault')
insert graph_nodes values (6, 'Disabled')

-- Normal execution path
insert graph_edges values (1, 2, 'All prerequisites complete')
insert graph_edges values (2, 3, 'User makes request')
insert graph_edges values (3, 4, 'Job completes successfully')
insert graph_edges values (4, 1, 'NewDay resets us at midnight')

-- Abnormal paths
insert graph_edges values (3, 5, 'Job faults')
insert graph_edges values (5, 6, 'Manual intervention required')
insert graph_edges values (6, 1, 'System restarted')
insert graph_edges values (5, 2, 'Fault resolved normally')
insert graph_edges values (1, 6, 'System halted')
insert graph_edges values (2, 6, 'System halted')
[download]

Now you can do some simple queries, letting the database do the work for you:

-- Where can I go from node 5?
select  FR.node_id, '-->', TO.node_id, 'When: ' + E.description
from    edges E
join    nodes FR on FR.node_id=E.from_node_id
join    nodes TO on TO.node_id=E.to_node_id
where   E.from_node_id = 5
[download]

which should yield (untested!):

node_id     node_id
------- --- ------- ----------------------------------
      5 -->       2 When: Fault resolved normally
      5 -->       6 When: Manual intervention required
[download]

Notes:

1: Sure, each edge is a 1-to-1 connection, but from the node's perspective, each node may have 'many' input arcs and 'many' output arcs. So each row of the many-to-many table is a natural description of an arc.

2: If you want a plain (i.e. undirected) graph, then you have two choices: (a) You could rename from_node_id and to_node_id to node_1_id and node_2_id; and modify your queries to take into account that either node ID could be the answer to your query. It's a little more work in your SQL, but not terribly much; or (b) simply add two edges, one from node 1 to node 2, and one from node 2 to node 1. This simplifies your queries at the expense of more database space. (This is what I normally do, as the disk space is easier for me to spend than debugging time....)

3: I'm most familiar with Sybase/MS SQL Server T-SQL syntax, so that's what I'm writing in. You shouldn't have any trouble adapting to Oracle, mySQL, or database of your own choosing.

So if you go to the effort of putting the graph into a database, make sure you take advantage of as many database features as you can to simplify the job--Don't just use it as a "super-big virtual hash".

...roboticus

Comment on Storing a graph in a database Select or Download Code