|Welcome to the Monastery|
Storing a graph in a databaseby roboticus (Chancellor)
|on Aug 10, 2007 at 12:02 UTC||Need Help??|
In fact, I do have a bit of experience putting graphs in databases. (I'm currently working on such a project, in fact.)
I don't like the idea of a column having a list of neighbors, though, because it requires you to effectively "jump through hoops" to actually use the data. As you mention, you'll be using split to pull it apart. Similarly, you don't want to have a set of columns to hold your edges, because it limits you to a certain number of edges, and also makes your queries much uglier. In either case, if you actually wanted to write an SQL query to find connect nodes through edges, you'll be fighting the database every step of the way.
There's a much better way: A common idiom in databases is to use a separate table to represent a many-to-many relationship1. Sure, it may take a bit more space in your database, but it's very flexible, and you can do many more queries using it.
For example: Suppose you want to store a directed graph in your database2. You'd use a table to hold your node information, like so3:So if you go to the effort of putting the graph into a database, make sure you take advantage of as many database features as you can to simplify the job--Don't just use it as a "super-big virtual hash".
Next, you create a "many-to-many" table to describe your edges:
Now you can create a graph that's easy to query using normal SQL, so you don't have to slurp all the data to your local box and then reprocess it. Quickie example:
Now you can do some simple queries, letting the database do the work for you:
which should yield (untested!):
1: Sure, each edge is a 1-to-1 connection, but from the node's perspective, each node may have 'many' input arcs and 'many' output arcs. So each row of the many-to-many table is a natural description of an arc.
2: If you want a plain (i.e. undirected) graph, then you have two choices: (a) You could rename from_node_id and to_node_id to node_1_id and node_2_id; and modify your queries to take into account that either node ID could be the answer to your query. It's a little more work in your SQL, but not terribly much; or (b) simply add two edges, one from node 1 to node 2, and one from node 2 to node 1. This simplifies your queries at the expense of more database space. (This is what I normally do, as the disk space is easier for me to spend than debugging time....)
3: I'm most familiar with Sybase/MS SQL Server T-SQL syntax, so that's what I'm writing in. You shouldn't have any trouble adapting to Oracle, mySQL, or database of your own choosing.