http://www.perlmonks.org?node_id=631772


in reply to Re^2: storing huge graph
in thread storing huge graph

spx2:

In fact, I do have a bit of experience putting graphs in databases. (I'm currently working on such a project, in fact.)

I don't like the idea of a column having a list of neighbors, though, because it requires you to effectively "jump through hoops" to actually use the data. As you mention, you'll be using split to pull it apart. Similarly, you don't want to have a set of columns to hold your edges, because it limits you to a certain number of edges, and also makes your queries much uglier. In either case, if you actually wanted to write an SQL query to find connect nodes through edges, you'll be fighting the database every step of the way.

There's a much better way: A common idiom in databases is to use a separate table to represent a many-to-many relationship1. Sure, it may take a bit more space in your database, but it's very flexible, and you can do many more queries using it.

For example: Suppose you want to store a directed graph in your database2. You'd use a table to hold your node information, like so3:

create table nodes ( node_id int primary key, name varchar(32) -- other node info goes here )
Next, you create a "many-to-many" table to describe your edges:

create table edges ( from_node_id int not null references nodes(node_id), to_node_id int not null references nodes(node_id), description varchar(32) -- other edge info goes here )
Now you can create a graph that's easy to query using normal SQL, so you don't have to slurp all the data to your local box and then reprocess it. Quickie example:

-- Allowed states for "Daily Work Task" insert graph_nodes values (1, 'Wait/Idle') insert graph_nodes values (2, 'Ready') insert graph_nodes values (3, 'Running') insert graph_nodes values (4, 'Completed') insert graph_nodes values (5, 'Fault') insert graph_nodes values (6, 'Disabled') -- Normal execution path insert graph_edges values (1, 2, 'All prerequisites complete') insert graph_edges values (2, 3, 'User makes request') insert graph_edges values (3, 4, 'Job completes successfully') insert graph_edges values (4, 1, 'NewDay resets us at midnight') -- Abnormal paths insert graph_edges values (3, 5, 'Job faults') insert graph_edges values (5, 6, 'Manual intervention required') insert graph_edges values (6, 1, 'System restarted') insert graph_edges values (5, 2, 'Fault resolved normally') insert graph_edges values (1, 6, 'System halted') insert graph_edges values (2, 6, 'System halted')
Now you can do some simple queries, letting the database do the work for you:

-- Where can I go from node 5? select FR.node_id, '-->', TO.node_id, 'When: ' + E.description from edges E join nodes FR on FR.node_id=E.from_node_id join nodes TO on TO.node_id=E.to_node_id where E.from_node_id = 5
which should yield (untested!):

node_id node_id ------- --- ------- ---------------------------------- 5 --> 2 When: Fault resolved normally 5 --> 6 When: Manual intervention required
Notes:

1: Sure, each edge is a 1-to-1 connection, but from the node's perspective, each node may have 'many' input arcs and 'many' output arcs. So each row of the many-to-many table is a natural description of an arc.

2: If you want a plain (i.e. undirected) graph, then you have two choices: (a) You could rename from_node_id and to_node_id to node_1_id and node_2_id; and modify your queries to take into account that either node ID could be the answer to your query. It's a little more work in your SQL, but not terribly much; or (b) simply add two edges, one from node 1 to node 2, and one from node 2 to node 1. This simplifies your queries at the expense of more database space. (This is what I normally do, as the disk space is easier for me to spend than debugging time....)

3: I'm most familiar with Sybase/MS SQL Server T-SQL syntax, so that's what I'm writing in. You shouldn't have any trouble adapting to Oracle, mySQL, or database of your own choosing.

So if you go to the effort of putting the graph into a database, make sure you take advantage of as many database features as you can to simplify the job--Don't just use it as a "super-big virtual hash".

...roboticus

Replies are listed 'Best First'.
Re: Storing a graph in a database
by BrowserUk (Patriarch) on Aug 10, 2007 at 12:36 UTC

    A virtual roboticus++ for 1 .. 20 as I can't do it for real. For a 'use a database' node that gives (much) more than just that advice. And especially for:

    So if you go to the effort of putting the ... into a database, make sure you take advantage of as many database features as you can to simplify the job--Don't just use it as a "super-big virtual hash".

    Substitute "anything" for the ellipses and this should become a new Monk's Quip.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.