Greetings monks. I'm in the early planning stages of a database application that will be written in Perl. One requirement I know will be present is version control. Users will need the ability to do exclusive checkouts and checkins, as well as examine past versions of objects and revert an object to a past state. The objects are complicated enough that they will require several tables to store. Aside from the versioning actions, users will be doing the normal list, edit, view, save stuff with them.

When I first encountered the problem my mind jumped to CVS. I'll just hack up a database-to-CVS bridge and handle it that way, thought I. But as I think about it further, I'm not sure that's a good idea. There's a large impedance mismatch between records in a database and files in CVS.

An alternative might be to have a version column in every table. The problem with this is that the primary key for each table is now (foo_id, version) rather than just (foo_id). I fear this will complicate the rest of the application enormously. Join tables in particular would become much less useful. It's so much nicer when a single integer is enough to identify a record.

Any bright ideas floating around in the monastery? Feel free to meditate on this one for a while - the application is at least a couple of months away from implementation.



Re: Version Control in Database Applications
by crazyinsomniac (Prior) on Jun 10, 2002 at 05:32 UTC
Re: Version Control in Database Applications
by hossman (Prior) on Jun 10, 2002 at 06:03 UTC
    I've seen two approaches to this problem...
    1. Generic ID vs. Specific ID

      The principle here is that you still use whatever unique identifiers you would without versioning, and use those in whatever mapping tables / foreign keys you need -- call those your Specific IDs (specific to a particular version of the object). On top of that you have additional information that maps a Generic ID (generic to the object, regardless of version) to the currently "live" specific object, who has the generic checked out, etc. Most of your clients will only ever use the Generic IDs; the only time anyone needs to worry about a Specific ID is when they want to revert to a previous version, or make a "future" version "live".

    2. Live tables vs Archive tables

      Again, use your normal schema with your normal unique IDs, but add an additional "archival" table that flattens all of your data (maybe as an XML doc in a text field) and keeps track of the versioning info ... this is really only useful for keeping a historic log. When making a new version, flatten the current one first and archive it. If anybody wants to "restore" an old version, de-flatten it.

    Both methods have complexities. You have to think long and hard in advance about what makes up a single "object", what it is people "lock", how you deal with related objects, and how you handle adding/removing/modifying relationships in different versions. As long as you do a good job of picking your rules in advance, either method can work.
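    To make the first scheme concrete, here is a minimal sketch in Python with SQLite (the application in question is Perl; the language, and all table and column names here, are purely illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- One row per *version* of an object; specific_id is the normal
    -- surrogate key that joins and foreign keys point at.
    CREATE TABLE widget_version (
        specific_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    -- One row per *object*; maps the generic id to the currently
    -- "live" version and records who has it checked out.
    CREATE TABLE widget (
        generic_id       INTEGER PRIMARY KEY,
        live_specific_id INTEGER NOT NULL REFERENCES widget_version,
        checked_out_by   TEXT
    );
""")

db.execute("INSERT INTO widget_version VALUES (1, 'first draft')")
db.execute("INSERT INTO widget VALUES (100, 1, NULL)")

# Saving a new version inserts a new specific row, then repoints
# the generic row at it.
db.execute("INSERT INTO widget_version VALUES (2, 'second draft')")
db.execute("UPDATE widget SET live_specific_id = 2 WHERE generic_id = 100")

# Clients only ever ask by Generic ID:
row = db.execute("""
    SELECT v.name FROM widget w
    JOIN widget_version v ON v.specific_id = w.live_specific_id
    WHERE w.generic_id = 100
""").fetchone()
print(row[0])   # prints "second draft"

# Reverting is just repointing the generic row at the old version:
db.execute("UPDATE widget SET live_specific_id = 1 WHERE generic_id = 100")
```

    The nice property is that a revert never touches the data tables at all -- it is one UPDATE on the mapping row.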

Re: Version Control in Database Applications
by chromatic (Archbishop) on Jun 10, 2002 at 05:20 UTC
    I did something rather similar in the Slash Wiki plugin. It's not a brilliant implementation, but it served my purposes.
Re: Version Control in Database Applications
by tomhukins (Curate) on Jun 10, 2002 at 16:06 UTC
Re: Version Control in Database Applications
by samtregar (Abbot) on Jun 10, 2002 at 16:50 UTC
    Thanks guys! Your suggestions have been very helpful, particularly the ones from hossman and tomhukins.

    I think I have an idea for an implementation that would be both easy on queries and on updates. Basically, instead of using an INT column for my ids I'd use a VARCHAR. Then the current version of every object is just the id number, say "1024". When I create a new version I append the version number creating a new id - "1024,1" for version 1 for example. This makes creating a new version as simple as cloning records in all affected tables under the new id. SELECTs and JOINs stay simple since selecting on the id as-is always gets you the current version.
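    A minimal sketch of that id scheme, in Python with SQLite purely for illustration (the real application would be Perl; the table foo and its values are invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE foo (id TEXT PRIMARY KEY, name TEXT)")

# The current version lives under the bare id.
db.execute("INSERT INTO foo VALUES ('1024', 'original text')")

# Creating a new version: first clone the current row under "id,version"...
db.execute("""
    INSERT INTO foo (id, name)
    SELECT id || ',1', name FROM foo WHERE id = '1024'
""")
# ...then edit the current row in place.
db.execute("UPDATE foo SET name = 'edited text' WHERE id = '1024'")

# Selecting the current version needs no version logic at all:
current = db.execute("SELECT name FROM foo WHERE id = '1024'").fetchone()[0]
# And version 1 is just the suffixed id:
v1 = db.execute("SELECT name FROM foo WHERE id = '1024,1'").fetchone()[0]
```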

    So, how does that sound? Am I missing something that will cause me premature hair-loss later?


      I think you may find that a composite key on two INTEGERs is faster than a single VARCHAR key (and smaller as well). Besides, the composite key allows easier gathering of the history for a single item. Compare:
      SELECT * FROM tablename WHERE key = 12345 ORDER BY version;
      SELECT * FROM tablename WHERE key LIKE '12345,%' ORDER BY key;
      The first will most likely get to use an index on key (especially if you build the index with key first and version second), while the second is most likely a table scan. Here are the singleton selects for the current version:
      SELECT * FROM tablename o WHERE key = ? AND version = ( SELECT max(version) FROM tablename i WHERE i.key = o.key )
      SELECT * FROM tablename WHERE key = ( SELECT max(key) FROM tablename WHERE key LIKE '$key,%' )
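      The composite-key variant can be sketched like so (Python with SQLite here just to make it runnable; the correlated subselect is the one quoted above, and the table contents are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Composite primary key (key, version): its index covers both the
# history query and the "current version" subselect.
db.execute("""
    CREATE TABLE tablename (
        key     INTEGER NOT NULL,
        version INTEGER NOT NULL,
        name    TEXT,
        PRIMARY KEY (key, version)
    )
""")
db.executemany("INSERT INTO tablename VALUES (?, ?, ?)",
               [(12345, 1, 'v1'), (12345, 2, 'v2'), (12345, 3, 'v3')])

# Full history of one item -- an indexed range scan, not a LIKE:
history = db.execute(
    "SELECT version, name FROM tablename WHERE key = 12345 ORDER BY version"
).fetchall()

# Singleton select for the current version, via the correlated subselect:
current = db.execute("""
    SELECT name FROM tablename o
    WHERE key = 12345
      AND version = (SELECT max(version) FROM tablename i WHERE i.key = o.key)
""").fetchone()[0]
```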
        Re: Speed. My guess is that given an indexed fixed-width CHAR field I should be able to get acceptable speed on lookups. All I need to do is decide on the maximum total number of objects and the maximum number of versions and I should be able to pick a field width. I'll be using MySQL and my memory is that MySQL indexes CHARs just as well as it does INTs.

        Re: Querying. I think you misunderstood (or I did). In my planned implementation the current version would always exist without an appended ",version". Thus, selecting the current version is just:

        SELECT * FROM foo WHERE id = "1024"

        No sub-select required! And selecting version N is simply:

        SELECT * FROM foo WHERE id = "1024,N"

        Now, I may still need an extra column to store the current version number but as long as it isn't needed on every select and every join I'll be happy.

        I think you're correct about collecting aggregate information across versions being hard. But I don't anticipate needing to do that very often, if at all.


Re: Version Control in Database Applications
by perrin (Chancellor) on Jun 11, 2002 at 01:38 UTC
    I've often thought about this one too. I've seen implementations that used most of the tricks described in Greenspun's article, and also some that treated a database like CVS. Allaire Spectra serialized all data objects into XML blobs when storing them in the database, which made versioning easier but was not very efficient. Interwoven TeamSite collects all of your data in XML and then publishes it to the database using an XML-to-RDBMS bridge, but that means duplicating your data structure definitions and is basically a kludge. Most people end up just giving up on the idea of real versioning and using some kind of assembly stage instead, i.e. assemble everything on an internal content server and then periodically publish it all to the live server, like making a release from CVS.

    For most content management problems like this, I think the multi-part key (id + version #) is the best way to go. You just keep a VERSIONS table which stores the current version number for each versioned table. You can even make this work for relationship tables if you choose a side that the relationship is from, e.g. you link foo (version 19) to bar (latest). If you put version numbers on both sides of the link and then one of the objects gets reverted, you can't easily resolve it.

    In order to make this easier to deal with from a programming perspective, I would suggest wrapping a database view around it. Then you can have simple keys in your queries and the view will join against the VERSIONS table and show the right one.
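    That view trick might look something like this -- sketched in Python with SQLite (names like article and article_current are invented for the example):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Versioned data table with a composite (id, version) key.
    CREATE TABLE article (
        id      INTEGER NOT NULL,
        version INTEGER NOT NULL,
        title   TEXT,
        PRIMARY KEY (id, version)
    );
    -- One row per versioned table + id, recording the current version.
    CREATE TABLE versions (
        table_name      TEXT    NOT NULL,
        id              INTEGER NOT NULL,
        current_version INTEGER NOT NULL,
        PRIMARY KEY (table_name, id)
    );
    -- The view hides the version bookkeeping from ordinary queries.
    CREATE VIEW article_current AS
        SELECT a.id, a.title
        FROM article a
        JOIN versions v
          ON v.table_name = 'article'
         AND v.id = a.id
         AND v.current_version = a.version;
""")

db.executemany("INSERT INTO article VALUES (?, ?, ?)",
               [(7, 1, 'first draft'), (7, 2, 'second draft')])
db.execute("INSERT INTO versions VALUES ('article', 7, 2)")

# Ordinary queries use simple keys against the view:
title = db.execute(
    "SELECT title FROM article_current WHERE id = 7").fetchone()[0]

# Reverting is a single UPDATE on the bookkeeping row:
db.execute("""UPDATE versions SET current_version = 1
              WHERE table_name = 'article' AND id = 7""")
```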

Re: Version Control in Database Applications
by lachoy (Parson) on Jun 11, 2002 at 03:24 UTC

    I've actually had the idea (but no motivation in the way of a project) to do this in SPOPS. This would take the form of an extra table:

    CREATE TABLE versions (
        class       varchar(75)   not null,
        object_id   varchar(255)  not null,
        version     numeric(8,2)  not null,
        object_data blob          not null,
        primary key ( class, object_id, version )
    )

    (There would likely be problems with the width of this primary key but I'd probably restrict the object_id to 150 characters or so.... implementation issue.)

    After every update that happens to the object, we create a rule that saves the object state to the database with an incremented version number -- using Storable if you can be Perl-only, XML if you want to be cross-platform. Fetching all versions of an object would be simple, and in the (hopefully unlikely) event of a primary key modification, we can do an update to ensure the information is migrated properly.

    Having a single table with all versions is attractive as long as you don't need to search on a feature of the versioned content -- for instance, finding all users that at one time had a first name of 'Steve'. It also insulates you from future schema changes, since you don't care about the schema at all.

    I'd think you could also do this using whatever persistence framework you want with something like Class::Observable or Class::Trigger.
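    The save-and-fetch rule might look roughly like this -- sketched in Python, with pickle standing in for Storable, and the class/object_id values invented:

```python
import pickle
import sqlite3

db = sqlite3.connect(":memory:")
# The versions table from above, trimmed to SQLite types.
db.execute("""
    CREATE TABLE versions (
        class       TEXT    NOT NULL,
        object_id   TEXT    NOT NULL,
        version     INTEGER NOT NULL,
        object_data BLOB    NOT NULL,
        PRIMARY KEY (class, object_id, version)
    )
""")

def save_version(db, cls, obj_id, obj):
    """Freeze the object's state under the next version number."""
    (last,) = db.execute(
        "SELECT COALESCE(MAX(version), 0) FROM versions"
        " WHERE class = ? AND object_id = ?", (cls, obj_id)).fetchone()
    db.execute("INSERT INTO versions VALUES (?, ?, ?, ?)",
               (cls, obj_id, last + 1, pickle.dumps(obj)))
    return last + 1

def fetch_version(db, cls, obj_id, version):
    """Thaw one archived version of the object."""
    (blob,) = db.execute(
        "SELECT object_data FROM versions"
        " WHERE class = ? AND object_id = ? AND version = ?",
        (cls, obj_id, version)).fetchone()
    return pickle.loads(blob)

# view() and revert() fall out of fetch_version() almost for free:
save_version(db, "My::User", "42", {"first_name": "Steve"})
save_version(db, "My::User", "42", {"first_name": "Stephen"})
old = fetch_version(db, "My::User", "42", 1)
```

    As noted above, the trade-off is that the frozen blob is opaque: you cannot SELECT on a field inside a past version without thawing every row.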

    M-x auto-bs-mode

      That's a very interesting suggestion. It hadn't occurred to me to implement versioning with something as simple as Storable. I think this may be the winner. I doubt I'll need to implement more than view() and revert() on past versions, both of which would be easy to implement with a frozen object.