Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

An alternate technique for database primary keys

by sundialsvc4 (Monsignor)
on Nov 05, 2010 at 01:40 UTC ( #869612=perlmeditation: print w/ replies, xml ) Need Help??

Every relational database table should, of course, have a primary key.   This should be a column whose only purpose in life is to uniquely identify the row, and this strictly for the database’s own, internal purposes.   It should be a column whose values are utterly devoid of business or other meaning.

Every DBMS provides one way to do this:   the “auto-increment integer.”   When you insert a new row, this column will automatically be initialized with a value that is supposed to be unique.   It can therefore be used as a primary key.

Sometimes you have no practical choice but to use this.   If you are inserting rows using an INSERT...FROM statement, then this might be the only way to get the job done.   But if you are inserting rows programmatically, here’s another approach that worked really well for me.

My application has a subroutine which will generate an 8-character random string of letters and numbers, prefixed by an optional string provided by the caller.   (The algorithm is a very well-known “Perl one-liner,” therefore not shown here.)   The choice of “8 characters’ is arbitrary, but it should simply be long enough that a “collision” is impossible.   The domain of letters is chosen so that the strings won’t contain any troublesome punctuation-marks.   The prefix makes it readable to humans.   (For instance, you can guess that dataset_1ap6GoWR probably refers to a “dataset” of some kind...)   I call these things monikers.

The advantage of this approach is that “every integer looks exactly like every other integer, but monikers never, ever will.”   If the wrong kind of moniker is in some column, you can immediately detect it by eye.   Or, by a query, e.g.:

select dataset_moniker from datasets where dataset_moniker not like 'dataset_%'

(Footnote:   This is a great application for any of the Test modules in CPAN ...)

My application also features an object called Filer.   Every database manipulation that is done, is done by and through the Filer.   Every question that is asked, and every change that is done, is accomplished by means of Filer methods (or by other “intelligent” program objects which rely exclusively upon Filer for their database-related services).   As this application has continued to develop (it is a valiant and successful attempt to automate program-understanding of a legacy application that issues over 11,000 separate queries against more than 860 tables across a source-code library of 3,600+ DB2, Oracle and SAS programs ... all of them automatically discovered, parsed, analyzed and cataloged by this app ...) its data structures have changed many times.   But the Filer is the only one who has to know.

As is usual for me, I didn’t know quite what I was getting into when I said that this app could be built in Perl.   But it has been successful, and I thought I’d share the Meditation.   HTH!™

Comment on An alternate technique for database primary keys
Re: An improved technique for database primary keys
by duelafn (Priest) on Nov 05, 2010 at 05:15 UTC

    Some databases can handle this directly:

    psql> CREATE SEQUENCE foo; psql> CREATE TABLE bar ( id char(10) PRIMARY KEY DEFAULT 'bar_' || nextval('foo') , name text ); psql> INSERT INTO bar (name) VALUES ('Alice'); psql> INSERT INTO bar (name) VALUES ('Bob'); psql> SELECT * FROM bar; id | name ------------+------- bar_1 | Alice bar_2 | Bob

    Good Day,
        Dean

Re: An improved technique for database primary keys
by fullermd (Curate) on Nov 05, 2010 at 05:32 UTC
    The choice of “8 characters’ is arbitrary, but it should simply be long enough that a “collision” is impossible.
    I woke up screaming when I read that. And I wasn't even asleep.

      Sleep well.   It is a calculated-risk, and of course the index in question must be UNIQUE so that colliding values, if they were somehow generated, would not be stored.   You could, of course, go so far as to use Microsoft’s UUIDs (GUIDs).

      The notion is simply that, if the string is random and sufficiently long, key collisions in-practice “ain’t never actually gonna happen.”   Your mileage may vary.   Choose appropriately.   If, in whatever context you may be dealing with, this admonition isn’t sound, then don’t follow it.   Peace.

Re: An improved technique for database primary keys
by moritz (Cardinal) on Nov 05, 2010 at 08:55 UTC
    This should be a column whose only purpose in life is to uniquely identify the row, and this strictly for the database’s own, internal purposes.

    Which is why it makes me uneasy to generate the primary key for new rows outside the database. Sometimes you don't want to use Perl to talk to the database (for example using the database's command line client for administrative tasks is very handy), and it would be a pity if you couldn't insert any rows with reasonable effort without going through your Perl scripts/modules.

    I don't know if the databases that you have to work with support generating such sequences, but I for one would feel uneasy about moving such a core component of database integrity out of the database, if it can be avoided by any means.

    Perl 6 - links to (nearly) everything that is Perl 6.

      As noted, this task is performed by a Filer module within my application, which is “solely responsible for its content” (and which provides a public moniker-generating subroutine for anyone and everyone to use).   It is a calculated, and I think in this case defendable, design choice.   But your point is well-taken.

Re: An improved technique for database primary keys
by JavaFan (Canon) on Nov 05, 2010 at 10:06 UTC
    My application has a subroutine which will generate an 8-character random string of letters and numbers, prefixed by an optional string provided by the caller.
    That can be quite a performance killer if your database stores rows in primary key order. Storing rows in primary key order has some advantages: it saves on the overhead size of the index, and new rows will always go on the end: no need to split pages, and less rebalancing of the index. Furthermore, I prefer to keep my rows as small as possible (smaller rows means more rows/page, more rows/page means less pages needed, less pages needed means less disk I/O, less disk I/O means better performance). If I can use a tinyint as primary key, I use a tinyint. Else, a mediumint. But even a 4-byte integer is less than an 8 character string. And even using a data column as primary key is acceptable.
    The advantage of this approach is that “every integer looks exactly like every other integer, but monikers never, ever will.” If the wrong kind of moniker is in some column, you can immediately detect it by eye. Or, by a query, e.g.:
    select dataset_moniker from datasets where dataset_moniker not lik +e 'dataset_%'
    How does that rhyme with what you write in your first paragraph: It should be a column whose values are utterly devoid of business or other meaning. If your primary column is utterly devoid of any business or other meaning, you would never use literals in your queries. Your primary columns would only show up in joins. Your example would never be issued.

      The “literal” content of the moniker is expressly designed to make the string more sensible to humans.   If the moniker simply consisted of a random string of characters, then it would effectively be “just like a large integer,” in the sense that, if you accidentally got one in the wrong place, you could never know it nor detect it.

      As you correctly and astutely observe, my phrase, utterly devoid of ... other meaning, is not strictly true, at least with one particular reading of the English word, “meaning.”   The random string, itself, means nothing.   The prefix is to be used only as a convention.

      It would be verboten to write logic that examined the value of the moniker and derived any sort of “actionable understanding” of the record’s content based on any prefix-string found there.   That would be an egregious violation of normal-forms.   But to make key-strings convey information that is useful for debugging and for automatic data-integrity verification ... that, I submit, is useful.

      I agree with your point about page-splits.   You are absolutely correct that this strategy comes with costs, and this is one of them.   (If your database supports hashed, vs. B-tree, indexes, and if you elect to use them due to reasons other than just this, then that cost can be somewhat relieved.)

Re: An improved technique for database primary keys
by herveus (Parson) on Nov 05, 2010 at 12:54 UTC
    Howdy!

    Every relational database table should, of course, have a primary key. This should be a column whose only purpose in life is to uniquely identify the row, and this strictly for the database’s own, internal purposes. It should be a column whose values are utterly devoid of business or other meaning.

    I beg to differ. Every table should have a primary key. However, that key can (and often should) be comprised of multiple columns. What you describe here is the use of surrogate keys. Sometimes you have no real choice but to use a surrogate key, but they are frequently misapplied. Even if you do use a surrogate key, there should be a set of columns whose values identify the row. Those columns are the natural key. If any of those columns are permitted to be NULL, they can't be a primary key. The best you can do with them is make a UNIQUE index on them.

    The risk with surrogate keys (especially if the natural key is ignored) is that you will get duplicate data. I've seen that in the wild. Use surrogate keys with care, if you must, but remember to identify the alternate key and enforce it.

    Aside from that major quibble, I do like your idea.

    yours,
    Michael

        “Keyvil...”   :-D   I like it.   Thank you very much for the interesting links.

      You are correct:   these are, indeed, “surrogate” keys.

      “Constants aren’t.   Variables don’t.   Unique numbers aren’t (at least not if any human being has anything to do with it).   Even computer-generated “non-repeating” numbers sometimes do.   Plan accordingly.”

      As a general design principle, I routinely elect to use surrogate keys as the record-identifiers, i.e. for table-JOIN purposes, and to use UNIQUE indexes for any of the “natural” uses that you describe.

      I learned to do this when working on a very early project for an insurance company which had converted from an original paper-based provider-ID numbering system.   The same provider-ID was assigned to more than one provider, and providers had more than one ID.   Furthermore, business requirements made it impossible for this situation to be changed, because these numbers were widely-distributed in the field.   My strategy involved the use of surrogate keys.   These keys were “nonsensical character strings,” much like these, and they were never, ever divulged to the user.   The application stored the provider-ID number that had been entered, which was known to be ambiguous, and it resolved that ambiguity (by various rules and by user-input) and stored the corresponding (unambiguous) surrogate-key in another column.   All table linking took place using the surrogate columns.   This approach did successfully solve what was up to that time a very thorny business problem.

        Howdy!

        ...and you describe exactly the kind of situation where a surrogate key becomes necessary.

        I simply object to the terminology which implies that primary keys are, necessarily, opaque values stored in a single column. Changing the title to refer to surrogate keys would fix that.

        yours,
        Michael

      Multicolumn primary keys are ... a major pita as soon as you need to do any joins. I had to write some queries for a schema that used them and it was horrible and error prone. It was way too easy to forget one of the three columns in the sixth join. Plus of course having to repeat all those columns in all those related tables is wasteful.

      If you want to ensure something is unique, use a unique index/constraint, but please do not ever even think of using a multicolumn PK!

      Update: I forgot to add ... unless the table is basically just implementation of an N-N relation. With or without additional columns. In that case you are unlikely to need to join using both columns at the same time (see ... you can't assume you'll always include the whole PK in all joins) and the surrogate ID would (almost) never be used.

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.

        Howdy!

        It's a tradeoff. If you use surrogate keys, you always have to join to get the actual key values. Multicolumn PKs are natural. Surrogate keys obfuscate the true structure. Foreign key constraints are essential to document the key structures and to enforce referential integrity. They also help query building tools get the joins right.

        yours,
        Michael
Re: An improved technique for database primary keys
by Tux (Monsignor) on Nov 05, 2010 at 14:23 UTC

    Would the title, and the tone of your post, would have been "An alternate technique ..." instead of "An improved technique ...", I would have had a completely different feeling.

    I really abhor your thoughts now. This goes straight into colision course with normalization. There is absoloutely no need for surrogate keys if the primary key is unique and used as such. In many many many occasions, the real key indead has a meaning, or even better, a centralized (or decentralized looking from the opposite site) location where the "base" table values can be fetched for reference. Think of the prefix for phone numbers for countries. In those cases the keys are obvious and logical. Dialing +31 will get you to the Netherlands, and that is very unlikely to change. In a table that would store the country name for prefixes, the keys should simple be 31.


    Enjoy, Have FUN! H.Merijn

      Poof...   your wish is my command.   I have no wish to be provocative in my selection of titles, nor in the tone of my posts.   (Thank you for the guidance.)   I changed the thread-title, although the child-records (existing replies) were not changed thereby.

      Nodding politely (but, flinching slightly) at your use of the word, “abhor,” I find it to very often be the case that no “human provided” number serves well as a primary-key.   As the subject of a “UNIQUE-indexed column,” perhaps, but not as “the glue that sticks everything together.”

      I say this because “unique numbers” in the human world are often not perfectly unique, or if they are, do not stay that way.   And when (not if...) one of these eventualities happen, they are very problematic for the computer.   An account-number change, for instance, that “ripples” through thousands of records, for instance, and/or that invalidates an archive if the account is of very long standing.   An assigned number (in-use by a human) that cannot be stored.   Or, as I related above, a number that does not have the one-to-one correspondence with reality that it should (according to the good Dr. Codd) have.   In those situations, surrogate (or “synthetic”) keys might excel.   You are obliged to handle, and to continue to handle, a real-world circumstance that is “mathematically speaking, compromised.”   The business isn’t going to stop for a schema.   I dealt with that first-hand, and yes, it shaped my thoughts.

      Of course it is a choice to be made; not an absolute maxim.   “So help me, Codd.”

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://869612]
Approved by mr_mischief
Front-paged by Arunbear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (17)
As of 2014-07-14 15:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (267 votes), past polls