Clear questions and runnable code get the best and fastest answer |
|
PerlMonks |
EdwardG's scratchpadby EdwardG (Vicar) |
on Jun 02, 2004 at 06:30 UTC ( [id://359214]=scratchpad: print w/replies, xml ) | Need Help?? |
Title: Normalisation Normalisation appears to be widely misunderstood and misrepresented. Inaccurate, misleading, and outright bogus definitions are common, and it seems that even reputable organisations have got it wrong, for instance MySQL AB gets it wrong in their document An Introduction to Database Normalisation. In this post I want to provide an unequivocal definition based primarily on the writings of C.J. Date. Although I hope that this definition is accurate, it might stir some controversy, since a number of monastarians appear to have absorbed misinformation, and one has even produced a CPAN module that claims to transform a MySQL table "from 1st to 2nd normal form", when in fact it does no such thing. I should start by equating some formal and informal terms. Note that the formal terms have a precise definition that is stricter and usually means more than the informal approximations. For example, the definition of a Relation precludes the existence of duplicate Tuples (Rows), whereas the definition of a Table does not. This distinction between the formal and informal is, I think, an important point, and it may explain some of the confusion.
It's important that we understand this vocabulary, so here are some pictures. This is a Relation (Table) of Suppliers +----+---------+--------+--------+ | ID | SURNAME | STATUS | CITY | +----+---------+--------+--------+ | 1 | Smith | 20 | London | | 2 | Jones | 10 | Paris | | 3 | Blake | 30 | Paris | | 4 | Clark | 20 | London | | 5 | Adams | 30 | Athens | +----+---------+--------+--------+ And this is a Tuple (Row) +----+---------+--------+--------+ | ID | SURNAME | STATUS | CITY | +----+---------+--------+--------+ +------------------------------------+ | | 1 | Smith | 20 | London | |-- Tuple (row or record) +------------------------------------+ | 2 | Jones | 10 | Paris | | 3 | Blake | 30 | Paris | | 4 | Clark | 20 | London | | 5 | Adams | 30 | Athens | +----+---------+--------+--------+ Here's another Tuple: +----+---------+--------+--------+ | ID | SURNAME | STATUS | CITY | +----+---------+--------+--------+ | 1 | Smith | 20 | London | +------------------------------------+ | | 2 | Jones | 10 | Paris | |-- Another Tuple +------------------------------------+ | 3 | Blake | 30 | Paris | | 4 | Clark | 20 | London | | 5 | Adams | 30 | Athens | +----+---------+--------+--------+ These are the Attributes (Columns): Name Domain | | +------------Attributes--------|------|--------------------+ | +--------+---------------+---|------|-----+------------+ | | | ID: ID | SURNAME: NAME | STATUS: STATUS | CITY: CITY | | | +--------+---------------+----------------+------------+ | +----------------------------------------------------------+ | 1 | Smith | 20 | London | | 2 | Jones | 10 | Paris | | 3 | Blake | 30 | Paris | | 4 | Clark | 20 | London | | 5 | Adams | 30 | Athens | +--------+---------------+-------------+------------+ Notice that each Attribute has both a Name and a Domain. That is the definition of an Attribute. Domains (Data Types) represent the finite set of legal values for an attribute. Finally, a Relvar (Table Structure) is the unordered collection of Attributes (not including the data). +------------Relvar of Suppliers----------------------------+ | +---------+---------------+----------------+------------+ | | | ID: INT | SURNAME: NAME | STATUS: STATUS | CITY: NAME | | | +---------+---------------+----------------+------------+ | +-----------------------------------------------------------+ Now for the definition of Normalisation. NormalisationNormalisation is not directly concerned with storage efficiency or scalability. It formalises some good design principles, but it is not the entirety of good database design. There are many normal forms, but I intend to give definitions of these normal forms only up to third normal form, since the rest are less relevant, and in any case that isn't where the bulk of confusion appears to lie. Each successive normal form builds upon its predecessor. To say that a relvar is in third normal form is also to imply that the relvar is in first and second normal forms. Higher normal forms are more desirable than lower normal forms, in the sense that they are closer to the end goal of normalisation. First Normal Form (1NF)Informally; a table is in First Normal Form if every column of each row contains exactly one indivisable value. "Indivisable" is context-sensitive. For instance, it probably isn't sensible to store the individual bits of an integer, but it might be sensible to divide a person's name into forename and surname. In some cases it might be sensible to divide a date into year, month, day etc. In most cases we shouldn't store a list of email addresses in a single attribute. Like a lot of things, it depends on the circumstances. In contrast to the above definition, here are some incorrect definitions of first normal form
Second Normal Form (2NF)It's worth noting that second normal form is only applicable where the primary key of a table is composed of more than one attribute. Keys composed of more than one attribute are known as composite keys. To repeat this important point; if the primary key is not a composite key, 2NF does not apply. Here's the definition Informally; a table is in second normal form if every non-primary-key attribute of that table is dependant on the whole of the primary key, that is; every non-primary-key attribute is identified by the whole of the primary key. The term "X depends on Y" means that a relationship exists between attributes X and Y such that a given value of Y always implies a value of X. For example, a SCHOOLID of "9876" always implies a SCHOOLNAME of "Wellington". This relationship is formally known as a Functional Dependence. Here's an example of a table that violates 2NF: +~~~~~~~~~~~~~~~~~~~~~~+---------------+---------+------------+ | STUDENTID | SCHOOLID | DATEOFARRIVAL | SURNAME | SCHOOLNAME | +-----------+----------+---------------+---------+------------+ | 1234 | 999 | 2005-09-01 | Smith | Wellington | +~~~~~~~~~~~~~~~~~~~~~~+---------------+---------+------------+ In this table, the primary key is a composite of STUDENTID and SCHOOLID, and the purpose of the table is to store the date upon which a Student first arrives at a School. The 2NF violation exists because SURNAME and SCHOOLNAME depend on only part of the primary key. To achieve 2NF, SURNAME and SCHOOLNAME should be removed from this table (and placed elsewhere). Third Normal Form (3NF)Informally; a table is in third normal form if all its non-primary-key attributes do not depend on any other non-primary-key attribute. Here's an example of a table that violates 3NF: +~~~~~~~~~~~+---------+------------+----------+-------------+ | STUDENTID | SURNAME | DOA | SCHOOLID | SCHOOLNAME | +~~~~~~~~~~~+---------+------------+----------+-------------+ | 1234 | Smith | 2005-09-01 | 9876 | Wellington | +~~~~~~~~~~~+---------+------------+----------+-------------+ This table violates 3NF because SCHOOLNAME depends on SCHOOLID, which is not the primary key.
|
|