http://www.perlmonks.org?node_id=359214

Title: Normalisation

Normalisation appears to be widely misunderstood and misrepresented. Inaccurate, misleading, and outright bogus definitions are common, and it seems that even reputable organisations have got it wrong, for instance MySQL AB gets it wrong in their document An Introduction to Database Normalisation.

In this post I want to provide an unequivocal definition based primarily on the writings of C.J. Date.

Although I hope that this definition is accurate, it might stir some controversy, since a number of monastarians appear to have absorbed misinformation, and one has even produced a CPAN module that claims to transform a MySQL table "from 1st to 2nd normal form", when in fact it does no such thing.

I should start by equating some formal and informal terms. Note that the formal terms have a precise definition that is stricter and usually means more than the informal approximations. For example, the definition of a Relation precludes the existence of duplicate Tuples (Rows), whereas the definition of a Table does not. This distinction between the formal and informal is, I think, an important point, and it may explain some of the confusion.

Formal nameInformal name or meaning
RelationTable (structure + data)
Relation Variable (relvar)Table (structure only)
TupleRow or record (structure + data)
DomainData Type
AttributeColumn or Field

It's important that we understand this vocabulary, so here are some pictures.

This is a Relation (Table) of Suppliers

    
    +----+---------+--------+--------+
    | ID | SURNAME | STATUS | CITY   |
    +----+---------+--------+--------+
    |  1 |   Smith |     20 | London |
    |  2 |   Jones |     10 | Paris  |
    |  3 |   Blake |     30 | Paris  |
    |  4 |   Clark |     20 | London |
    |  5 |   Adams |     30 | Athens |
    +----+---------+--------+--------+

And this is a Tuple (Row)


    +----+---------+--------+--------+
    | ID | SURNAME | STATUS | CITY   |
    +----+---------+--------+--------+
  +------------------------------------+ 
  | |  1 |   Smith |     20 | London | |-- Tuple (row or record)
  +------------------------------------+  
    |  2 |   Jones |     10 | Paris  |
    |  3 |   Blake |     30 | Paris  |
    |  4 |   Clark |     20 | London |
    |  5 |   Adams |     30 | Athens |
    +----+---------+--------+--------+

Here's another Tuple:


    +----+---------+--------+--------+
    | ID | SURNAME | STATUS | CITY   |
    +----+---------+--------+--------+
    |  1 |   Smith |     20 | London |
  +------------------------------------+  
  | |  2 |   Jones |     10 | Paris  | |-- Another Tuple
  +------------------------------------+  
    |  3 |   Blake |     30 | Paris  |
    |  4 |   Clark |     20 | London |
    |  5 |   Adams |     30 | Athens |
    +----+---------+--------+--------+

These are the Attributes (Columns):

 
                               Name  Domain
                                 |      |
  +------------Attributes--------|------|--------------------+
  | +--------+---------------+---|------|-----+------------+ |
  | | ID: ID | SURNAME: NAME | STATUS: STATUS | CITY: CITY | |
  | +--------+---------------+----------------+------------+ |
  +----------------------------------------------------------+  
    |      1 |         Smith |          20 |     London |
    |      2 |         Jones |          10 |     Paris  |
    |      3 |         Blake |          30 |     Paris  |
    |      4 |         Clark |          20 |     London |
    |      5 |         Adams |          30 |     Athens |
    +--------+---------------+-------------+------------+

Notice that each Attribute has both a Name and a Domain. That is the definition of an Attribute.

Domains (Data Types) represent the finite set of legal values for an attribute.

Finally, a Relvar (Table Structure) is the unordered collection of Attributes (not including the data).


  +------------Relvar of Suppliers----------------------------+
  | +---------+---------------+----------------+------------+ |
  | | ID: INT | SURNAME: NAME | STATUS: STATUS | CITY: NAME | |
  | +---------+---------------+----------------+------------+ |
  +-----------------------------------------------------------+  

Now for the definition of Normalisation.

Normalisation

Normalisation is the process of redesigning table structures (relvars) to make them comply with normal forms. The end goal of normalisation is a set of tables that each contain exactly one type of entity.

Normalisation is not directly concerned with storage efficiency or scalability. It formalises some good design principles, but it is not the entirety of good database design.

There are many normal forms, but I intend to give definitions of these normal forms only up to third normal form, since the rest are less relevant, and in any case that isn't where the bulk of confusion appears to lie.

Each successive normal form builds upon its predecessor. To say that a relvar is in third normal form is also to imply that the relvar is in first and second normal forms. Higher normal forms are more desirable than lower normal forms, in the sense that they are closer to the end goal of normalisation.

First Normal Form (1NF)

A relvar is in 1NF if and only if, in every legal value of that relvar, each tuple contains exactly one value for each attribute.

Informally; a table is in First Normal Form if every column of each row contains exactly one indivisable value.

"Indivisable" is context-sensitive. For instance, it probably isn't sensible to store the individual bits of an integer, but it might be sensible to divide a person's name into forename and surname. In some cases it might be sensible to divide a date into year, month, day etc. In most cases we shouldn't store a list of email addresses in a single attribute. Like a lot of things, it depends on the circumstances.

In contrast to the above definition, here are some incorrect definitions of first normal form

WRONG The First Normal Form (or 1NF) involves removal of redundant data from horizontal rows. We want to ensure that there is no duplication of data in a given row, and that every column stores the least amount of information possible (making the field atomic). - Mike Hillyer, An Introduction to Database Normalisation (MySQL AB)
WRONG The purpose of first normal form (1NF) is to eliminate repeating groups of attributes in an entity. - David Besche, MSCE Training Guide: SQL Server 7 Database Design.
WRONG Eliminate duplicative [sic] columns from the same table. Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key). - Mike Chapple, Database Normalization Basics
WRONG A table is in first normal form (1NF) if there are no repeating groups. A repeating group is a set of logically related fields or values that occur multiple times in one record. Normalising Your Database

Second Normal Form (2NF)

It's worth noting that second normal form is only applicable where the primary key of a table is composed of more than one attribute. Keys composed of more than one attribute are known as composite keys.

To repeat this important point; if the primary key is not a composite key, 2NF does not apply.

Here's the definition

A relvar is in 2NF if and only if it is in 1NF and every nonkey attribute is irreducibly dependent on the primary key.

Informally; a table is in second normal form if every non-primary-key attribute of that table is dependant on the whole of the primary key, that is; every non-primary-key attribute is identified by the whole of the primary key.

The term "X depends on Y" means that a relationship exists between attributes X and Y such that a given value of Y always implies a value of X. For example, a SCHOOLID of "9876" always implies a SCHOOLNAME of "Wellington". This relationship is formally known as a Functional Dependence.

Here's an example of a table that violates 2NF:


+~~~~~~~~~~~~~~~~~~~~~~+---------------+---------+------------+
| STUDENTID | SCHOOLID | DATEOFARRIVAL | SURNAME | SCHOOLNAME |
+-----------+----------+---------------+---------+------------+
|      1234 |      999 | 2005-09-01    | Smith   | Wellington |
+~~~~~~~~~~~~~~~~~~~~~~+---------------+---------+------------+

In this table, the primary key is a composite of STUDENTID and SCHOOLID, and the purpose of the table is to store the date upon which a Student first arrives at a School.

The 2NF violation exists because SURNAME and SCHOOLNAME depend on only part of the primary key.

To achieve 2NF, SURNAME and SCHOOLNAME should be removed from this table (and placed elsewhere).

Third Normal Form (3NF)

A relvar is in 3NF if and only if it is in 2NF and every nonkey attribute is nontransitively dependent on the primary key.

Informally; a table is in third normal form if all its non-primary-key attributes do not depend on any other non-primary-key attribute.

Here's an example of a table that violates 3NF:


+~~~~~~~~~~~+---------+------------+----------+-------------+
| STUDENTID | SURNAME |	       DOA | SCHOOLID |  SCHOOLNAME |
+~~~~~~~~~~~+---------+------------+----------+-------------+
|      1234 |   Smith | 2005-09-01 |     9876 |  Wellington | 
+~~~~~~~~~~~+---------+------------+----------+-------------+

This table violates 3NF because SCHOOLNAME depends on SCHOOLID, which is not the primary key.