http://www.perlmonks.org?node_id=129454


DESCRIPTION

This post describes how to normalize (to Second Normal Form) a table and presents a specific Perl script to migrate the data from the original table into the new tables. The underlying database engine used is mysql.


WHY NORMALIZE?

Because you don't want redundant data in your database do you? This could cause scalability issues down the road, because your database WILL grow in size. More importantly however, is the simple fact that redundancy leads to errors. Imagine storing the name of a course (10,000 of them) along side the student taking the course, then deciding to change the name of that course. If you had built a seperate table of courses and link the students to a 'course id' instead, then this task would be trivial. What if some courses were spelled 'CS101' and others 'CS0101'? Not fun in the former's case!


ORGINAL TABLE

The original table was a test table that i used from time to time. I populated it with info gleemed gleaned (thanks deprecated) from a Perl script via File::Find and MP3::Info. The problem was that I used a database table that is only First Normal Form at best:

# Diagram 1 +-----+-----------------+--------+-----------+------+ | id | title | artist | album | year | +-----+-----------------+--------+-----------+------+ | 494 | Isobel | Bjork | Telegram | 1996 | | 495 | Cover Me | Bjork | Telegram | 1996 | | 498 | Crying | Bjork | Debut | 1993 | | 499 | One Day | Bjork | Debut | 1993 | +-----+-----------------+--------+-----------+------+
First Normal Form requires that that a table have no repeating groups and no multi-valued attributes. Diagram 2 displays a design that breaks First Normal Form:
# Diagram 2 +-----+-----------------+--------+-----------+------+ | id | titles | artist | album | year | +-----+-----------------+--------+-----------+------+ | 494 | Isobel,Cover Me | Bjork | Telegram | 1996 | | 495 | Crying,One Day | Bjork | Debut | 1993 | +-----+-----------------+--------+-----------+------+
Here you see that an album's song titles are lumped together in one field. This is bad because you need more than SQL to seperate the values. This is violating the 'no multi-attributes' rule of First Normal Form.

How about Second Normal Form? Diagram 1 is not in Second Normal Form, which requires that any non-key field be dependent upon the entire key. The name of the album has no dependency upon the artist, also the name of the song has no dependency upon the album. Notice that i say 'name' - i am not saying that a song is not dependent upon the album it belongs to or the artist that the album belongs to - just those extra attributes of a song and an album are not dependent. The key to understanding Second Normal Form is that you will almost always refer to a row by a unique id number that really has no meaning other than to provide uniqueness.

This is the guts of this discussion - migrating a database table from First Normal Form to Second.


ANALYSIS

The first step is to analyze your existing database table. Here is the CREATE statement i used (using mysql):

CREATE TABLE songs ( id int(11) NOT NULL auto_increment, title varchar(65) default NULL, artist varchar(65) default NULL, album varchar(65) default NULL, year varchar(4) default NULL, PRIMARY KEY (id) );
I am skipping indexes as they are off-topic. Now, what attributes are NOT dependent upon the primary key? Title refers the song's title, it is dependent. Artist is the name of the artist, it is not dependent. Album is the name of the album, it is not dependent - same for year.

So, our new songs table - which i am going to rename to song - should look like this:

CREATE TABLE song ( id int(11) NOT NULL auto_increment, title varchar(65) default NULL, PRIMARY KEY (id) );
But how do you associate a song to the album it belongs to. You need a foreign key. Before i discuss foreign keys, we should back up and approach our new design 'top-down' instead of 'bottom-up'.

So, let's instead start at the top of the heirarchy - artist:

CREATE TABLE artist ( id int(11) NOT NULL auto_increment, name varchar(65) default NULL, PRIMARY KEY (id) );
This should be more digestable - all you have is a unique id (that is generated by the database) that uniquely identifies each row, and an attribute to store the name of the artist. You could go one step further and require each name to be unique, but again, that's beyond the scope of this discussion.

The ablum table looks like this:

CREATE TABLE album ( id int(11) NOT NULL auto_increment, title varchar(65) default NULL, year varchar(4) default NULL, artist_id int(11) NOT NULL, PRIMARY KEY (id), FOREIGN KEY index_artist_id(artist_id) REFERENCES artist(id) );
Same idea as artist - but this time we have a foreign key (artist_id) to associate an album row to an artist. An artist has albums, an album belongs to an artist. The foreign key artist_id is the value of the id column from the artist table that this row (this album) belongs to. The FOREIGN KEY declaration tells the database that it should check that this value exists before attempting to insert it into the album table. Since mysql still does not support enforced referential integrity, this statement is uneccesary, but other database vendors do support it, so i have including it. index_artist_id is an index that is created as a side effect, but that's the last time i will mention indexes.

Finally, the song table looks like this:

CREATE TABLE song ( id int(11) NOT NULL auto_increment, title varchar(65) default NULL, album_id int(11) NOT NULL, PRIMARY KEY (id), FOREIGN KEY index_album_id(album_id) REFERENCES album(id) );
Very similar to the album table. You could other fields if you like, such as track_time or track_number - these directly relate to a song, so their inclusion will not break Second Normal Form. Including something like songwriter would, however - if the entire band owns the songwriter credits, then all you need to do is include the band id number. To add a finer grain you might need to implement a band_member table or such.

Now that we have our table definitions, lets look at what the contents of each table sould look like:

mysql: select * from mp3.song limit 3; +----+--------------------------------+----------+ | id | title | album_id | +----+--------------------------------+----------+ | 1 | Until the End of the World | 1 | | 2 | So Cruel | 1 | | 3 | Tryin to Throw Your Arms Aroun | 1 | +----+--------------------------------+----------+ mysql: select * from mp3.album limit 3; +----+--------------+------+-----------+ | id | title | year | artist_id | +----+--------------+------+-----------+ | 1 | Achtung Baby | 1991 | 1 | | 2 | War | 1983 | 1 | | 3 | October | 1981 | 1 | +----+--------------+------+-----------+ mysql: select * from mp3.artist limit 3; +----+-------------+ | id | name | +----+-------------+ | 1 | U2 | | 2 | Bjork | | 3 | Pink Floyd | +----+-------------+
Notice how you can track the song 'So Cruel' to it's album (Achtung Baby) and (via the album table) to it's artist (U2):
mysql: select artist.name from artist inner join album on artist.id=album.artist_id inner join song on album.id=song.album_id where song.title='So Cruel'; +------+ | name | +------+ | U2 | +------+


MIGRATION

Now that we see where we are going - let's get there! The idea is to get all the data from the existing First Normal Form table, munge it, then insert the data into the appropriate new Second Normal Form tables.

So, how do we munge the existing data? Refering back to Diagram 1, if we could build a data structure like this:

$VAR1 = { 'Bjork' => { 'Debut' => { 'list' => [ 'Crying', 'One Day', 'Come To Me', 'Violently Happy' ], 'year' => '1993' }, 'Telegram' => { 'list' => [ 'Isobel', 'Cover Me', 'Army Of Me', 'Possibly Maybe' ], 'year' => '1996' } } 'U2' => { #etc... } };
then we could first grab the first key and insert that into the artist table. Then we find the unique id that was assigned to the newly inserted row and use that as the foreign key for the albums. We do the same for each album, except we save the list of songs until after the new album has been inserted. Same thing - grab the newly assigned unique id and us it for the foreign key for the songs.

Confusing? How about some code? My database name is mp3.

use strict; use DBI; # connect my $dbh = DBI->connect( qw(DBI:mysql:mp3:host user pass), { RaiseError => 1 } ); # get existing data as one big ole 2-d array my $songs = $dbh->selectall_arrayref(" select artist,album,title,year from songs "); # munge data into new data structure my %hash; foreach my $row (@$songs) { my ($artist,$album,$title,$year) = @$row; # auto-vivification is what makes this all so cool push @{$hash{$artist}->{$album}->{'list'}},$title; # getting the year is tricky - since i know they # are all the same - i can safely or-cache it ;) $hash{$artist}->{$album}->{'year'} ||= $year; } foreach my $artist (keys %hash) { # insert this artist - get unique id for albums $dbh->prepare(" insert into artist(name) values(?) ")->execute($artist); my $artist_id = last_id(); foreach my $album (keys %{$hash{$artist}}) { my $year = $hash{$artist}->{$album}->{'year'}; # insert this album - get unique id for songs $dbh->prepare(" insert into album(title,year,artist_id) values(?,?,?) ")->execute($album,$year,$artist_id); my $album_id = last_id(); foreach my $song (@{$hash{$artist}->{$album}->{'list'}}) { # insert this song $dbh->prepare(" insert into song(title,album_id) values(?,?) ")->execute($song,$album_id); } } } $dbh->disconnect; # mysql specific - other vendor's milleage will vary sub last_id { $dbh->selectall_arrayref(" select LAST_INSERT_ID() ")->[0]->[0] }
And that's it!


CONCLUSION

Migrating a First Normal Form database table to multiple Second Normal Form tables is never a generic task - the techniques vary from table to table. The method i chose to derive my new tables was very specific to my needs, and as a result, my code 'as is' is only useful to these specific tables.

I should also note the normalization is not the ultimate solution for every database. Normalization forces JOIN's to be used in order to retrieve the equivalent row of the orginal table, and JOIN's can be quite expensive. In the real world, you will find many databases that have been DE-normalized in order to speed up certain critical queries. To tell the truth, most databases i have seen in the real world aren't even normalized in the first place, probably because JOIN syntax is complicated and intimidating. Compare these two queries:

# orginal First Normal Form table mysql: select artist,album,title,year from songs order by artist,year,album,title limit 2; +----------------+-------------+-----------------+------+ | artist | album | title | year | +----------------+-------------+-----------------+------+ | 10,000 Maniacs | In My Tribe | A Campfire Song | 1987 | | 10,000 Maniacs | In My Tribe | Cherry Tree | 1987 | +----------------+-------------+-----------------+------+ 2 rows in set (0.02 sec) # new Second Normal Form tables mysql: select artist.name, album.title as album, song.title, album.year from artist inner join album on artist.id=album.artist_id inner join song on album.id=song.album_id order by artist.name,album.year,album.title,song.title limit 2; +----------------+-------------+-----------------+------+ | name | album | title | year | +----------------+-------------+-----------------+------+ | 10,000 Maniacs | In My Tribe | A Campfire Song | 1987 | | 10,000 Maniacs | In My Tribe | Cherry Tree | 1987 | +----------------+-------------+-----------------+------+ 2 rows in set (0.04 sec)
The resulting SQL is much(update: i should have been more careful with my words) a bit more complex, and the results are even a bit slower. Before you go off thinking that normalization is worse, see dws's excellent rebutal below. Also realize that with normalization, you can get at the pieces easier - what if you misspelled Bjork?
# 1st form mysql: update songs set name="Bjork" where name="Bork"; Rows matched: 14 Changed: 14 # 2nd form mysql: update artist set name="Bjork" where name="Bork"; Rows matched: 1 Changed: 1
The first example could be 14 rows or 14,000,000 rows. The second example will always be 1.

There are many, many papers, tutorials, discussions, etc. available on the Web - here is a resource that i found particularly useful: http://www.palslib.com/Fundamentals/Database_Design.html.

Good luck!

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
F--F--F--F--F--F--F--F--
(the triplet paradiddle)