|Perl Monk, Perl Meditation|
This post describes how to normalize (to Second Normal Form) a table and presents a specific Perl script to migrate the data from the original table into the new tables. The underlying database engine used is mysql.
Because you don't want redundant data in your database do you? This could cause scalability issues down the road, because your database WILL grow in size. More importantly however, is the simple fact that redundancy leads to errors. Imagine storing the name of a course (10,000 of them) along side the student taking the course, then deciding to change the name of that course. If you had built a seperate table of courses and link the students to a 'course id' instead, then this task would be trivial. What if some courses were spelled 'CS101' and others 'CS0101'? Not fun in the former's case!
The original table was a test table that i used from time
to time. I populated it with info
First Normal Form requires that that a table have no repeating groups and no multi-valued attributes. Diagram 2 displays a design that breaks First Normal Form:
Here you see that an album's song titles are lumped together in one field. This is bad because you need more than SQL to seperate the values. This is violating the 'no multi-attributes' rule of First Normal Form.
How about Second Normal Form? Diagram 1 is not in Second Normal Form, which requires that any non-key field be dependent upon the entire key. The name of the album has no dependency upon the artist, also the name of the song has no dependency upon the album. Notice that i say 'name' - i am not saying that a song is not dependent upon the album it belongs to or the artist that the album belongs to - just those extra attributes of a song and an album are not dependent. The key to understanding Second Normal Form is that you will almost always refer to a row by a unique id number that really has no meaning other than to provide uniqueness.
This is the guts of this discussion - migrating a database table from First Normal Form to Second.
The first step is to analyze your existing database table. Here is the CREATE statement i used (using mysql):
I am skipping indexes as they are off-topic. Now, what attributes are NOT dependent upon the primary key? Title refers the song's title, it is dependent. Artist is the name of the artist, it is not dependent. Album is the name of the album, it is not dependent - same for year.
So, our new songs table - which i am going to rename to song - should look like this:
But how do you associate a song to the album it belongs to. You need a foreign key. Before i discuss foreign keys, we should back up and approach our new design 'top-down' instead of 'bottom-up'.
So, let's instead start at the top of the heirarchy - artist:
This should be more digestable - all you have is a unique id (that is generated by the database) that uniquely identifies each row, and an attribute to store the name of the artist. You could go one step further and require each name to be unique, but again, that's beyond the scope of this discussion.
The ablum table looks like this:
Same idea as artist - but this time we have a foreign key (artist_id) to associate an album row to an artist. An artist has albums, an album belongs to an artist. The foreign key artist_id is the value of the id column from the artist table that this row (this album) belongs to. The FOREIGN KEY declaration tells the database that it should check that this value exists before attempting to insert it into the album table. Since mysql still does not support enforced referential integrity, this statement is uneccesary, but other database vendors do support it, so i have including it. index_artist_id is an index that is created as a side effect, but that's the last time i will mention indexes.
Finally, the song table looks like this:
Very similar to the album table. You could other fields if you like, such as track_time or track_number - these directly relate to a song, so their inclusion will not break Second Normal Form. Including something like songwriter would, however - if the entire band owns the songwriter credits, then all you need to do is include the band id number. To add a finer grain you might need to implement a band_member table or such.
Now that we have our table definitions, lets look at what the contents of each table sould look like:
Notice how you can track the song 'So Cruel' to it's album (Achtung Baby) and (via the album table) to it's artist (U2):
Now that we see where we are going - let's get there! The idea is to get all the data from the existing First Normal Form table, munge it, then insert the data into the appropriate new Second Normal Form tables.
So, how do we munge the existing data? Refering back to Diagram 1, if we could build a data structure like this:
then we could first grab the first key and insert that into the artist table. Then we find the unique id that was assigned to the newly inserted row and use that as the foreign key for the albums. We do the same for each album, except we save the list of songs until after the new album has been inserted. Same thing - grab the newly assigned unique id and us it for the foreign key for the songs.
Confusing? How about some code? My database name is mp3.
And that's it!
Migrating a First Normal Form database table to multiple Second Normal Form tables is never a generic task - the techniques vary from table to table. The method i chose to derive my new tables was very specific to my needs, and as a result, my code 'as is' is only useful to these specific tables.
I should also note the normalization is not the ultimate solution for every database. Normalization forces JOIN's to be used in order to retrieve the equivalent row of the orginal table, and JOIN's can be quite expensive. In the real world, you will find many databases that have been DE-normalized in order to speed up certain critical queries. To tell the truth, most databases i have seen in the real world aren't even normalized in the first place, probably because JOIN syntax is complicated and intimidating. Compare these two queries:
The resulting SQL is
The first example could be 14 rows or 14,000,000 rows. The second example will always be 1.
There are many, many papers, tutorials, discussions, etc. available on the Web - here is a resource that i found particularly useful: http://www.palslib.com/Fundamentals/Database_Design.html.