|Problems? Is your data what you think it is?|
(OT) Why SQL Sucks (with a little Perl to fix it)by Ovid (Cardinal)
|on Dec 11, 2005 at 01:23 UTC||Need Help??|
Have you ever gotten more results from an SQL query than you thought you should? Have those results often been duplicates? Do you rewrite your query only to discover that you get a different set of duplicates? Does this annoy you? Then read on.
I finally got my grubby little hands on a copy of Database In Depth by C. J. Date. I've just finished Chapter 3 and so far, I've been blown away. My irritations with the shortcomings of databases have been wrong. They're generally shortcomings with SQL and people's (mis)understanding of relational theory.
The main gist of the book is simple: database management systems (DBMS) don't pay attention to relational theory. Relational theory is grounded in set theory and predicate logic. These are areas of math whose origins go back to the ancient Greeks. They are very well understood so it's sort of a mystery to me why database vendors, instead of falling back on knowledge which has stood the test of time, insist upon following the herd and trying to hack in "fixes" to broken solutions rather than investigating root causes.
For example, set theory says, amongst other things, that the set of integers (1,2,3) is the same as the set of integers (2,3,1). The order of items in a set is irrelevant. Further, the set of integers (1,2,3) cannot be duplicated. Just as there is only one "17", there is only one "(1,2,3)". Thus, if a you have a database query which returns "(1,2,3)" and "(1,2,3)", it's really not respecting set theory. It's actually returning a multiset, also known as a "bag". But so what? Why does this matter?
Consider the following two database tables:
Now right off the bat, if you know the rules of normalization you know these tables are not properly designed. However, for the sake of simplicity, bear with me.
Let's say that we want to find all part numbers for parts which are either screws, supplied by "S1", or both. It's pretty clear from looking at the tables the answer is "P1" and "P2". How might we write a query which answers this?
And that returns 3 P1s and a P2. Hmm, that's not right. How about:
Two P1s and a P2.
Ah, much better. Now we get nine P1s and three P2s.
At this point, I imagine there are quite a few monks thinking "idiot. Just use SELECT DISTINCT". So let's try that out with a little Perl program which runs three SQL queries, shows their results and then reruns them with "DISTINCT":
And the output (minus the SQL):
Regular SQL: Results: P1: 3 P2: 1 Results: P1: 2 P2: 1 Results: P1: 9 P2: 3 DISTINCT: Results: P1: 1 P2: 1 Results: P1: 1 P2: 1 Results: P1: 1 P2: 1
As you can see, by forcing the SQL to only return sets instead of bags, we consistently get correct results regardless of how we structure our query.
In fact, one of the founders of relational theory, C.J. Date, recommends that every SQL SELECT statement be written as SELECT DISTINCT .... Unfortunately, many folks get upset with this for a variety of reasons. First, DBMS do a horrible job of optimizing DISTINCT queries. There are many cases where the DISTINCT can be safely ignored but in practice, using DISTINCT with large sets of data and complicated queries can have an abysmal impact on performance. I think this can be put into the "premature optimization" category, though. Until we know that DISTINCT is causing an impact, isn't it better to improve the chance of getting correct results and worry about performance afterwards?
The second major objection is more subtle. With a properly designed database, we shouldn't have duplicate rows in the database. The first problem with that is obvious. Take the tables I show above and add a few arbitrary columns to each. The SQL generates the same results. However, even if you struggle to eliminate repeated subsets of data in your rows, you still have the fundamental problem that you're working with SQL. Imagine the following query:
It's clear that we can easily get duplicate cities in the results. Thus, once again, SELECT DISTINCT is your friend.
The solution? Write all of your SELECT statements with DISTINCT and only remove the DISTINCT when you have a known performance problem.
Update: I should point out that the above advice pertains to duplicate data. That sounds redundant, but what if you have the city of Paris, France and the city of Paris, Texas? Just because the name is the same does not meet that the data they represent are the same. In this case, it's important to SELECT DISTINCT enough information to distinguish between things which look identical but aren't.