comment on

Have you ever gotten more results from an SQL query than you thought you should? Have those results often been duplicates? Do you rewrite your query only to discover that you get a different set of duplicates? Does this annoy you? Then read on.

I finally got my grubby little hands on a copy of Database In Depth by C. J. Date. I've just finished Chapter 3 and so far, I've been blown away. My irritations with the shortcomings of databases have been wrong. They're generally shortcomings with SQL and people's (mis)understanding of relational theory.

The main gist of the book is simple: database management systems (DBMS) don't pay attention to relational theory. Relational theory is grounded in set theory and predicate logic. These are areas of math whose origins go back to the ancient Greeks. They are very well understood so it's sort of a mystery to me why database vendors, instead of falling back on knowledge which has stood the test of time, insist upon following the herd and trying to hack in "fixes" to broken solutions rather than investigating root causes.

For example, set theory says, amongst other things, that the set of integers (1,2,3) is the same as the set of integers (2,3,1). The order of items in a set is irrelevant. Further, the set of integers (1,2,3) cannot be duplicated. Just as there is only one "17", there is only one "(1,2,3)". Thus, if a you have a database query which returns "(1,2,3)" and "(1,2,3)", it's really not respecting set theory. It's actually returning a multiset, also known as a "bag". But so what? Why does this matter?

Consider the following two database tables:

+-------------+ +------------------+
| Products    | | Supplier_Product |
+-------------+ +------------------+
| PNO | PNAME | | SNO     | PNO    |
+-----+-------+ +---------+--------+
| P1  | Screw | | S1      | P1     |
| P1  | Screw | | S1      | P1     |
| P1  | Screw | | S1      | P2     |
| P2  | Screw | +------------------+
+-------------+
[download]

Now right off the bat, if you know the rules of normalization you know these tables are not properly designed. However, for the sake of simplicity, bear with me.

Let's say that we want to find all part numbers for parts which are either screws, supplied by "S1", or both. It's pretty clear from looking at the tables the answer is "P1" and "P2". How might we write a query which answers this?

SELECT parts.pno 
FROM   parts
WHERE  parts.pname = 'Screw'
   OR  parts.pno IN
      (SELECT supplier_parts.pno
       FROM   supplier_parts
       WHERE  supplier_parts.sno = 'S1')
[download]

And that returns 3 P1s and a P2. Hmm, that's not right. How about:

SELECT supplier_parts.pno 
FROM   supplier_parts
WHERE  supplier_parts.sno = 'S1'
   OR  supplier_parts.pno IN
      (SELECT parts.pno
       FROM   parts
       WHERE  parts.pname = 'Screw')
[download]

Two P1s and a P2.

SELECT parts.pno 
FROM   parts, supplier_parts
WHERE  ( supplier_parts.sno = 'S1' AND
         supplier_parts.pno = parts.pno )
   OR  parts.pname = 'Screw'
[download]

Ah, much better. Now we get nine P1s and three P2s.

At this point, I imagine there are quite a few monks thinking "idiot. Just use SELECT DISTINCT". So let's try that out with a little Perl program which runs three SQL queries, shows their results and then reruns them with "DISTINCT":

#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;
use DBI;
my $DATABASE = 'distinct.db';
my $P        = 'parts';
my $SP       = 'supplier_parts';

my $dbh = DBI->connect( "dbi:SQLite:dbname=$DATABASE", "", "" );

create_tables($dbh);
insert_data($dbh);

display_results(select_statements());
display_results(distinct_select_statements());

unlink $DATABASE or die "Could not unlink $DATABASE: $!";

sub display_results {
    my @sql_statements = @_;
    foreach my $select ( @sql_statements ) {
        my @results = @{ $dbh->selectall_arrayref($select) };
        my %results_for;
        foreach my $result (@results) {
            no warnings 'uninitialized';
            $results_for{$select}{ $result->[0] }++;
        }
        while ( my ( $sql, $results ) = each %results_for ) {
            print "$sql\n\tResults: P1: $results->{P1} P2: $results->{
+P2}\n\n    --\n";
        }
    }
}

sub create_tables {
    my $dbh = shift;
    $dbh->do("CREATE TABLE $P ( pno, pname )");
    $dbh->do("CREATE TABLE $SP ( sno, pno )");
}

sub insert_data {
    my $dbh = shift;
    $dbh->do("INSERT INTO $P VALUES ('P1', 'Screw')");
    $dbh->do("INSERT INTO $P VALUES ('P1', 'Screw')");
    $dbh->do("INSERT INTO $P VALUES ('P1', 'Screw')");
    $dbh->do("INSERT INTO $P VALUES ('P2', 'Screw')");

    $dbh->do("INSERT INTO $SP VALUES ('S1', 'P1')");
    $dbh->do("INSERT INTO $SP VALUES ('S1', 'P1')");
    $dbh->do("INSERT INTO $SP VALUES ('S1', 'P2')");
}

sub distinct_select_statements {
    my @statements = map { s/SELECT/SELECT DISTINCT/g; $_ } select_sta
+tements();
}

sub select_statements {
    my @statements;
    push @statements => <<"    END_SQL";
    SELECT $P.pno 
    FROM   $P
    WHERE  $P.pname = 'Screw'
    OR     $P.pno IN
        (SELECT $SP.pno
         FROM   $SP
         WHERE  $SP.sno = 'S1')
    END_SQL
    push @statements => <<"    END_SQL";
    SELECT $SP.pno 
    FROM   $SP
    WHERE  $SP.sno = 'S1'
    OR     $SP.pno IN
        (SELECT $P.pno
         FROM   $P
         WHERE  $P.pname = 'Screw')
    END_SQL
    push @statements => <<"    END_SQL";
    SELECT $P.pno 
    FROM   $P, $SP
    WHERE  ( $SP.sno = 'S1' AND
             $SP.pno = $P.pno )
       OR  $P.pname = 'Screw'
    END_SQL
    return @statements;
}
[download]

And the output (minus the SQL):

Regular SQL:

   Results: P1: 3 P2: 1
   Results: P1: 2 P2: 1
   Results: P1: 9 P2: 3

DISTINCT:

   Results: P1: 1 P2: 1
   Results: P1: 1 P2: 1
   Results: P1: 1 P2: 1

As you can see, by forcing the SQL to only return sets instead of bags, we consistently get correct results regardless of how we structure our query.

In fact, one of the founders of relational theory, C.J. Date, recommends that every SQL SELECT statement be written as SELECT DISTINCT .... Unfortunately, many folks get upset with this for a variety of reasons. First, DBMS do a horrible job of optimizing DISTINCT queries. There are many cases where the DISTINCT can be safely ignored but in practice, using DISTINCT with large sets of data and complicated queries can have an abysmal impact on performance. I think this can be put into the "premature optimization" category, though. Until we know that DISTINCT is causing an impact, isn't it better to improve the chance of getting correct results and worry about performance afterwards?

The second major objection is more subtle. With a properly designed database, we shouldn't have duplicate rows in the database. The first problem with that is obvious. Take the tables I show above and add a few arbitrary columns to each. The SQL generates the same results. However, even if you struggle to eliminate repeated subsets of data in your rows, you still have the fundamental problem that you're working with SQL. Imagine the following query:

SELECT city 
FROM   cities, customers
WHERE  customers.city_id = cities.city_id
  AND  customers.credit_rating > 700
[download]

It's clear that we can easily get duplicate cities in the results. Thus, once again, SELECT DISTINCT is your friend.

The solution? Write all of your SELECT statements with DISTINCT and only remove the DISTINCT when you have a known performance problem.

For added fun, read why NULLs cause problems in databases. (Link fixed. Thanks hossman!)

Update: I should point out that the above advice pertains to duplicate data. That sounds redundant, but what if you have the city of Paris, France and the city of Paris, Texas? Just because the name is the same does not meet that the data they represent are the same. In this case, it's important to SELECT DISTINCT enough information to distinguish between things which look identical but aren't.

Cheers,
Ovid

New address of my CGI Course.

In reply to (OT) Why SQL Sucks (with a little Perl to fix it) by Ovid

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


XP is just a number
	PerlMonks