PerlMonks  

(OT) Why SQL Sucks (with a little Perl to fix it)

by Ovid (Cardinal)
on Dec 11, 2005 at 01:23 UTC

Have you ever gotten more results from an SQL query than you thought you should? Have those results often been duplicates? Do you rewrite your query only to discover that you get a different set of duplicates? Does this annoy you? Then read on.

I finally got my grubby little hands on a copy of Database In Depth by C. J. Date. I've just finished Chapter 3 and so far, I've been blown away. My irritations with the shortcomings of databases have been misdirected: they're generally shortcomings of SQL and of people's (mis)understanding of relational theory.

The main gist of the book is simple: database management systems (DBMS) don't pay attention to relational theory. Relational theory is grounded in set theory and predicate logic. These are areas of math whose origins go back to the ancient Greeks. They are very well understood so it's sort of a mystery to me why database vendors, instead of falling back on knowledge which has stood the test of time, insist upon following the herd and trying to hack in "fixes" to broken solutions rather than investigating root causes.

For example, set theory says, amongst other things, that the set of integers (1,2,3) is the same as the set of integers (2,3,1). The order of items in a set is irrelevant. Further, the set of integers (1,2,3) cannot be duplicated. Just as there is only one "17", there is only one "(1,2,3)". Thus, if you have a database query which returns "(1,2,3)" and "(1,2,3)", it's really not respecting set theory. It's actually returning a multiset, also known as a "bag". But so what? Why does this matter?
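To make the set-versus-bag distinction concrete, here's a quick sketch (in Python, purely for brevity) contrasting the two:

```python
# A set collapses duplicates and ignores order; a bag (here a list) does neither.
bag_a = [1, 2, 3]
bag_b = [2, 3, 1, 1]

set_a = set(bag_a)
set_b = set(bag_b)

print(set_a == set_b)          # True: {1, 2, 3} and {2, 3, 1} are the same set
print(bag_a == bag_b)          # False: as bags, order and multiplicity matter
print(len(bag_b), len(set_b))  # 4 3: the duplicate "1" vanishes in the set
```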

Consider the following two database tables:

+-------------+     +------------------+
|  Products   |     | Supplier_Product |
+-------------+     +------------------+
| PNO | PNAME |     |   SNO   |  PNO   |
+-----+-------+     +---------+--------+
| P1  | Screw |     |   S1    |  P1    |
| P1  | Screw |     |   S1    |  P1    |
| P1  | Screw |     |   S1    |  P2    |
| P2  | Screw |     +------------------+
+-------------+

Now right off the bat, if you know the rules of normalization you know these tables are not properly designed. However, for the sake of simplicity, bear with me.

Let's say that we want to find all part numbers for parts which are either screws, supplied by "S1", or both. It's pretty clear from looking at the tables the answer is "P1" and "P2". How might we write a query which answers this?

SELECT parts.pno
  FROM parts
 WHERE parts.pname = 'Screw'
    OR parts.pno IN (SELECT supplier_parts.pno
                       FROM supplier_parts
                      WHERE supplier_parts.sno = 'S1')

And that returns 3 P1s and a P2. Hmm, that's not right. How about:

SELECT supplier_parts.pno
  FROM supplier_parts
 WHERE supplier_parts.sno = 'S1'
    OR supplier_parts.pno IN (SELECT parts.pno
                                FROM parts
                               WHERE parts.pname = 'Screw')

Two P1s and a P2.

SELECT parts.pno
  FROM parts, supplier_parts
 WHERE ( supplier_parts.sno = 'S1'
         AND supplier_parts.pno = parts.pno )
    OR parts.pname = 'Screw'

Ah, much better. Now we get nine P1s and three P2s.

At this point, I imagine there are quite a few monks thinking "idiot. Just use SELECT DISTINCT". So let's try that out with a little Perl program which runs three SQL queries, shows their results and then reruns them with "DISTINCT":

#!/usr/bin/perl

use strict;
use warnings;
use DBI;

my $DATABASE = 'distinct.db';
my $P        = 'parts';
my $SP       = 'supplier_parts';

my $dbh = DBI->connect( "dbi:SQLite:dbname=$DATABASE", "", "" );

create_tables($dbh);
insert_data($dbh);
display_results( select_statements() );
display_results( distinct_select_statements() );

unlink $DATABASE or die "Could not unlink $DATABASE: $!";

sub display_results {
    my @sql_statements = @_;
    foreach my $select (@sql_statements) {
        my @results = @{ $dbh->selectall_arrayref($select) };
        my %results_for;
        foreach my $result (@results) {
            no warnings 'uninitialized';
            $results_for{$select}{ $result->[0] }++;
        }
        while ( my ( $sql, $results ) = each %results_for ) {
            print "$sql\n\tResults:  P1: $results->{P1}  P2: $results->{P2}\n\n --\n";
        }
    }
}

sub create_tables {
    my $dbh = shift;
    $dbh->do("CREATE TABLE $P ( pno, pname )");
    $dbh->do("CREATE TABLE $SP ( sno, pno )");
}

sub insert_data {
    my $dbh = shift;
    $dbh->do("INSERT INTO $P VALUES ('P1', 'Screw')");
    $dbh->do("INSERT INTO $P VALUES ('P1', 'Screw')");
    $dbh->do("INSERT INTO $P VALUES ('P1', 'Screw')");
    $dbh->do("INSERT INTO $P VALUES ('P2', 'Screw')");
    $dbh->do("INSERT INTO $SP VALUES ('S1', 'P1')");
    $dbh->do("INSERT INTO $SP VALUES ('S1', 'P1')");
    $dbh->do("INSERT INTO $SP VALUES ('S1', 'P2')");
}

sub distinct_select_statements {
    return map { s/SELECT/SELECT DISTINCT/g; $_ } select_statements();
}

sub select_statements {
    my @statements;

    push @statements => <<"END_SQL";
SELECT $P.pno
  FROM $P
 WHERE $P.pname = 'Screw'
    OR $P.pno IN (SELECT $SP.pno
                    FROM $SP
                   WHERE $SP.sno = 'S1')
END_SQL

    push @statements => <<"END_SQL";
SELECT $SP.pno
  FROM $SP
 WHERE $SP.sno = 'S1'
    OR $SP.pno IN (SELECT $P.pno
                     FROM $P
                    WHERE $P.pname = 'Screw')
END_SQL

    push @statements => <<"END_SQL";
SELECT $P.pno
  FROM $P, $SP
 WHERE ( $SP.sno = 'S1'
         AND $SP.pno = $P.pno )
    OR $P.pname = 'Screw'
END_SQL

    return @statements;
}

And the output (minus the SQL):

Regular SQL:

   Results: P1: 3 P2: 1
   Results: P1: 2 P2: 1
   Results: P1: 9 P2: 3

DISTINCT:

   Results: P1: 1 P2: 1
   Results: P1: 1 P2: 1
   Results: P1: 1 P2: 1

As you can see, by forcing the SQL to only return sets instead of bags, we consistently get correct results regardless of how we structure our query.

In fact, C. J. Date, one of the foremost authorities on relational theory, recommends that every SQL SELECT statement be written as SELECT DISTINCT .... Unfortunately, many folks get upset with this for a variety of reasons. First, DBMSs do a horrible job of optimizing DISTINCT queries. There are many cases where the DISTINCT can be safely ignored, but in practice, using DISTINCT with large sets of data and complicated queries can have an abysmal impact on performance. I think this can be put into the "premature optimization" category, though. Until we know that DISTINCT is causing an impact, isn't it better to improve the chance of getting correct results and worry about performance afterwards?

The second major objection is more subtle. With a properly designed database, we shouldn't have duplicate rows in the database. The first problem with that is obvious. Take the tables I show above and add a few arbitrary columns to each. The SQL generates the same results. However, even if you struggle to eliminate repeated subsets of data in your rows, you still have the fundamental problem that you're working with SQL. Imagine the following query:

SELECT city
  FROM cities, customers
 WHERE customers.city_id = cities.city_id
   AND customers.credit_rating > 700

It's clear that we can easily get duplicate cities in the results. Thus, once again, SELECT DISTINCT is your friend.
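A quick way to see this for yourself: the sketch below (Python with an in-memory SQLite database; the schema and data are invented to match the query above) shows each city appearing once per matching customer, and DISTINCT collapsing the duplicates:

```python
import sqlite3

# Hypothetical schema and data matching the query above.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE cities    (city_id INTEGER, city TEXT);
    CREATE TABLE customers (cust_id INTEGER, city_id INTEGER, credit_rating INTEGER);
    INSERT INTO cities    VALUES (1, 'Portland'), (2, 'London');
    INSERT INTO customers VALUES (10, 1, 720), (11, 1, 750), (12, 2, 800);
""")

plain = db.execute("""
    SELECT city FROM cities, customers
    WHERE customers.city_id = cities.city_id
      AND customers.credit_rating > 700
""").fetchall()
distinct = db.execute("""
    SELECT DISTINCT city FROM cities, customers
    WHERE customers.city_id = cities.city_id
      AND customers.credit_rating > 700
""").fetchall()

print(plain)     # Portland appears twice: one row per matching customer
print(distinct)  # each city exactly once
```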

The solution? Write all of your SELECT statements with DISTINCT and only remove the DISTINCT when you have a known performance problem.

For added fun, read why NULLs cause problems in databases. (Link fixed. Thanks hossman!)

Update: I should point out that the above advice pertains to duplicate data. That sounds redundant, but what if you have the city of Paris, France and the city of Paris, Texas? Just because the name is the same does not mean that the data they represent are the same. In this case, it's important to SELECT DISTINCT enough information to distinguish between things which look identical but aren't.

Cheers,
Ovid

New address of my CGI Course.

Re: (OT) Why SQL Sucks (with a little Perl to fix it)
by jZed (Prior) on Dec 11, 2005 at 02:05 UTC
    While I found Date's writing about NULL interesting when I first read it, it seemed to me that the end conclusion was not "SQL sucks" but rather: understand NULLs and use them appropriately, differentiate between unknown, unavailable, and absent. Likewise it seems to me that your argument does not boil down to "SQL Sucks" but rather: understand DISTINCT and use it appropriately.

    But then again, I have a vested interest in defending the SQL faith and its holy tenets, so feel free to ignore me :-)

      I've been ranting about databases lately on my use.perl journal and have been struggling to figure out why SQL just doesn't quite seem to fit the way my mind works. Now, by going back to the basics, I have a much better understanding. The problem is that relational theory appears to be fairly solid but SQL largely ignores it. Further, programmers who can't even properly describe 1NF are ranting about database problems when it's the interface to the database, SQL, which is causing the problems. (Note: until recently, I also had pretty serious misconceptions about 1NF.)

      Maybe my work with logic programming has tainted me (and most Prolog databases seem poorly designed, too), but the frustrations I've had with SQL not really being set-oriented are being explained quite nicely by Date's work. The bugs and limitations I keep hitting are ones that are a direct consequence of SQL ignoring the relational model. So instead of being able to work with a declarative language that just lets us specify what we want, we have to spend an inordinate amount of time worrying about low-level details which the language could protect us from. Consider:

      SELECT DISTINCT c.name
        FROM customers c, orders o
       WHERE o.id = c.id
         AND o.amount > 1000

      If that's buried in a 100 line SQL statement that's going to be pretty easy to miss but it's almost certainly a bug. You probably want something like:

      WHERE o.cust_id = c.id

      In the relational model that query wouldn't even compile, because the order id would be a different type than the customer id (even though they both might look like integers) and a comparison between the two would be invalid. However, SQL as it's currently implemented doesn't care. Once again, we have an example of SQL's faulty implementation allowing bugs to occur which, if it respected the relational model, would cause the system to crash and burn pretty quickly rather than return spurious results. While it would still be possible to write bad queries, we'd have a huge subset of bugs go away.
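      As a rough illustration of what such domain checking might look like (a hypothetical sketch, not how any current DBMS behaves), imagine ids that carry their domain and refuse cross-domain comparison:

```python
# Sketch of "domains": ids wrapped in distinct types, so comparing an
# order id with a customer id is rejected instead of silently succeeding.
class TypedId:
    domain = None

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if not isinstance(other, TypedId) or other.domain != self.domain:
            raise TypeError(f"cannot compare {self.domain} with "
                            f"{getattr(other, 'domain', type(other).__name__)}")
        return self.value == other.value

class CustomerId(TypedId):
    domain = "customer_id"

class OrderId(TypedId):
    domain = "order_id"

print(CustomerId(7) == CustomerId(7))  # True: same domain, same value
try:
    OrderId(7) == CustomerId(7)        # same representation, different domain
except TypeError as e:
    print("rejected:", e)
```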

      Cheers,
      Ovid


      I should also point out that current database implementations require databases to be in a consistent state at the end of a transaction, but allow them to be in an inconsistent state within a transaction. Date's theory requires that every atomic operation leave the database in a consistent state at the end of the operation. By creating syntax which would allow this, many transactions would probably be unnecessary, thus freeing us from yet another low-level detail we shouldn't be worrying about.

      Of course, you have tons of practical experience here, so feel free to ignore me :)

      By the way, if you're going to be at the next PDX.pm meeting, I wouldn't mind getting together to chat about this.

      Cheers,
      Ovid


Re: (OT) Why SQL Sucks (with a little Perl to fix it)
by fergal (Chaplain) on Dec 11, 2005 at 02:07 UTC
    I read a fair chunk of Date and Darwen's "The Third Manifesto" where they describe what they want from a real relational database. It's great stuff, but in order to get there it requires things that current RDBMSs don't do and which are probably reasonably difficult to bolt on after the fact. Things like tuple-valued columns. That is, having not just INTEGER, VARCHAR, etc. but also being able to do the equivalent of
    CREATE TABLE friends (Nickname VARCHAR(40), Name (First VARCHAR(20), Last VARCHAR(20)))
    which would be great. You'd be able to use Name as the primary key even though current RDBMSs would do it as 2 columns. You could
    SELECT * FROM friends WHERE Name=('John', 'Smith')
    and maybe even
    SELECT * FROM friends WHERE Name.First='John'
    but there are other interesting effects like for example what do you get with
    SELECT Nickname FROM friends WHERE Name.First='John'
    if 2 Johns have the same Nickname? Well, you should get a list of nicknames without duplicates, but now the size of the result set is no longer equal to the number of Johns you know. SELECT Nickname, Name will give you a different number of rows. So the query's behaviour now depends not just on the WHERE clause but also on the columns you've requested. Which could be confusing.
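    The cardinality effect described above can be sketched in Python with sets of tuples (the names here are invented):

```python
# The friends "table" as a set of (nickname, (first, last)) tuples.
friends = {
    ("Jack", ("John", "Smith")),
    ("Jack", ("John", "Jones")),  # two Johns sharing the nickname "Jack"
}

johns = {row for row in friends if row[1][0] == "John"}
nicknames = {nickname for nickname, name in johns}  # project onto Nickname only

print(len(johns))      # 2 rows match the WHERE clause
print(len(nicknames))  # but only 1 distinct nickname survives the projection
```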

    It's interesting to note that while Date is very critical of SQL, he is (or was) a member of the standards committee. DB Debunk has more of Date's writings. Even if a truly relational database doesn't become a reality in the next 10 years, reading about what it would be like, why it should be like that and what's broken about the current crop can only improve your database-foo.

      After I finish this book, I really need to read "The Third Manifesto". I think I'll take a break with Harry Potter first, though :)

      Cheers,
      Ovid


        It's not an easy read. I stopped about half way through and never got back to it. There are parts I disagreed with, like their analysis of inheritance and the need to have 2 separate imperative languages, one to deal with data definition and the other for application programming, but it's got lots of interesting ideas in it. I got it from halfpricecomputerbooks or something like that for $10!
Re: (OT) Why SQL Sucks (with a little Perl to fix it)
by hossman (Prior) on Dec 11, 2005 at 02:23 UTC
Re: (OT) Why SQL Sucks (with a little Perl to fix it)
by renodino (Curate) on Dec 11, 2005 at 03:03 UTC
    The fact that DISTINCT is required to deal with tables (or DBMSs) which permit multisets shouldn't be a surprise, and hardly seems a cause for complaint. I could likewise complain that a Perl hash will destroy the old value when a tuple with a duplicate key is added; hence, "Perl sucks". All languages have behavioral limits. In your example, it's a very logical behavior, given the operational criteria (i.e., using a DBMS that permits multisets).

    You may want to visit Database Debunkings on a regular basis. Mssr. Pascal is a well known critic (crank?) about SQL, so you may find a comrade to support your contentions.

    For added fun, read why NULLs cause problems in databases.

    As opposed to "why NULLs (aka undefs) never cause problems in <Perl | C | Java | ... >"? NULL represents a state. It can be inconvenient, but using a DBMS that enforces integrity, and then using that capability, can often limit your exposure to the issue... but NULL exists for a reason (now there's a non sequitur!).

    As to your update, I don't really see the parallel. If you've designed a database which doesn't use unique keys, or don't bother to apply the complete key, then you'll likely get into trouble. Just as if you use a Perl hash with non-unique keys.

    I should declare that I'm solidly in the SQL fanboy camp (as if that wasn't obvious). I've been using SQL for 20 years, nearly as long as I've used C or Pascal, and a lot longer than Java or Perl. More importantly, I've used it on some very large (multiterabyte, giga-rowset) datasets and very complex problems. Fortunately, I've used it primarily on a DBMS with a very robust/feature rich SQL dialect.

    Most complaints I hear regarding SQL can usually be classified in one of the following categories:

    • The complainer lacks a full understanding or appreciation of set and relational theory
    • The complaint is made in relation to a weak SQL implementation

    (The best clue that either of these issues exists is a quick scan of their code: it usually consists of lots of row-at-a-time cursor operations in nested loops.)

    I noted that your examples were based on SQLite. While SQLite is convenient, it is hardly an example of a robust, feature-rich SQL or relational algebra implementation. While it's nice that you found a reason to use Perl to "fix" it, it might be a better idea in practice to use a more robust DBMS. I could likewise post an article titled "Perl sucks (and here's some C to fix it)", and bury the fact that I'm trying to implement a realtime high frequency DSP algorithm in Perl. Not exactly an appropriate title. Perhaps you should update your OP title to "SQLite sucks at handling duplicate rows (and some Perl to fix it)"?

      The only reason I used SQLite is that it's a quick-n-dirty way of showing a concept. The problems I show are not related to SQLite; they tend to be database agnostic. SQL is bag-oriented, not set-oriented. As a result, it's very easy to get incorrect results with it. Those are details which we should be hesitant to expose in the language.

      As for the problems with NULLs, if you read the link I posted, I show simple examples of how NULLs in a database easily allow queries to return incorrect answers. I've also worked on extremely large databases with huge queries, and NULLs which cause spurious results are notoriously difficult to track down. Unfortunately, some of the techniques to limit the impact of NULLs -- using 6NF and distributed foreign keys -- are either serious performance bottlenecks due to current DBMS implementations or are not natively supported in SQL.

      Of course, similar criticisms apply to Perl and yes, I've made them. Perl has some weaknesses too and I don't mind admitting that. However, Perl is trying to eliminate many of them with Perl6 (whether it will be successful remains to be seen). I don't see many database vendors going back to the underlying relational theory to fix these issues. I see them hacking things on top of SQL, bragging about their object support or other things which, it seems to me, would largely be moot if SQL was "fixed" (which I doubt it ever will be).

      Sure, we can point to programmers who have a poor understanding of relational or set theory and say the problem lies with them (often it does). But to force them to use a language which doesn't support these concepts very well isn't going to make the situation better.

      Now if you can provide concrete examples of why Date is wrong, I'd happily hear counter-arguments.

      Cheers,
      Ovid


        I'll take a shot at that challenge, although really Date is right. It's more a matter of the semantics of right and wrong here.

        DISTINCT issues: First off, your examples make use of silent joins and unnormalized design, which makes everything look worse than it is. Normalization + explicit joins eliminate the problem or make it very obvious why it happens. It happens every time you select rows of one table, then jump across a many-to-{many|one} relationship to another table and pull out column data from there. A person versed in SQL will know to always use DISTINCT in this case, and that it is unnecessary in any other case assuming a normalized design. It is arguable that the semantics of SELECT should be that it automatically DISTINCTs the results in these cases, as the case is easily machine-detectable. But I don't think that makes all of SQL broken. It's just a repetitive idiom SQL users have to learn, like most languages have.

        NULLs: Quite simply, for most designs, most columns should be set to "NOT NULL". End of problem. Enabling NULL values in a column without thinking through the implications (which are indeed precarious) will of course cause issues down the line. The same can be said of many ill-thought-out design-time decisions.
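        A small demonstration of why NULL-able columns need that thought (Python's sqlite3 here, but the three-valued logic is standard SQL):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# In SQL's three-valued logic, NULL = NULL is neither true nor false: it is NULL.
print(db.execute("SELECT NULL = NULL").fetchone())  # (None,)

# A WHERE clause keeps only rows where the condition is TRUE, so a NULL
# comparison silently drops rows from BOTH the = and the <> results.
db.executescript("""
    CREATE TABLE t (x INTEGER);
    INSERT INTO t VALUES (1), (NULL);
""")
eq = db.execute("SELECT count(*) FROM t WHERE x = 1").fetchone()[0]
ne = db.execute("SELECT count(*) FROM t WHERE x <> 1").fetchone()[0]
print(eq, ne)  # 1 0 -- the NULL row appears in neither result
```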

      The complainer lacks a full understanding or appreciation of set and relational theory

      The problem with SQL is that it requires "a full understanding or appreciation of set and relational theory" to use it properly, but offers itself to the world through its "friendly" declarative syntax.

      It encourages people to think that they are competent because their queries return the results they expect. What they often do not understand is the costs that it went through to produce them, nor how much their query was optimised for them under the covers (in the case of the more sophisticated implementations--usually the commercial ones).

      Perl has a similar problem in that its accessibility makes it easy to 'get something to go' for the beginner without grounding, but without the grounding, they do not appreciate the redundancies and inefficiencies in how they get the results.

      SQL as a language is ill-fitting to the operations of set theory, with multiple, verbose, and often awkward and ambiguous ways of specifying relationships that would be concise and unambiguous in set notation. Many years ago I was trying to optimise a complex select across a 7-way join of some pretty large tables (for the time!). When we consulted a DB2 expert from IBM, he showed us an APL statement that did the equivalent selection; it consisted of about 15 to 20 characters. In total.

      If you think I'm kidding, go here and scroll down to the table under the heading "How Does APL Compare with other Languages".

      I never learnt APL, but I've often wondered what could be achieved if RDBMSs accepted it as an alternate syntax. Given that Unicode now supports the APL character set, maybe it's time for APL's resurrection :)


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        It appears your argument is at odds with itself. You contend that SQL requires a full understanding of set theory, and then contend it doesn't. My initial statement was made in a "wholistic" context (sorry for the cliche). If one intends to use multiset tables, they need to understand the issues they present, and the syntax elements available to address those issues. But it is possible to craft data models which don't permit multisets, and hence SQL users can be blissfully ignorant of the multiset issue. Likewise with NULLs: either add constraints to disallow them, or translate them to some other value which has a more specific meaning.

        Perhaps the real issue isn't SQL, but rather the myriad ways in which data models can be constructed. There are reasons to implement data models in one manner or another, depending on the "profile" of the problem set it addresses, balanced by performance and resource costs, and the skills of the user community. The fact there isn't a "one true" and simple SQL syntax to accommodate every combination of data model requirements shouldn't be surprising. (Of course, proper application of views can mitigate the need for "SQL savvy" users.)

        As to "incorrect" resultsets, I'd contend there is no such thing. There are, however,

        • poorly formed queries
        • badly designed data models
        • poorly cleansed databases

        I never learnt APL, but I've often wondered what could be achieved if RDBMSs accepted it as an alternate syntax.

        I actually did learn APL waaaaay back in the long ago time, and it was a very intriguing language. I can imagine implementing very complex queries with a handful of characters. However, if you think SQL is hard to learn to use properly with complex data models, then APL will likely make some heads explode. Also, APL's arcane albeit concise syntax doesn't solve the problems inherent in dealing with large datasets built for multiple purposes and large, diverse user communities. I'd assert that an APL solution would also require an optimizer under those circumstances, unless one has the luxury of permuting and storing every index tuple within the dataset, in which case no optimizer is needed (other than a crafty APL coder with a deep understanding of the data model). APL by itself doesn't make disks spin faster, nor CPUs compute faster, nor RAM expand automatically.

        As to the complaint regarding SQL's fitness, what particular elements do you take issue with ? JOIN ? UNION ? INTERSECT ? DIFFERENCE ? Subqueries to derive subsets ? CASE ? What elements are ambiguous ? What you perceive as verbose, I see as "readable" (and I've read some *very* large queries w/ 16+ joins, and numerous complex subqueries with several OLAP clauses in them). (FWIW, I usually eschew more "punctuation" driven Perl solutions in preference to readable equivalents). While GROUP BY/HAVING have been a particular nemesis of mine over the years, I've come to assimilate their usage, just like obtuse elements of other languages.

        If SQL is such a poor language, why does it continue to be used ? Wouldn't Darwinism have long ago found a replacement to overtake it ? (I'm aware there are contenders, but they haven't seemed to gain much of a foothold). E.g., Java managed to rival (and perhaps displace) C++ in a relatively short time - IMHO, for good reasons. For that matter, Python and Ruby have made significant entry into the Perl user base. Yet SQL is nearly 30 years old, and it supplanted other contenders, e.g. QUEL, QBE. If SQL, and the current crop of RDBMS's are so bad, why hasn't someone built the replacement and become famous and fabulously wealthy ?

      Most complaints I hear regarding SQL can usually be classified in one of the following categories:
      • The complainer lacks a full understanding or appreciation of set and relational theory
      • The complaint is made in relation to a weak SQL implementation
      That may or may not be true but Date is one of the authors of the SQL standard and one of the leading authors of RDBMS books, so I think we can assume that his criticisms are not those of a "crank".

      One thing that does annoy me about Darwen and Date's criticisms is that they are a bit too fond of making statements about facts they have already established and just referring the reader to the paper/article/expensive-book-that-they-just-happen-to-have-written in which they were established. They are struggling against a huge mainstream of standard SQL/RDBMS dogma; they need to make their arguments easily accessible and not put the burden on the reader. That's fine for academia with its well-stocked libraries but a bit much to expect of the casual reader who has not yet drunk their kool-aid.

        They are struggling against a huge mainstream of standard SQL/RDBMS dogma; they need to make their arguments easily accessible and not put the burden on the reader. That's fine for academia with its well-stocked libraries but a bit much to expect of the casual reader who has not yet drunk their kool-aid.

        FWIW, that's just what Database in Depth is designed to do. O'Reilly has posted a link to the first chapter, as well as an interview with Chris Date.

        —Theory

They do respect set theory!
by Celada (Monk) on Dec 11, 2005 at 03:16 UTC

    If you think about it just a little differently, they do respect set theory. Consider this.

    Tables and views are sets. The rows of the table are isomorphic to the elements of the set. A SELECT query is nothing more than an operation that takes the cross product of zero or more sets (FROM) and then filters this big set (WHERE). But what are the elements of these sets?

    The first thought is to consider that the elements are m-dimensional vectors, where m is the number of columns in the table. Of course you can't have twice the same vector in a set (= you can't have two identical rows in a result) because that goes against what a set is! But if we use an obvious definition for vector equality, which is that the vectors have the same dimension and the components are each equal to each other, we run into trouble from the start: the example tables you give cannot even exist in the first place! (since tables and views, not just results, are supposed to be sets)

    So how do we solve this? You have to think of each row in each table as unique among all rows of the same table. Pretend that each row (element of the set) is a unique object with m attributes. It might have the same values for all attributes as some other row of the table, but it's still unique. Another way to think about it is that there is a hidden "unique ID" column in every table (you need only add such a column if the table has no primary key). Now everything works. The row that is the cross of the first row from parts and the second row from supplier_parts is not the same as the row that is the cross of the second row from parts and the second row from supplier_parts, even though they look the same because they have all the same values in the columns. Well, for that matter, the first and second rows of the parts table aren't the same row even though they share all the same values.

    Suppose you reject the preceding paragraph as a terribly inelegant way to define uniqueness (and it is that, though it's the only way I know that works!). In doing so you restrict all your tables to contain only DISTINCT rows, or else they can not be interpreted as sets. You can force this by having a primary key in every table.

    What happens? Now SELECT queries never produce duplicate results, just as set theory demands. Note that you have to SELECT * to see that this is true. Selecting a proper subset of the columns is only an output filter which can give the illusion that there are identical rows. It doesn't run splice (hey! I brought it back on topic by mentioning a Perl operator!) on the underlying vectors. But DISTINCT will help you filter those illusory duplicates from the output.
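    Here's the projection-as-output-filter idea sketched in Python (the rows are invented; the full tuples form a true set, and the projection only looks like it has duplicates):

```python
# Rows made unique by a key: as full tuples they form a true set,
# but projecting away the key makes "duplicates" appear in the output.
rows = {
    (1, "P1", "Screw"),
    (2, "P1", "Screw"),
    (3, "P2", "Screw"),
}

projected = [(pno, pname) for _, pno, pname in rows]  # SELECT pno, pname

print(len(rows))            # 3: every full row is distinct (SELECT *)
print(len(projected))       # 3 rows out, but...
print(len(set(projected)))  # only 2 distinct: DISTINCT filters the illusion
```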

    I'm not saying this is great, I'm just saying that's how you have to apply the set theory if you want it to work.

      Thank you. This is exactly the way I see it as well. I don't see this behaviour as a negative; in fact I see it as a positive. "Optimizing 'distinct' away", as Ovid puts it, is not done well because it's not easy to do. And not making queries dupecheck by default means a temporary sort table or hash table doesn't have to be constructed, maintained and then thrown away for every query. The result set can just be piped back. For most well designed, properly constrained databases there is no need for dupechecking, and when there is the need, 'distinct' or 'group by' are waiting to handle it.

      BTW, IMO there are lots of reasons to bitch about SQL, but for me they are mostly syntax related. SQL was IMO not designed by a capable language designer; its syntax is horrible. (I'm not saying I could do better, but I'm pretty sure there are folks out there that could.)

      ---
      $world=~s/war/peace/g

      It might have the same values for all attributes as some other row of the table, but it's still unique. Another way to think about it is that there is a hidden "unique ID" column in every table (you need only add such a column if the table has no primary key).

      At first blush this seems like a compelling argument and I've struggled to find just the perfect example to explain the problem. I can't, so I'll go with an imperfect example. First, I want to paraphrase what you said so that if I have misunderstood you, you can correct me.

      Consider the "supplier_parts" table:

      +------------------+
      | Supplier_Product |
      +------------------+
      |   SNO   |  PNO   |
      +---------+--------+
      |   S1    |  P1    |
      |   S1    |  P1    |
      |   S1    |  P2    |
      +------------------+

      Your argument is that we can either infer the existence of a hidden "unique ID" or put an explicit primary key on the table to make the rows a set. You are right that this would turn the rows into a set. However, it also violates 3NF. By substituting a clearer example, I can explain this better:

      +-----------------+
      | City  | Country |
      +-------+---------+
      | Paris | France  |
      | Paris | France  |
      | Paris | USA     |
      +-----------------+

      What you're suggesting is that I could do this:

      +----------------------+
      | ID | City  | Country |
      +----+-------+---------+
      | 01 | Paris | France  |
      | 02 | Paris | France  |
      | 03 | Paris | USA     |
      +----------------------+

      The bag of rows now constitutes a set, but "Paris, France" is not dependent on the ID as required by 3NF. Assuming these truly represent the same city, are we referring to 01 or 02? Which one should other records point to? Merely adding a unique ID does appear to satisfy the "set" constraint, but since it violates 3NF, we now have a denormalized database and cannot maintain data integrity.

      Cheers on catching my bad SQL, though :)

      Cheers,
      Ovid

      New address of my CGI Course.

        Ovid, this reasoning is flawed. The table with an identity column violates normalization no more than the original table did. All it does is give you a basis for understanding the previous results. The DB maintains those values at some level anyway; all that adding the column does is make them explicit and available for reference. Redo your original queries to show the IDs and you'll see the dupes in your output are due to dupes in your input.

        Incidentally, adding the ID doesn't violate normalization. What violates normalization is the two dupe rows in the table (by definition, any table that contains duplicate data in two columns across multiple rows is not fully normalized). Adding the identity column doesn't change that fact. Your claim that this means you cannot maintain data integrity is rubbish. On the contrary, identity columns make it easier to maintain data integrity. Instead of using "City/Country" as the key to the table, you use the ID as the key and then put a unique index on City/Country. Now your city/country data is guaranteed to be a set, and the IDs allow unambiguous reference to a particular location object in the table.
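        The scheme described here (surrogate ID as the key, plus a unique index on the natural City/Country key) can be sketched with a throwaway SQLite table. This is only an illustration of the design choice, not anyone's production schema; the table and column names are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE location (
        id      INTEGER PRIMARY KEY,   -- surrogate key: unambiguous references
        city    TEXT NOT NULL,
        country TEXT NOT NULL,
        UNIQUE (city, country)         -- natural key: the data stays a set
    )
""")
con.execute("INSERT INTO location (city, country) VALUES ('Paris', 'France')")

# The duplicate row is rejected at the door, so "which Paris?" never arises
rejected = False
try:
    con.execute("INSERT INTO location (city, country) VALUES ('Paris', 'France')")
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
```

With both constraints in place, the ID can be used for foreign-key references while the unique index keeps the city/country pairs a set.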

        Assuming these truly represent the same city, are we referring to 01 or 02?

        You have no way of differentiating the two entries without the identity column; you have a way of differentiating them with one. With the identity column, references into that table are unambiguous; without it, they are ambiguous. Without a constraint on duplicate values in the key columns and without an identity column, the table cannot be unambiguously addressed for all data sets it contains. Ideally you should have both: an identity column and a column-level constraint prohibiting dupe values.

        End result: you are still using Garbage input to prove that the RDBMS is doing something wrong. Which is just silly.

        ---
        $world=~s/war/peace/g

      If you want to view it as a set of objects where each object has a life of its own and has slots that can be updated with new values etc, that's fine and that's very close to the implementation of modern RDBMSs but that is absolutely not the relational theory and that is Date's point.

      We are stuck with an interface that is derived from the implementation, which can be ugly and sometimes throws up some nasty surprises. We should have an interface that comes from the theory, and let the DBMS figure out how to efficiently do the work. This is of course a difficult problem, but so is writing a compiler for a high-level language. Current DBMSs are forcing us to program in "assembly language". Maybe a better analogy is that they make us use lots of GOTOs when they should provide us with WHILEs and FORs. A good example is the necessity of having ID columns all over your database. In Date's world, these would not exist; more precisely, these would not be part of the interface. They would almost certainly be part of any efficient implementation, but you just wouldn't see them or ever have to think about them.

      The relational theory is about the only theory of stored data that actually has a full mathematical theory backing it up. None of the "object relational" systems have a proper formal basis, for example. As such it's the only one where we are guaranteed consistency and where we can look at two queries that look like they're selecting the same things and say "yes, they will both produce the same results no matter what data is in my database".

      Another way to think about it is that there is a hidden "unique ID" column in every table (you need only add such a column if the table has no primary key).

      That directly violates the Information Principle...
Re: (OT) Why SQL Sucks (with a little Perl to fix it)
by dragonchild (Archbishop) on Dec 11, 2005 at 05:01 UTC
    Now right off the bat, if you know the rules of normalization you know these tables are not properly designed. However, for the sake of simplicity, bear with me.

    If you have improperly-designed tables, then you will have improperly-designed queries to deal with the improperly-designed tables. Garbage In, Garbage Out. That said ...

    Your queries were improperly designed. You're conflating two questions into one query, so you have a hacked-up query (on top of improperly-designed tables). Much better is (written for MySQL 4.1+, though it will work on Oracle8+ and PG):

    SELECT parts.pno
      FROM parts
     WHERE parts.pname = 'Screw'
    UNION
    SELECT supplier_parts.pno
      FROM supplier_parts
     WHERE supplier_parts.sno = 'S1'

    Note the use of a UNION to join two disjoint concepts. There is absolutely no reason to attempt to shoehorn this problem into one query when it's really two separate questions. Plus, the UNION will remove duplicates for you. Still relational theory.
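    The dedup behaviour of UNION (versus UNION ALL, which keeps bag semantics) is easy to confirm against a scratch copy of the tables from the post. A quick SQLite sketch, assuming nothing beyond the sample data shown above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE parts (pno TEXT, pname TEXT);
    CREATE TABLE supplier_parts (sno TEXT, pno TEXT);
    INSERT INTO parts VALUES ('P1','Screw'),('P1','Screw'),('P1','Screw'),('P2','Screw');
    INSERT INTO supplier_parts VALUES ('S1','P1'),('S1','P1'),('S1','P2');
""")

# UNION removes duplicates within and across both branches ...
union = con.execute("""
    SELECT pno FROM parts WHERE pname = 'Screw'
    UNION
    SELECT pno FROM supplier_parts WHERE sno = 'S1'
""").fetchall()

# ... while UNION ALL preserves the bag: 4 rows + 3 rows
union_all = con.execute("""
    SELECT pno FROM parts WHERE pname = 'Screw'
    UNION ALL
    SELECT pno FROM supplier_parts WHERE sno = 'S1'
""").fetchall()

print(len(union))      # 2 -- just P1 and P2
print(len(union_all))  # 7
```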

    As for your rants ... I've been following them with some ill-taste in my mouth. Yes, you have had issues with databases (who hasn't?). However, you must realize something - you have a very public profile in the Perl community, which is completely deserved. Please do not use this profile to spread FUD about databases. RDBMS vendors may not be ivory-tower about their engines following relational theory, but neither do your examples. Your entire post was a straw-man with a poorly-concealed axe to grind.

    Now, if you were to provide something in at least 3NF that demonstrates your point about the lack of relational theory, then I'll be glad to learn from you. But, don't start out with something that's not even 1NF, conclude that relational theory is completely ignored by RDBMS vendors, and expect to not be taken to task about it.

    As for NULLs, they were a mistake in the initial SQL spec that, for historical reasons, will never be removed until some enterprising soul goes ahead and builds an extension to an OSS RDBMS that doesn't have them. I use them because the alternative is a performance penalty that my clients won't pay me to suffer. (For the uninitiated, a NULL column demonstrates a denormalization in your schema that should be normalized out into some options table. No one does this because the performance penalty is prohibitive.)


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

      The easy bit first: you and I appear to be in complete agreement about NULLs.

      Now for the hard bit:

      Your queries were improperly designed. You're conflating two questions into one query, so you have a hacked-up query (on top of improperly-designed tables).

      There are two issues with that statement. First, the tables were not improperly-designed. They had bad data in them. This is a very important distinction which goes to the heart of the matter. More on this in a moment.

      As for the "improperly designed" queries, I'll stipulate that they are, but only because SQL is not relational. In reading the queries, someone new to SQL could easily conclude (we see this often enough) that those queries are logically equivalent. It's only SQL's poor support for the relational model which makes those query results different. In fact, Date goes on to show 12 different ways of writing that query (including yours) which return 9 different results but which would all return the same, correct result if operations on relations (tables, if you will) returned sets. In fact, one of the fundamental points of the relational model is that a relation (what we normally think of as a table) is an unordered set (in the mathematical sense) of tuples. Any operation on one or more relations must return another relation -- in other words, it must return a set. This includes selecting, inserting or updating information. If operations violate this, then the system implementing this operation is by definition not relational.

      This gets back to the first point about the bad data in the tables. First, a set is a collection of unique items. There is no such thing as the set of integers (3,3,2). That should be the set of integers (3,2). Further, the union of (1,2,3) and (2,3,4) yields (1,2,3,4), not (1,2,2,3,3,4). SQL allows us to add duplicate tuples in tables. That's one of the reasons they're called "tables" and not "relations". Sure, I can put a unique constraint across all of the columns, but why should I? They should be unique in the first place. There's no reason those duplicates should exist and having SQL graciously allow us a constraint which should be there in the first place doesn't fix the problem (and I tell ya, it's real fun finding out that some maintenance programmer put a few thousand duplicate rows in a lookup table with no unique key to let me delete the extras).
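      The set arithmetic in the paragraph above can be checked directly. Here is a small illustration in Python, whose built-in set type follows the same rules, alongside a Counter standing in for the multiset ("bag") that SQL tables actually give you:

```python
from collections import Counter

# A set collapses duplicates on construction: there is no set (3,3,2)
assert set([3, 3, 2]) == {3, 2}

# ... and set union never introduces duplicates
assert {1, 2, 3} | {2, 3, 4} == {1, 2, 3, 4}

# A bag (multiset) keeps the duplicates, which is SQL's default behaviour
bag = Counter([1, 2, 3]) + Counter([2, 3, 4])
print(sorted(bag.elements()))  # [1, 2, 2, 3, 3, 4]
```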

      So right off the bat, SQL allows us to violate the relational model when inserting records. Of course, one could argue that "that's just bad database design". Fine, let's stipulate that you won't enter duplicate rows.

      Now, if you were to provide something in at least 3NF that demonstrates your point about the lack of relational theory, then I'll be glad to learn from you. But, don't start out with something that's not even 1NF, conclude that relational theory is completely ignored by RDBMS vendors, and expect to not be taken to task about it.

      Again, the mere fact that SQL allows duplicate rows in tables means that SQL ignores the relational model and DBMS vendors are largely responsible for not providing us with tools which respect the model. However, if that's not enough to convince you ...


      So a salesperson walks up to an accountant and says "Bob, for our customers with excellent credit ratings, what cities do they live in? I need to plan flights through those cities."

      "What do you mean by 'excellent credit rating', Alice?"

      "For the sake of argument, let's include everyone with a credit rating greater than 700."

      "OK, let's see ... there's London. Oh, and there's London. And Paris. And London. London again, Athens, London, Paris, Moscow, London and Paris."

      "That's a hell of a flight."


      The answer Bob should have given Alice is "London, Paris, Moscow, and Athens". Repeating those city names over and over without reason doesn't make his assertion any more true. In fact, it almost makes the response nonsensical, and if your accountant always answered questions that way, you'd think he was mental. But that's the sort of answer this query gives:

      SELECT city
        FROM cities, customers
       WHERE customers.city_id = cities.city_id
         AND customers.credit_rating > 700

      That could easily be on a database in 3NF. There's nothing about the structure of the query which implies that the database is poorly designed.

      Alice shouldn't have to tell Bob to give her "DISTINCT" cities because repeating the city names not only makes no sense, it helps to confuse the issue. SQL does not return sets by default. If it returns duplicates, it's not a set and thus the system isn't relational. Sets do not have duplicates. Now in the example above, if Alice is really stupid, she's going to have one hell of a long flight. Fortunately, Alice is not really stupid. Unfortunately, software is. If, instead of Alice, it's some software receiving the results of that query, things could go disastrously wrong.

      If SQL followed relational theory, simple queries like this would return correct results. Instead, the SQL programmer has to remember that SQL will happily churn out garbage. The programmer is then put in the awkward position of having either to put "SELECT DISTINCT" everywhere or to remember that "SELECT DISTINCT" can have horrible impacts on performance and thus should be used very selectively. If SQL were relational, the programmer would never have to worry about this.

      What I'd like to hear, and haven't, is a justification for the "not returning sets" behavior. If there is any realistic justification for queries returning duplicate information, I'd love to hear it. (The "but it's too slow otherwise" response is faulty because it confuses implementation with behavior.)

      Cheers,
      Ovid

      New address of my CGI Course.

        (Below, when I speak about an RDBMS, I am speaking about Oracle9i. Almost every statement will be applicable to the majors, but they are correct for Oracle.)

        First, the tables were not improperly-designed. They had bad data in them. This is a very important distinction which goes to the heart of the matter.

        A table with bad data in it is an improperly-designed table. A properly designed table cannot, by virtue of it being properly designed, have bad data. It might have incorrect data, but the data cannot be bad. I will address your further points in the order you presented them, but this is a critical understanding of RDBMSes that you don't seem to have. (It's rare to see it in someone who hasn't been a DBA, so don't feel bad.)

        Another critical understanding that you're missing - the "relational" part in an RDBMS isn't for the table - it's for both the row and how rows in different tables relate. All the columns in the row are related to the primary key column(s) and columns in different tables relate rows together. The "set" part comes in that the keys are unique, not that the results of a query are unique. A table is a set of primary keys, along with the values that they index into. The Perl analogy would be a hash. The keys of a hash are a set, but the values are not. This is a good thing.

        No, SQL is not relational. If it was, it would be called RQL for "Relational Query Language," not "Structured Query Language." SQL was designed for Business Analysts, not programmers. It's a 4GL - one of those "natural language programming languages" that were touted in the 80's. That it's a complete failure both as a programming language and as a tool to work with databases is only to have been expected.

        ... a relation (what we normally think of as a table) is an unordered set (in the mathematical sense) of tuples. ... SQL allows us to add duplicate tuples in tables. That's one of the reasons they're called "tables" and not "relations". Sure, I can put a unique constraint across all of the columns, but why should I?

        You're conflating so many different concepts here that I almost don't know where to start.

        • Most RDBMSes make a distinction between DML and DDL - the language one uses to query the tables (Data Manipulation Language), such as SQL, and the language one uses to define the tables (Data Definition Language). Neither of these are "relational." In fact, it doesn't make sense to talk about them as "relational." They can be used to interact with things that, if designed and built properly, are relational.
        • A table in a database doesn't make any sense unless it has a primary key. In fact, that is why the First Normal Form (or 1NF) is "Every row must have a primary key." Without that, you're absolutely correct in that the table is a bag and not a set. With that, the table is a set that defines a relation between the primary key column(s) and the dependent column(s). You don't need to have a unique constraint across all the columns, just across the one(s) that make up the primary key.
        • While you can think of the row as a tuple of all its columns, it's more accurate to think of it as a function that, given a tuple for the primary column(s), returns a tuple for the dependent column(s). After all, a relation is nothing more than a function that maps items from one set to items in another set.
        • You are correct in that SQL will allow you to attempt to do anything you want to a given table. It's up to the table's designer(s) to have set the appropriate restrictions upon the table.
        • Tables are called tables and not relations because they aren't relations. The relation is within the row, not the table. (See above.)

        As for Bob and Alice, there's a very good reason why SQL not only gives back that answer, but SHOULD give back that answer, and it comes down to the Principle of Least Surprise.* Let's say you have the query SELECT foo FROM my_table; That returns back 10 rows. Under the current system, that means that there's 10 rows in my_table and here are the values in the foo column for each one. Under your idea, that means there are 10 distinct foos in my_table, regardless of how many rows.

        Now, what happens when I say SELECT foo, bar FROM my_table;? Under the current system, this still returns 10 rows. Under your system, it could return any number of rows, so long as each row it returns is a unique (foo, bar) combination. For people who don't understand anything about set theory, that's somewhat counter-intuitive.
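        The point about row counts changing with the column list is easy to demonstrate. Simulating distinct-by-default semantics with an explicit DISTINCT over an invented four-row table, the same table yields different cardinalities depending on what you project:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE my_table (foo TEXT, bar TEXT);
    INSERT INTO my_table VALUES ('a','x'), ('a','y'), ('b','x'), ('b','x');
""")

plain   = con.execute("SELECT foo FROM my_table").fetchall()           # current SQL
one_col = con.execute("SELECT DISTINCT foo FROM my_table").fetchall()  # set of foo
two_col = con.execute("SELECT DISTINCT foo, bar FROM my_table").fetchall()

# Current semantics: always 4 rows. Set semantics: 2 rows for foo alone,
# but 3 rows as soon as bar joins the projection.
print(len(plain), len(one_col), len(two_col))  # 4 2 3
```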

        Believe me, I do understand your consternation. The idea that an operation upon a set returns a collection is counter-intuitive to those who understand any set theory at all. But I point back to my earlier point: the set isn't the rows in the table taken as a whole; it's the range of the function between the primary key and the dependent columns. As long as your query returns the primary key, it is a set. If your query doesn't, then it doesn't return a set.

        To elucidate, let's look very quickly at what a SELECT does, under the hood. The basic operation, as you've noted, is to take a set (the PK tuple), apply a set of restrictions on it (either the PK tuple itself or the values tuple), then return the requested values from all rows that satisfy those restrictions. As long as your table(s) being queried was a set to begin with, the results will also be a set. However, your view of the results may or may not be a set, depending on what you've chosen to see.

        Let's take your specific example:

        SELECT cities.name
          FROM cities, customers
         WHERE customers.city_id = cities.city_id
           AND customers.credit_rating > 700
        I'm going to rewrite as:
        SELECT (cities.name)
          FROM (customers JOIN cities USING (city_id))
         WHERE (customers.credit_rating > 700)
        Assuming that customers and cities are both sets, then we have four sets involved in that query.
        1. cities
        2. customers
        3. the resultset
        4. the fromset
        Your resultset is a set, believe it or not. It's the set of all rows from the intermediate table "customersXcities" that satisfies the criterion of customers.credit_rating > 700. This intermediate table has the same PK of your customers table. Think about that for a second - you're not querying the cities table directly. You're querying an intermediate table. Each row in your resultset is a row from that intermediate table. Then, the SELECT clause applies a set of vertical black strips to the resultset, each on top of a column. You turn off the strip by requesting that column's values be given to you. The SELECT clause isn't a set operation.

        *: Don't give me that shocked look - you had to have known that the PoLS would make an appearance in this discussion. If only so that I could tweak you about how your surprise was anticipated before you were born. :-)


        My criteria for good software:
        1. Does it work?
        2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: (OT) Why SQL Sucks (with a little Perl to fix it)
by renodino (Curate) on Dec 11, 2005 at 17:02 UTC
    Perhaps a little turnabout will illustrate my primary concern regarding the OP, i.e., the hyperbolic title:

    <sarcasm>

    Why Perl sucks (and some SQL to fix it)

    I've been having problems with my Perl scripts for a while, and now that I've read "SQL for Smarties", I realize the problem isn't my misunderstanding of Perl hashes, but that Perl is a bad language for data manipulation.

    Consider the following two database tables:

    +-------------+    +------------------+
    | Products    |    | Supplier_Product |
    +-------------+    +------------------+
    | PNO | PNAME |    | SNO     | PNO    |
    +-----+-------+    +---------+--------+
    | P1  | Screw |    | S1      | P1     |
    | P1  | Screw |    | S1      | P1     |
    | P1  | Screw |    | S1      | P2     |
    | P2  | Screw |    +------------------+
    +-------------+
    Now right off the bat, I'll say I need to maintain duplicate rows in order to compute some statistical rollups on the data, so the data model isn't flawed.

    When I load these into my Perl hash, using PNO, and SNO as keys, I end up with just 2 Products rows, and 2 Supplier_Product rows! Perl is violating my requirements! Perl should read my mind, and automatically promote the hash values from scalars to arrayrefs for each duplicate key! Sure, I could write some code to check for and fix it, but the fact that Perl doesn't preserve my input is bad. Ergo, Perl sucks!
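    The key-clobbering complained about here isn't unique to Perl; any keyed structure collapses duplicate keys. A sketch of the same load in Python dict terms (the data is the Supplier_Product sample from the post; the variable names are invented):

```python
from collections import Counter

supplier_product = [("S1", "P1"), ("S1", "P1"), ("S1", "P2")]

# Keyed on (SNO, PNO), the duplicate row silently collapses: 3 rows in, 2 out
keyed = {row: row for row in supplier_product}
print(len(keyed))  # 2

# Preserving the multiset takes explicit bookkeeping, e.g. a count per key
counts = Counter(supplier_product)
print(counts[("S1", "P1")])  # 2 -- the duplicate survives as a count
```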

    Fortunately, I can fix this by using SQL, which can preserve my multiset data model:

    SELECT parts.pno
      FROM parts
     WHERE parts.pname = 'Screw'
        OR parts.pno IN
           (SELECT supplier_parts.pno
              FROM supplier_parts
             WHERE supplier_parts.sno = 'S1')
    which returns the 3 P1s and 1 P2, which is consistent with my data model. And yet, I can also eliminate the duplicate rows by just adding a single keyword:

    SELECT DISTINCT parts.pno
      FROM parts
     WHERE parts.pname = 'Screw'
        OR parts.pno IN
           (SELECT supplier_parts.pno
              FROM supplier_parts
             WHERE supplier_parts.sno = 'S1')
    which returns 1 P1 and 1 P2. So SQL provides much greater flexibility than Perl. Ergo, Perl sucks.

    And don't get me started about undef! How can anyone possibly use undef? It doesn't mean anything, and Perl barks when I do something like

    use warnings;
    my $foo = undef;
    print "This is some ", $foo, "\n";
    What's that all about? Man, Perl sure does suck.

    And why doesn't Perl automatically know what data I'm going to feed it at compile time ? I.e., why doesn't Perl complain at compile time when I enter:

    perl somescript.pl 1 2 three four
    for the script
    my $sum;
    $sum += $_ foreach @ARGV;
    print $sum, "\n";
    it prints the incorrect value 3! Sure, I could add some code to test the input and either translate it or throw it out if it's not numeric, but that might affect performance. Man, Perl really sucks!

    </sarcasm>

    Perl doesn't suck. Neither does SQL. They may be imperfect, they may not always do what we mean, but very few languages have ESP-enabled runtime components. If you choose to use hyperbolic titles, then use (by your own admission) flawed strawmen to illustrate your points, and further admit that the language you demean actually does accommodate your flawed exemplar (i.e., by using DISTINCT), your incendiary tone seems a bit misplaced.

      When I load these into my Perl hash, using PNO, and SNO as keys, I end up with just 2 Products rows, and 2 Supplier_Product rows! Perl is violating my requirements! Perl should read my mind, and automatically promote the hash values from scalars to arrayrefs for each duplicate key! Sure, I could write some code to check for and fix it, but the fact that Perl doesn't preserve my input is bad.

      Perl programmers, unlike many SQL programmers, do not claim that their tool follows a particular mathematical model. Maybe you understand that SQL does not follow the relational model, but many don't, and this misunderstanding means bad queries returning bad results. Further, Perl, unlike SQL, does not pretend to handle relational data gracefully. In fact, I gave a presentation at the last OSCON which dealt quite heavily with this topic. I'm quite aware of Perl's shortcomings here and I don't pretend they don't exist.

      So from where I'm sitting, your cute inversion of what I wrote doesn't make sense. There is no rigorous theoretical model which Perl is sweeping under the rug. There is a huge theoretical model which SQL is sweeping under the rug.

      Perl doesn't suck. Neither does SQL. They may be imperfect, they may not always do what we mean to do, but very few languages have ESP enabled runtime components.

      "ESP enabled runtimes"? If SELECT statements returned sets by default, their behavior would match the behavior that many new to databases would expect. Just because you're used to how SQL works doesn't mean that newcomers are neccessarily going to understand it. Set operations return sets. That's how they work. SQL does not return sets. SQL returns bags. Why? Some people argue "oh, but it's too slow to always returns sets". Fine. If database vendors had spent the past thirty years ensuring that the relational model was supported rather than this weird hybrid language which fits no conceptual model, the "too slow" argument might very well not be there. So separating implementation from behavior for a moment, how can you justify the default behavior of SELECT to be returning bags instead of sets? You can read my response to dragonchild for a more thorough grounding of my reasoning here. Since so many ordinary "common man" problems with SQL are related to programmers not being aware of the theory, why does it make sense that SQL allow them to violate by default?

      There is absolutely nothing wrong with pointing out SQL's flaws. Just as understanding Perl's flaws allows us to better handle them when we run up against them, understanding that SQL does not follow the relational model allows us to be better prepared when we hit this limitation. You may not like the fact that I've pointed out a persistent problem with SQL, but note that I not only pointed out the problem, I also listed a solution to that problem and discussed its pros and cons. I can't see how that's a bad thing.

      Cheers,
      Ovid

      New address of my CGI Course.

        I think the simplest response is that SQL does return sets, if your tables are sets. Your example tables aren't even sets, so that makes the conversation that much harder to carry on.

        The way I've always thought about it is that the WHERE clause defines the set, and the SELECT clause defines what your VIEW of that set should show. With this view, I would never expect changing the SELECT clause to change the rows returned by the WHERE clause. I've never once been confused by the issue you talk about. The results might not be "relational" by that definition, but I've never actually seen a DB claim that it implements relational theory. Conflicting definitions of relations, and application of those definitions to different parts, might be why you seem to be arguing for something no one else here understands. Mathematical concepts are nice, but I don't want to have to know relational theory in order to get a list of cities. SQL and all the DBs I've used provide the least amount of surprise and the most amount of flexibility: you can get what you want, I can get what I want, can't we all just get along? ;)

        BTW, Public Relations people make all those claims about what a DB can do, and who is really surprised that PR people stretched the truth or even just plain lied? ;)

        Finally, the simplest answer. You ask "Why doesn't it return a set?" Answer: returning a set based on the columns in the SELECT clause would cause more surprise than the current implementation. You are free to GROUP BY or use DISTINCT in order to get your expected set, though.


        ___________
        Eric Hodges $_='y==QAe=e?y==QG@>@?iy==QVq?f?=a@iG?=QQ=Q?9'; s/(.)/ord($1)-50/eigs;tr/6123457/- \/|\\\_\n/;print;
        Thanks for the interesting debate, guys.

        It makes me want to do a bit of reading up on a lot of things I've taken for granted in the past.

        Then after the reading I can come back to this debate and work out what it all means. 8-)
Re: (OT) Why SQL Sucks (with a little Perl to fix it)
by tilly (Archbishop) on Dec 13, 2005 at 03:14 UTC
    Congratulations. You have just come to a design decision in Microsoft Access that bit me hard once.

    Access puts DISTINCT on all queries by default. I learned this the hard way as a very junior programmer when I was trying to figure out the following. There was a table of receipts. The table was queried to find the sum of the receipts. Then it was queried to pull back information about the receipts to dump into Excel. As a first sanity check, the Excel dump was summed and came up with a different total than Access had given!

    The problem was that the dump for Excel didn't include the primary key, so there were duplicate rows that were merged by Access. But those duplicate rows were not redundant information - the counts of how many you had were very important!

    So there was a real life case where DISTINCT clearly is not the right thing to do.
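    The anecdote above reduces to a one-line difference. With invented receipt amounts, summing a DISTINCT projection that omits the primary key silently drops money, exactly as the Excel dump did:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE receipts (id INTEGER PRIMARY KEY, amount INTEGER);
    INSERT INTO receipts VALUES (1, 100), (2, 100), (3, 50);
""")

true_total = con.execute("SELECT SUM(amount) FROM receipts").fetchone()[0]

# The dump omitted the primary key; DISTINCT then merges the two $100 receipts
dump = con.execute("SELECT DISTINCT amount FROM receipts").fetchall()
dump_total = sum(row[0] for row in dump)

print(true_total, dump_total)  # 250 150
```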

    Now with that said, I will read the book, but I'm not convinced that I will agree with it. My main problem with databases isn't that I encounter normalization problems or have trouble understanding the relational logic. It is that I am asked to do things with it that can't be expressed efficiently in a relational manner.

    Perhaps you consider that an optimization problem, and lazily wave it away saying the database should take care of it. But I can't. I need to solve those problems; the database isn't taking care of them for me. In fact, Oracle cheerfully tells me that it won't take care of them, and that if I want to take care of them myself, I need to use their analytic extensions.

    So while I'm willing to read their promotional literature, I'm not about to get the relational religion.

      It is that I am asked to do things with it that can't be expressed efficiently in a relational manner.

      I think it would be interesting to have a better description of such tasks. At the very least, some of our SQL-deficient readers will get an idea of things that don't work well in relational databases.

      One such example that comes to mind is storing hierarchies. Relational semantics and directed acyclic graphs don't mix nicely. :-)

      ---
      $world=~s/war/peace/g

        The example that I am most commonly interested in is maintaining running totals. For instance, given a table of payments and receipts, determine what our lowest account balance was.

        It isn't all that hard to write that in standard SQL. But the solutions that you get won't execute very quickly...
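        For the record, the analytic extensions mentioned above were later standardized as window functions, which turn the lowest-running-balance problem into a single pass. A sketch with invented ledger data (requires a SQLite new enough for window-function support, 3.25+):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ledger (seq INTEGER PRIMARY KEY, amount INTEGER);
    -- receipts positive, payments negative
    INSERT INTO ledger VALUES (1, 100), (2, -150), (3, 200), (4, -75);
""")

# SUM(...) OVER (ORDER BY seq) computes the running balance row by row;
# the outer MIN picks out the lowest point it ever reached.
row = con.execute("""
    SELECT MIN(balance) FROM (
        SELECT SUM(amount) OVER (ORDER BY seq) AS balance FROM ledger
    )
""").fetchone()
print(row[0])  # -50: the balance after the second entry
```

Without window functions, the same query needs a correlated subquery summing all earlier rows for each row, which is the quadratic-time formulation being complained about.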

Re: (OT) Why SQL Sucks (with a little Perl to fix it)
by Anonymous Monk on Dec 13, 2005 at 23:45 UTC
    I have no knowledge of "set theory", so I may be missing the mark here, but these are my thoughts on what you've presented as a "problem":

    How do you expect the database to know that you don't want duplicate rows if you don't tell it the rows shouldn't be duplicate? Moreover, how do you expect the database to know which columns comprise a duplicate row if you don't tell it?

    Take your example,
    +-------------+    +------------------+
    | Products    |    | Supplier_Product |
    +-------------+    +------------------+
    | PNO | PNAME |    | SNO     | PNO    |
    +-----+-------+    +---------+--------+
    | P1  | Screw |    | S1      | P1     |
    | P1  | Screw |    | S1      | P1     |
    | P1  | Screw |    | S1      | P2     |
    | P2  | Screw |    +------------------+
    +-------------+
    and the following example,
    +--------------------+    +-------------------------------+
    | Purchased_Items    |    | Package_Deliveries            |
    +--------------------+    +-------------------------------+
    | Date       | Item  |    | Delivery_Company | Date       |
    +------------+-------+    +------------------+------------+
    | 2005/12/01 | Screw |    | FedEx            | 2005/12/01 |
    | 2005/12/01 | Screw |    | FedEx            | 2005/12/01 |
    | 2005/12/01 | Screw |    | FedEx            | 2005/12/02 |
    | 2005/12/02 | Screw |    +-------------------------------+
    +--------------------+
    The two sets of tables have the same structure, but duplicate rows are valid in the second case. Three screws were purchased on December 1st, and FedEx dropped off two packages on December 1st. You don't know which screws belong to which package, but one would assume that you don't need to make that link if you structured the tables in this manner. Also, you may or may not want to use DISTINCT, depending on whether you want to know that three screws were delivered, or just that screws were delivered.


    If PNO and PNAME are supposed to be unique, put a constraint on the table preventing duplicates from getting inserted in the first place. If they aren't supposed to be unique, you need to tell the database when you want unique rows returned (i.e. DISTINCT), and when you don't.

Node Type: perlmeditation [id://515776]
Approved by Popcorn Dave
Front-paged by friedo