in reply to Re^3: (OT) Why SQL Sucks (with a little Perl to fix it) in thread (OT) Why SQL Sucks (with a little Perl to fix it)
You know, I think you and I are largely in agreement on what many of the core ideas are yet we differ about what we think valid behavior should be, though we'll quibble in the margins.
A table with bad data in it is an improperly-designed table. A properly designed table cannot, by virtue of it being properly designed, have bad data.
I'll agree to stand corrected if and only if I can also require that there is a unique constraint across all columns. (Simply think that providing a primary key may satisfy that constraint but they miss that it may violate 3NF -- see my response to Celada for more discussion).
As for your comments about "relations", I can only refer you to Date's work. It would take me a loooong time to write out exactly what I am referring to when I say "relation". This term goes back to Codd's initial formulation of the relational model and is loosely analogous to what we refer to as a table.
For the moment, ignoring implementation in favor of behavior, if I'm asking for those cities, what's the logical reason for preferring a bag instead of a set? Can you show how this will consistently lead to more correct results?
(I've a sneaky feeling that neither of us is going to give much ground here :)
Re^5: (OT) Why SQL Sucks (with a little Perl to fix it)
by BrowserUk (Patriarch) on Dec 11, 2005 at 21:52 UTC
|
if I'm asking for those cities, what's the logical reason for preferring a bag instead of a set?
Take a simple table with columns Name(Primary key (ignoring the possibility of two John Smiths for now) and age. You do select Age from table Table;. Your purpose is to calculate the average age. If the results set returned did not include all the ages, including duplicates, your calculation would be wrong.
The tuples in the tabke are unique, but you aren't requesting tuples, you are requesting a substring of each tuple.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
|
That's not a general rule for preferring bags to sets, it's an exception to the rule.
In any event, I think it's also a mistake. You want two aggregate functions, SUM and COUNT. If you're iterating over duplicate results like that, you've made a mistake. Ignoring that most (all?) SQL implementations support an AVG operator, you could always do this:
SELECT SUM(age) / COUNT(age) AS average_age FROM table
So with all due respect, even your special case doesn't seem to me to be applicable.
Of course, that's brings up a related issue:
SELECT SUM(customer_id) / COUNT(customer_id) FROM customers
That's clearly nonsense but it's quite legal. If SQL followed the relational model it wouldn't even compile.
| [reply] [d/l] [select] |
|
But it is *not* a special case. In reductio ad absurdum, what you're suggesting is that no results set could ever contain duplicate fields. So,
- Your company personnel database could not be allowed to have two people with the same name, because if it did, and you asked for a print out of all the names, it would violate your rule.
- And you couldn't employ two people with the same age, or date of birth.
- Hmmm. What about number of legs or eyes or ears?
Now I'm getting absurd, but that is what I set out to do.
I guess your response will be: "It doesn't make any sense to produce any of those results sets. What is the point of a list of names with duplicates, if you cannot distinguish between the duplicates". That presupposes that there is no legitimate use for such a list of names, but you cannot know that.
Name tags for the company beano, or name plates for office doors, lockers or mailslots.
It doesn't matter that the two John Smiths have identical badges/labels. Those that know them will know the difference, and those that don't, won't care. They'll just say "Hi John, what do you do"? Anyone who needs to know the difference will. Including them.
Another example. You are shipping boxes of product X to customers. Your business rules, encapsulated within your database, know how many X will fit inside each type of box, how much those will weigh, and how much they will cost to ship. That's all worked out and stored inside your DB. Now you need to produce a set of shipping labels to stick on those boxes. Where a customer has ordered more items than fit in your largest box, or a number that better fit into 2 smaller boxes, the query will produce an identical label for each box going to that customer.
It doesn't matter which label goes on which box, (nor which individual Xs go in which box), there is simply no good reason to distinguish them.
It does matter that both (all) boxes get labelled, and for the DB to quietly fail to produce the duplicates for some high falutin' principle or mathematical reasoning is bogus. Yes, it would be possible to add a distinguishing box number to the DB to make them distinct--but it's equally possible to use the DISTINCT keyword whenever that's what you need.
And that latter course does not require you to add artificial fields to your DB tables just to satisfy some principle.
This is very similar to that other hoary ol' chestnut of the to be, or not to be relational debate.
Storing hierarchical data in a relational database is a pain. Yes, the relational gurus will tell you that it is perfectly feasible to do so, and there is an example (by tilly?) somewhere around this site. But it requires a mitt-full of non-data "auxiliary fields", and one (or more) artificial auxiliary tables per level of hierarchy, to achieve the structuring; the real data has to be duplicated in several places to achieve it; you have to preprocess your data before insertion in order to invent the auxiliary field values; and every 'simple' query of that data the depends upon the it's structure for the correct selection becomes a multi level affair requiring a general purpose program to piece the hierarchy back together. It is, in every way a PITA to do. There are very, very few people that can sit down and remember how to do it without looking it up.
This is wrong. The world, and especially the commercial world is full of hierarchical data. Having to pre- and post process that data and add all manner of extra, DB-only numbers to it in order to get it in and out of your DB, simply so that your DB can pretend the world and everything in it can be represented as a 2D grid of tuples and relations is daft!.
Does that mean that relational DB are useless, and we should return to the bad old days of hierarchical DBs? No, obviously not.
What is does mean, is that whenever you try to impose one, mathematically pure view onto everything in the world, you create as many problems as you fix. What's wrong with having two datamodels--a relational one and a hierarchical one--that coexist and interoperate?
Many data, and many uses of data, elegantly and efficiently fit the relational view. Many data and uses do not. There are so many examples of hierarchical data in the real world, bending them all to comply with a mathematical principle is ....
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
Re^5: (OT) Why SQL Sucks (with a little Perl to fix it)
by dragonchild (Archbishop) on Dec 12, 2005 at 02:38 UTC
|
Simply think that providing a primary key may satisfy that constraint but they miss that it may violate 3NF
Yes, because a PK is 1NF, not 3NF. There's two more normalizations required before you get to 3NF. :-)
For the moment, ignoring implementation in favor of behavior, if I'm asking for those cities, what's the logical reason for preferring a bag instead of a set? Can you show how this will consistently lead to more correct results?
Let's start with one fact - the SELECT clause is handed a set of tuples and it applies a collection of vertical slices upon said set. So, your question can be rephrased as "What's the reason for not applying set semantics (SELECT DISTINCT) to said slice(s)?"
There are two reasons:
- The number of rows in the resultset (which is a set if all the initial tables are sets) shouldn't change depending on what vertical elements you want to see exposed. Think about an Excel spreadsheet. If you hide column D, does the number of rows change? That's all you're doing in a SELECT clause.
- The performance hit for applying a DISTINCT is quite high. Try doing a relatively complex query both as SELECT (returning duplicates, as you see them) and SELECT DISTINCT (removing what you see as duplicates) and time them. Now, imagine that on every single query you will ever run in your life. And, no, there aren't that many more performance gains to be made.
My criteria for good software:
- Does it work?
- Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
| [reply] |
Re^5: (OT) Why SQL Sucks (with a little Perl to fix it)
by Theory (Beadle) on Dec 12, 2005 at 00:16 UTC
|
I'll agree to stand corrected if and only if I can also require that there is a unique constraint across all columns.
I don't think that's what you want. Every row (tuple) has to be distinct from every other row. Having a single primary key column makes that so. The problem comes in when you try to query that table. Given this table:
id | city | country
----+--------+---------
1 | Paris | France
2 | London | England
3 | Paris | Texas
4 | Paris | France
The presence of the primary key (id) makes it a proper relation. However, when we try to query it like this:
select city, country from test;
city | country
--------+---------
Paris | France
London | England
Paris | Texas
Paris | France
That is not a relation, because it has a duplicate in it. The problem isn't that the table isn't a relation, but that the result of our SQL query isn't a relation. So what is it? This gets to dragonchild's other point, to which you said:
I'm asking for those cities, what's the logical reason for preferring a bag instead of a set? Can you show how this will consistently lead to more correct results?
As I said the result of the above query is not a relation, but a report on the contents of the columns I queried, with the number of rows replicated. It makes sense in that it reports how many rows they are, and SQL is a row-oriented reporting language. So it is consistently more correct if my question is "What are the contents of the city and country colunns in all rows?" Of course, that's not usually a very useful question, but it is at least consistent with how SQL works.
So that's how it makes sense. And as you and I agree, that makes it unrelational. It's a report on rows in a table, not a set created by querying a relation.
—Theory | [reply] |
|
| [reply] |
|
CREATE TABLE supplier (
id SERIAL PRIMARY KEY,
name TEXT
);
INSERT INTO supplier (name) VALUES ('Apple'); -- Computer
INSERT INTO supplier (name) VALUES ('Microsoft');
INSERT INTO supplier (name) VALUES ('Apple'); -- Records
SELECT * FROM supplier;
id | name
----+-----------
1 | Apple
2 | Microsoft
3 | Apple
This table is in the third normal form, in so far as I understand it, because "none of the non-primary key attributes is a fact about any other non-primary key attribute". However, you'll notice that there is a duplicate, in that "Apple" is listed twice. That's because there are two different companies named "Apple". So this table is technically in 3NF, but you cannot have a unique index across all of its columns.
That's not to say that this is a good example, or that you couldn't work around this issue in various ways, but it demonstrates, I hope, that 3NF does not mandate that all columns be unique.
—Theory | [reply] |
Re^5: (OT) Why SQL Sucks (with a little Perl to fix it)
by demerphq (Chancellor) on Dec 12, 2005 at 08:45 UTC
|
I'll agree to stand corrected if and only if I can also require that there is a unique constraint across all columns.
Unique constraints like the sort you mean are normally handled through the creation of unique indexes on the columns involved. A properly designed table should forbid the insertion of bad data, either through column level constraints, row level constraints, unique indexes or triggers.
Simply think that providing a primary key may satisfy that constraint but they miss that it may violate 3NF
The tables violated normalization prior to adding the index, all the index does is allow you to deal with the fact, something that you can't do very well without it. If you could explain why that table makes more sense without the id column than with it then you might have a case, but I dont think you can.
---
$world=~s/war/peace/g
| [reply] |
|
|