Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Re^4: (OT) Why SQL Sucks (with a little Perl to fix it)

by Ovid (Cardinal)
on Dec 11, 2005 at 20:06 UTC ( [id://515897]=note: print w/replies, xml ) Need Help??


in reply to Re^3: (OT) Why SQL Sucks (with a little Perl to fix it)
in thread (OT) Why SQL Sucks (with a little Perl to fix it)

You know, I think you and I are largely in agreement on what many of the core ideas are yet we differ about what we think valid behavior should be, though we'll quibble in the margins.

A table with bad data in it is an improperly-designed table. A properly designed table cannot, by virtue of it being properly designed, have bad data.

I'll agree to stand corrected if and only if I can also require that there is a unique constraint across all columns. (Simply think that providing a primary key may satisfy that constraint but they miss that it may violate 3NF -- see my response to Celada for more discussion).

As for your comments about "relations", I can only refer you to Date's work. It would take me a loooong time to write out exactly what I am referring to when I say "relation". This term goes back to Codd's initial formulation of the relational model and is loosely analogous to what we refer to as a table.

For the moment, ignoring implementation in favor of behavior, if I'm asking for those cities, what's the logical reason for preferring a bag instead of a set? Can you show how this will consistently lead to more correct results?

(I've a sneaky feeling that neither of us is going to give much ground here :)

Cheers,
Ovid

New address of my CGI Course.

  • Comment on Re^4: (OT) Why SQL Sucks (with a little Perl to fix it)

Replies are listed 'Best First'.
Re^5: (OT) Why SQL Sucks (with a little Perl to fix it)
by BrowserUk (Patriarch) on Dec 11, 2005 at 21:52 UTC
    if I'm asking for those cities, what's the logical reason for preferring a bag instead of a set?

    Take a simple table with columns Name(Primary key (ignoring the possibility of two John Smiths for now) and age. You do select Age from table Table;. Your purpose is to calculate the average age. If the results set returned did not include all the ages, including duplicates, your calculation would be wrong.

    The tuples in the tabke are unique, but you aren't requesting tuples, you are requesting a substring of each tuple.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      That's not a general rule for preferring bags to sets, it's an exception to the rule.

      In any event, I think it's also a mistake. You want two aggregate functions, SUM and COUNT. If you're iterating over duplicate results like that, you've made a mistake. Ignoring that most (all?) SQL implementations support an AVG operator, you could always do this:

      SELECT SUM(age) / COUNT(age) AS average_age FROM table

      So with all due respect, even your special case doesn't seem to me to be applicable.

      Of course, that's brings up a related issue:

      SELECT SUM(customer_id) / COUNT(customer_id) FROM customers

      That's clearly nonsense but it's quite legal. If SQL followed the relational model it wouldn't even compile.

      Cheers,
      Ovid

      New address of my CGI Course.

        But it is *not* a special case. In reductio ad absurdum, what you're suggesting is that no results set could ever contain duplicate fields. So,

        • Your company personnel database could not be allowed to have two people with the same name, because if it did, and you asked for a print out of all the names, it would violate your rule.
        • And you couldn't employ two people with the same age, or date of birth.
        • Hmmm. What about number of legs or eyes or ears?

        Now I'm getting absurd, but that is what I set out to do.

        I guess your response will be: "It doesn't make any sense to produce any of those results sets. What is the point of a list of names with duplicates, if you cannot distinguish between the duplicates". That presupposes that there is no legitimate use for such a list of names, but you cannot know that.

        Name tags for the company beano, or name plates for office doors, lockers or mailslots.

        It doesn't matter that the two John Smiths have identical badges/labels. Those that know them will know the difference, and those that don't, won't care. They'll just say "Hi John, what do you do"? Anyone who needs to know the difference will. Including them.

        Another example. You are shipping boxes of product X to customers. Your business rules, encapsulated within your database, know how many X will fit inside each type of box, how much those will weigh, and how much they will cost to ship. That's all worked out and stored inside your DB. Now you need to produce a set of shipping labels to stick on those boxes. Where a customer has ordered more items than fit in your largest box, or a number that better fit into 2 smaller boxes, the query will produce an identical label for each box going to that customer.

        It doesn't matter which label goes on which box, (nor which individual Xs go in which box), there is simply no good reason to distinguish them.

        It does matter that both (all) boxes get labelled, and for the DB to quietly fail to produce the duplicates for some high falutin' principle or mathematical reasoning is bogus. Yes, it would be possible to add a distinguishing box number to the DB to make them distinct--but it's equally possible to use the DISTINCT keyword whenever that's what you need.

        And that latter course does not require you to add artificial fields to your DB tables just to satisfy some principle.

        This is very similar to that other hoary ol' chestnut of the to be, or not to be relational debate.

        Storing hierarchical data in a relational database is a pain. Yes, the relational gurus will tell you that it is perfectly feasible to do so, and there is an example (by tilly?) somewhere around this site. But it requires a mitt-full of non-data "auxiliary fields", and one (or more) artificial auxiliary tables per level of hierarchy, to achieve the structuring; the real data has to be duplicated in several places to achieve it; you have to preprocess your data before insertion in order to invent the auxiliary field values; and every 'simple' query of that data the depends upon the it's structure for the correct selection becomes a multi level affair requiring a general purpose program to piece the hierarchy back together. It is, in every way a PITA to do. There are very, very few people that can sit down and remember how to do it without looking it up.

        This is wrong. The world, and especially the commercial world is full of hierarchical data. Having to pre- and post process that data and add all manner of extra, DB-only numbers to it in order to get it in and out of your DB, simply so that your DB can pretend the world and everything in it can be represented as a 2D grid of tuples and relations is daft!.

        Does that mean that relational DB are useless, and we should return to the bad old days of hierarchical DBs? No, obviously not.

        What is does mean, is that whenever you try to impose one, mathematically pure view onto everything in the world, you create as many problems as you fix. What's wrong with having two datamodels--a relational one and a hierarchical one--that coexist and interoperate?

        Many data, and many uses of data, elegantly and efficiently fit the relational view. Many data and uses do not. There are so many examples of hierarchical data in the real world, bending them all to comply with a mathematical principle is ....


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re^5: (OT) Why SQL Sucks (with a little Perl to fix it)
by dragonchild (Archbishop) on Dec 12, 2005 at 02:38 UTC
    Simply think that providing a primary key may satisfy that constraint but they miss that it may violate 3NF

    Yes, because a PK is 1NF, not 3NF. There's two more normalizations required before you get to 3NF. :-)

    For the moment, ignoring implementation in favor of behavior, if I'm asking for those cities, what's the logical reason for preferring a bag instead of a set? Can you show how this will consistently lead to more correct results?

    Let's start with one fact - the SELECT clause is handed a set of tuples and it applies a collection of vertical slices upon said set. So, your question can be rephrased as "What's the reason for not applying set semantics (SELECT DISTINCT) to said slice(s)?"

    There are two reasons:

    1. The number of rows in the resultset (which is a set if all the initial tables are sets) shouldn't change depending on what vertical elements you want to see exposed. Think about an Excel spreadsheet. If you hide column D, does the number of rows change? That's all you're doing in a SELECT clause.
    2. The performance hit for applying a DISTINCT is quite high. Try doing a relatively complex query both as SELECT (returning duplicates, as you see them) and SELECT DISTINCT (removing what you see as duplicates) and time them. Now, imagine that on every single query you will ever run in your life. And, no, there aren't that many more performance gains to be made.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re^5: (OT) Why SQL Sucks (with a little Perl to fix it)
by Theory (Beadle) on Dec 12, 2005 at 00:16 UTC

    I'll agree to stand corrected if and only if I can also require that there is a unique constraint across all columns.

    I don't think that's what you want. Every row (tuple) has to be distinct from every other row. Having a single primary key column makes that so. The problem comes in when you try to query that table. Given this table:

     id |  city  | country 
    ----+--------+---------
      1 | Paris  | France
      2 | London | England
      3 | Paris  | Texas
      4 | Paris  | France
    

    The presence of the primary key (id) makes it a proper relation. However, when we try to query it like this:

     select city, country from test;
      city  | country 
    --------+---------
     Paris  | France
     London | England
     Paris  | Texas
     Paris  | France
    

    That is not a relation, because it has a duplicate in it. The problem isn't that the table isn't a relation, but that the result of our SQL query isn't a relation. So what is it? This gets to dragonchild's other point, to which you said:

    I'm asking for those cities, what's the logical reason for preferring a bag instead of a set? Can you show how this will consistently lead to more correct results?

    As I said the result of the above query is not a relation, but a report on the contents of the columns I queried, with the number of rows replicated. It makes sense in that it reports how many rows they are, and SQL is a row-oriented reporting language. So it is consistently more correct if my question is "What are the contents of the city and country colunns in all rows?" Of course, that's not usually a very useful question, but it is at least consistent with how SQL works.

    So that's how it makes sense. And as you and I agree, that makes it unrelational. It's a report on rows in a table, not a set created by querying a relation.

    —Theory

      Go back to that post and click the "response to Celada" link I had. I made it clear that the presense of a unique key is not enough. The duplicates are an issue in that case because the table is not in 3NF. If the table is is 3NF, then that unique constraint across all keys holds.

      Cheers,
      Ovid

      New address of my CGI Course.

        If the table is is 3NF, then that unique constraint across all keys holds.

        I don't think that's quite true. Consider this simple example:

        CREATE TABLE supplier (
          id SERIAL PRIMARY KEY,
          name TEXT
        );
        
        INSERT INTO supplier (name) VALUES ('Apple');     -- Computer
        INSERT INTO supplier (name) VALUES ('Microsoft');
        INSERT INTO supplier (name) VALUES ('Apple');     -- Records
        
        SELECT * FROM supplier;
         id |   name    
        ----+-----------
          1 | Apple
          2 | Microsoft
          3 | Apple
        

        This table is in the third normal form, in so far as I understand it, because "none of the non-primary key attributes is a fact about any other non-primary key attribute". However, you'll notice that there is a duplicate, in that "Apple" is listed twice. That's because there are two different companies named "Apple". So this table is technically in 3NF, but you cannot have a unique index across all of its columns.

        That's not to say that this is a good example, or that you couldn't work around this issue in various ways, but it demonstrates, I hope, that 3NF does not mandate that all columns be unique.

        —Theory

Re^5: (OT) Why SQL Sucks (with a little Perl to fix it)
by demerphq (Chancellor) on Dec 12, 2005 at 08:45 UTC

    I'll agree to stand corrected if and only if I can also require that there is a unique constraint across all columns.

    Unique constraints like the sort you mean are normally handled through the creation of unique indexes on the columns involved. A properly designed table should forbid the insertion of bad data, either through column level constraints, row level constraints, unique indexes or triggers.

    Simply think that providing a primary key may satisfy that constraint but they miss that it may violate 3NF

    The tables violated normalization prior to adding the index, all the index does is allow you to deal with the fact, something that you can't do very well without it. If you could explain why that table makes more sense without the id column than with it then you might have a case, but I dont think you can.

    ---
    $world=~s/war/peace/g

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://515897]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having an uproarious good time at the Monastery: (3)
As of 2024-04-19 21:45 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found