Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

SQL database architecture

by punchcard_don (Beadle)
on Nov 03, 2003 at 18:57 UTC ( #304211=perlquestion: print w/ replies, xml ) Need Help??
punchcard_don has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks,

I realize this is more of an SQL question, but all the middleware will be Perl querying a mySQL database using DBI, I promise.

Not looking for someone to tell me what to do, or do my job. I'm seeking methodological or, even better, quantitative, means to choose/justify one SQL database architecture over another where query performance is the priority.

Imagine this: A database holds students' school exam results. By the end of their scholastic career, each person has taken 500 exams. Every exam had 50 multiple-choice questions with answers numbered 1 to 9. Our starting point is a flat-file of records for 20,000 students of the form:

student1, exam1, answer1, answer2,..., answer50
student1, exam2, answer1, answer2,..., answer50
...
student1, exam500, answer1, answer2,..., answer50

student2, exam1, answer1, answer2,..., answer50
student2, exam2, answer1, answer2,..., answer50
...
student2, exam500, answer1, answer2,..., answer50

Now we want to compile statistics on who answered what. What percentage of total students who got question x in exam y right, also got question b of exam c right? In other words

SELECT COUNT(*) WHERE response-to-queston-x-on-exam-z=y AND response-to-queston-b-on-exam-d=c

and we want to do this over the web letting researchers specify the questions and answers they query on.

A few possible architectures:

1. Design the database exactly as the flat-file and take the intersection of

SELECT COUNT(*) WHERE (exam=z AND answeri=y)
with
SELECT COUNT(*) WHERE (exam=d AND answerj=c)

2. Append all the records for a given student to make one 2,500 column record per student and search

SELECT COUNT(*) WHERE exami_answerj=y AND examb_answerc=d

3. Give each of the possible 25,000 answers to each multiple-choice question a unique 5-digit identifier and its own column, then just signify if that answer was given by a "1" in its column, and search

SELECT COUNT(*) WHERE ansxxxxx AND ansyyyyy

4. Make a separate table for each possible answer with studentID as the key, then take interstection of searches.

5. Make a separate table for each question, leaving the multiple-choice answer in the 2,500 tables and searching for the desired answer

But as I said, I'm hoping for some method or analysis by which to make/justify the choice. "I think XXX would be best" won't impress my boss much. Is there a way to choose - preferably quantitatively? Or does one just have to build, benchmark, and repeat?

Comment on SQL database architecture
Re: SQL database architecture
by perrin (Chancellor) on Nov 03, 2003 at 19:54 UTC
    It sounds like you haven't done much work with relational databases before. In general, the answer is to make a good, normalized schema and then adjust it afterward if something doesn't perform the way you expected. It's unlikely that you will have any performance issues with such simple data on such a fast database, and a normalized schema is so much easier to work with in your program that it will save you time.

    If you aren't familiar with normalization, you can look it up on SuperSearch. I recall some good links that Ovid posted at one point. In this case, I would probably have a student table, an exam table, a question table, and a student_question table to hold answers.

      Dittos to perrin.

      That and optimized indexes are what is going to make the difference for maintainability and performance.

      Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
Re: SQL database architecture
by jdtoronto (Prior) on Nov 03, 2003 at 20:44 UTC
    Again, ditto's to Perrin.

    There is a number of good database tutorials on this site.

    But to add: If you use normalised tables the way Perrin has suggested, where the student table has an ID which is used in the table of exams take to link back to the student, and also in the exam questions answered then you can flexibly add and/or delete form teh questions answered and other tables without having to backtrack through the other tables.

    Putting in a 'multi-attribute' record (such as questions answered in your description), might have been okay in a system like Pick, which supported just this trick, but not in an RDBMS.

    So you would need a table for STUDENT, a table for EXAM, a table for QUESTION and an ANSWER table which ties together student and question.

    In this way you can then get all of the data you want with a series of fairly simple SQL constructs using JOINs to get the final result set. Any of the good books on MySQL (MySQL Reference Manual, from O'Reilly, MySQL by DuBois from SAMS are my two favourties and the MySQL Cookbook also from O'Reilly is very useful too) will give you an entre into SQL queries.

    jdtoronto

Re: SQL database architecture
by etcshadow (Priest) on Nov 03, 2003 at 21:00 UTC
    The other folks' comnents++... but to get you started, the obvious first guess at a database structure is this:

    The tables would be (choose your own naming convention... you don't have to agree with mine): EXAM, QUESTION, ANSWER, and STUDENT, STUDENT_EXAM, QUESTION_ANSWER.

    • EXAM would hold one row for each exam that was given.
    • QUESTION would hold one row for each question on each exam; it would hold a foreign-key to EXAM
    • ANSWER would hold a row for each multiple choice answer for the various questions; it would hold a foreign-key to QUESTION
    • STUDENT would hold a row for each student
    • STUDENT_EXAM would hold a row for each student, for each exam; it would hold a foreign-key to both STUDENT and to EXAM
    • QUESTION_ANSWER would hold one row for each QUESTION on each STUDENT_EXAM; it would hold foreign-keys to STUDENT_EXAM, to ANSWER, and to QUESTION (the foreign-key to QUESTION might be somewhat redundant (since you could tell which QUESTION the ANSWER was for) but only if it were a multiple-choice question and the student did not leave the answer blank... and your data-model might not want to enforce that constraint)

    Basically, EXAM, QUESTION and ANSWER represent the abstract exam... a template for each possible student exam. The real data of interest ends up residing in QUESTION_ANSWER... this is where you ultimately say that student "Fred", on exam 3, answered "c" to question 17.

    Anyway, if that is a good start for you, then good luck... if it zoomed right over your head, then you need to read yourself up a little bit on relational databases. Then read this again.

    Update:
    jdtoronto's reply hadn't been posted yet, when I started writing my reply... which is somewhat redundant of his (sorry, JD... typing lag).

    Anyway, just want to avoid confusion: the table-names he (or she?) uses do not mean the same things as mine do. Where he says EXAM, I say STUDENT_EXAM, and where he says ANSWER, I say QUESTION_ANSWER. Maybe I would have been more clear if I had said "POSSIBLE_ANSWER and ANSWER" as opposed to "ANSWER and QUESTION_ANSWER" (respectively)... oh, well.


    ------------
    :Wq
    Not an editor command: Wq
Re: SQL database architecture
by graff (Chancellor) on Nov 04, 2003 at 03:43 UTC
    If you had any sort of design that put a single answer per row in some table, the nature of the questions being asked would require that the row also identifies the particular student and the particular exam for the given answer. That table would have 500 million rows (20,000 students * 500 exams * 50 answers per exam). A single flat table of this sort would handle the application you describe, but I'd be uncertain about the response time for querying a table of this size.

    On the other hand, your existing flat file has 10 million rows (20,000 students * 500 exams) -- a lot less that 500 million, but then each row has an additional 48 columns.

    So if some sort of flat table seems appropriate (and frankly, it probably is, given your description), then the choice of flat table design would hinge on relative performance given 50 times more rows vs. 48 additional columns per row.

    If it were me, the first thing I'd try would be fewer rows / more columns. In this case, your initial guess about the kind of query operation was a bit off. No doubt you would figure this out yourself as soon as you set to work on it in earnest, but I think the queries would be more like this:

    -- count people who said "a" on exam 1,question 20 and -- "c" on exam 43, question 37: select count(*) from the_table where exam_id=1 and answer20='a' and student_id in (select student_id from the_table where exam_id=43 and answer37='c')
    Of course, as soon as you need to intersect on three or more different questions, each from a different exam, you'll probably want to try using a perl hash array with a set of simple queries: each query just selects student_id's where 'exam_id=? and answer_X=?", and you increment hash elements keyed by student_ids after each query. Then pick out the hash keys whose value equals the number of queries performed.

    It would probably be worthwhile to test the two flat-table alternatives (4 cols * lots of rows vs. 52 cols * fewer rows) on some suitably large sample of data, to see which one is easier/faster/more flexible.

      Hmmm... as far as jamming all of the questions into the exam table, I'd argue that you're falling into an all-to-common trap: you're prematurely optimizing. Worse, you're trading potentially useful flexibility and definite ease-of-use for performance... and before you've seen whether or not this delta in performance is important.

      Addtitionally, I wouldn't guarantee that there is a significant performance gain in what you describe (squeezing what *should* be an extender-table into columns of a table). Why do I refute the claims to increased performance? Well, I'll break it down into the two main components for database performance: CPU time and I/O (either disk or buffer).

      CPU: you're adding the same number of values, either way... by adding 1 out of 50 values in 50 times as many rows, or adding one value for each row (but only 1/50th the number of rows)... it's all the same amount of time spent summarizing whatever data you are querying.

      I/O: it's important to understand how databases organize data on disks (and bear in mind that I/O buffers in memory usually mirror the disk structure... just faster, so keep in mind that what I say about disks is pretty much applicable to ram buffers, too). They do so by sticking rows of the table, packed one against another, onto disk blocks. The "wider" the row (that is, the more bytes it takes to store all of the row-data) the fewer rows can fit on any one disk block. The limiting factor in terms of I/O (again, whether from disk or buffer) is the number of blocks that have to be read. If the table is 50 times wider, and 1/50th the number of rows, it will occupy roughly the same number of disk blocks, and therefore, it will take just as long to read in the whole table.

      Of course, if the query is highly directed and can make use of indexes, then, again, there's no performance difference.


      ------------
      :Wq
      Not an editor command: Wq
        Yes, these are very good points -- it's the same quantity of data no matter how you slice it up. I just said that the wider table would be the first thing I'd try -- and I did say I'd want to test the other way as well. 8^)

        Since the wide table is just like the original flat file, there's a slight bit less work setting up and loading the DB, but that's not necessarily a good reason to pick it. Meanwhile, with the narrower, longer table, the example queries would be of the form:

        select count(*) from long_table where exam_id=? and question_id=? and +answer=? and student_id in (select student_id from long_table where exam_id=? and question_id=? and answer=?)
        (with the parameters for the sub-query being different from those for the main (count()) query); and for more complicated conditions, doing queries for single "exam_id/question_id/answer" tuples and tracking the intersection through a perl hash array.

        Looking at it this way, I have to agree with you, that the long/narrow table will likely end up being better over all, because a single query with three paramaterized conditions ("where exam_id=? and question_id=? and answer=?") will do the work of 50 different queries that would all have to be constructed and prepared separately on a wide table:

        "where exam_id=? and answer_$x=?" for $x (1..50);
        And of course, there are bound to be flexibility issues that the OP didn't mention (or hasn't realized), and the narrow table will make those easier as well.

        update: As etcshadow surmised below, I am not especially adept at doing SQL joins; in the absence of proper training in RDB methods, I tend to fall back on habits of thought I've learned in other (non-RDB) areas. Up to now my limited skill with joins applied merely to combining distinct data from related tables based on a shared key, not trying to combine "disjoint" conditions on a single flat table, as shown in tilly's reply below; I'm very grateful and much better off for having seen that -- thanks! (And apologies for the off-beat advice above.)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://304211]
Approved by graff
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (8)
As of 2014-09-17 20:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (99 votes), past polls