http://www.perlmonks.org?node_id=379296

You're working on a non-trivial application and you have a set of classes that correspond to how your application moves through the data. Since most non-trivial applications need some sort of permanent datastore, you need to put those data structures somewhere. You've been using Storable, but that isn't searchable on disk. Everyone's been talking about these database things, so you decide to throw MySQL / PostgreSQL / SQLite / whatever onto your box and use it to reduce your RAM requirements. You map each class to a table, each attribute to a column in that table, and represent your has-a relationships with foreign keys. You add a bunch of indices and create a lazy-accessor mechanism to reduce both your RAM requirements and your requests to disk. You've done everything correctly, right?

Well, you might have. If everything works for you, then there's no problem. But, if you need more performance out of your existing system(s), I have a few tips for you.

  1. Database decomposition is very different from class decomposition

    When you decompose your domain into a set of classes, you are looking for both behavior and data. There's a tradeoff between the two, so you might fudge a little on decomposing for data when it makes behavior easier to work with; that's the tradeoff I have seen made most often. Since objects are really little bundles of behaviors, you want to make sure that your behaviors are orthogonal across your classes. It makes no sense for instances of two different classes to be able to modify the balance of your checking account. But, as an implementation detail, it's ok for instances of two different classes to hold some of the details of your checking account, so long as they lack the ability to modify it.

    Databases, especially RDBMSes, are different. A database does not contain both behavior and data. A database is solely data storage with a number of tricks to get exactly the data you want very very quickly. It doesn't have behaviors.1 When you decompose your domain into a set of tables and columns, you are looking for three things:

    1. What are the abstractions of my data?
    2. How can I keep from repeating information?
    3. How do I link those abstractions together with has-a relationships? (RDBMSes don't have is-a relationships.)

    These three concepts are known as normalization. There are hundreds of articles on the web describing the myriad ways to achieve it. Basically, you need to represent each piece of information exactly once, and be able to look each piece of information up quickly using a key.

    The reasons you want your schema to be at least mostly normalized are:

    • the RDBMS engine can usually find your data faster
    • your data is much less likely to be corrupted
    • it is much easier for someone else to come in and understand what's going on

    As with most things, you can overdo it. Plus, there are times you will want to deliberately de-normalize your schema. I currently do this with a reporting database, to minimize the number of table-joins I have to do, given the fact that my data is static while the reports are running.
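
    As a tiny sketch of what that means in SQL (table and column names invented for illustration): in the first schema below, a customer's address is repeated on every order, so updating it means hunting down every copy; in the normalized version it lives in exactly one row, and each order points to it with a foreign key.

      -- Denormalized: the address is repeated on every order.
      CREATE TABLE orders_denormalized (
          id            INTEGER PRIMARY KEY,
          customer_name TEXT,
          customer_addr TEXT,
          item          TEXT
      );

      -- Normalized: the address is stored exactly once.
      CREATE TABLE customer (
          id   INTEGER PRIMARY KEY,
          name TEXT NOT NULL,
          addr TEXT NOT NULL
      );

      CREATE TABLE orders (
          id          INTEGER PRIMARY KEY,
          customer_id INTEGER NOT NULL REFERENCES customer(id),
          item        TEXT NOT NULL
      );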

  2. Well-written OO code is inefficient

    It is commonly accepted that the number of function calls in OO code comes with a certain amount of overhead. Normally, that overhead is acceptable, given the gains in maintainability and programmer productivity. However, unless a lot of care is taken, you can easily overwhelm your database with a lot of small queries, most of which are hidden under layers upon layers of classes.

    A simple example would be asking an OO hierarchy how much is in your checking account.

    1. An instance of Person is created (database hit)
    2. An instance of Bank is created (database hit)
    3. An instance of Account is created (database hit)
    4. A query is made from Person to Bank to Account for current_balance

    Sounds simple, right? Well, each database hit might have actually been up to five (or more) queries. Each one is small, less than a hundredth of a second on average. But, just to find out how much is in your checking account, you used about a tenth of a second to execute your queries. That doesn't sound like much until you realize that there were up to three connections to the database. On Oracle, that's around 1.2M of RAM taken up. MySQL is better, but it's still almost 200K of RAM just for the right to talk to the database. This amount doesn't count the actual RAM used for the resultset of those queries. (One application I've worked with routinely created a connection to an Oracle database for every single query, often having over 1,000 connections to the server at a time. It didn't always close those connections when the query was done. Talk about your memory-leak-at-a-distance!)

    Furthermore, each query has to be composed by the application, sent to the database, parsed, optimized, then executed by the database. After all that, the data has to be sent back to the application which parses the resultset and builds and populates the appropriate object. Depending on the load of the various machines involved and the network(s) in between, that query that took a hundredth of a second to execute might have taken upwards of a tenth of a second from composition to final object population. So, your roughly ten queries can take up to a second to fully process through.

    You don't notice this total time because this isn't happening in serial - it's happening in parallel. (Hence, the three connections to the database instead of just one.) But, instead of one simple query taking a hundredth of a second, you have had to build all the scaffolding of objects, taking over a second of total processing power, just to ask one question. That 100:1 (and worse) ratio is something I've seen when using those OO/SQL tools, and the major reason why many developers refuse to use them.
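
    To make the contrast concrete, here is a rough sketch (DBI with SQLite assumed; the schema and names are invented for illustration) of answering the balance question with one joined query instead of a pile of per-object queries:

      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect( 'dbi:SQLite:dbname=bank.db', '', '',
                              { RaiseError => 1 } );
      my $person_id = 42;    # hypothetical

      # The object scaffolding effectively runs many small queries:
      #   SELECT * FROM person  WHERE id = ?
      #   SELECT * FROM bank    WHERE id = ?
      #   SELECT * FROM account WHERE id = ?
      # ...plus whatever else each constructor loads along the way.

      # One joined query answers the actual question directly:
      my ($balance) = $dbh->selectrow_array(
          'SELECT a.current_balance
             FROM person p
             JOIN account a ON a.person_id = p.id
            WHERE p.id = ? AND a.type = ?',
          undef, $person_id, 'checking' );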

  3. Getting the most out of a database is HARD

    RDBMSes are designed around the idea of a set of data. Unless you're a math major, you may never have heard of set theory, let alone understood how it can affect you. And, because "Set Theory" is a scary phrase, no one ever tells you what it means.

    Basically, a set is a group of stuff. You can add stuff to the set, remove stuff, and ask "Is this stuff part of the set?". You can also, given more than one set, ask for everything in either set (union), everything in both sets (intersection), or everything in set 1 that isn't in set 2 (usually implemented by some MINUS operator).

    All SQL does is provide you with a way of creating a set of data, then restricting it on certain criteria. If you think about SQL as a way of using set theory, it's a lot easier to work with.
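
    For example (ANSI syntax with invented table names; Oracle spells EXCEPT as MINUS):

      -- everything in either set (union)
      SELECT customer_id FROM checking_account
      UNION
      SELECT customer_id FROM savings_account;

      -- everything in both sets (intersection)
      SELECT customer_id FROM checking_account
      INTERSECT
      SELECT customer_id FROM savings_account;

      -- everything in set 1 that isn't in set 2
      SELECT customer_id FROM checking_account
      EXCEPT
      SELECT customer_id FROM savings_account;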

    Why do I bring this up? Well, very few people think about objects as representing sets or set operations, even when they really are. And, if you are looking at a database as an extension of your objects, you're going to mis-design your database schema.

  4. Classes are not tables and attributes are not columns

    In other words, there is (almost) never a 1-1 relationship between them. A very good example of this would be a class that has a large number of optionally-set attributes. A naive implementation of this situation would be to create a table for the class and have potentially dozens of columns, most of which are set to NULL for any given row. A better implementation is to split the class across three tables.2

    • A table for the class and the required attributes
    • A table listing the optional attributes
    • A table cross-referencing each row from the class table with rows from the attributes table. This kind of table would contain three columns:
      Entity, Attribute, Value

    You would use two queries to populate an instance of this class.

    1. The query to the class table, populating the required attributes.
    2. The query to the cross-reference table, populating the optional attributes. This query would look something like:
      SELECT attribute.name
           , cross_ref.value
        FROM attribute
           , cross_ref
       WHERE cross_ref.entity = ?
         AND cross_ref.attribute = attribute.id;
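
    As a rough Perl sketch of those two queries (DBI assumed; the entity table name and id are invented for illustration):

      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect( 'dbi:SQLite:dbname=app.db', '', '',
                              { RaiseError => 1 } );
      my $id = 42;    # hypothetical entity id

      # Query 1: the class table supplies the required attributes.
      my $self = $dbh->selectrow_hashref(
          'SELECT * FROM entity WHERE id = ?', undef, $id );

      # Query 2: the cross-reference table fills in the optional ones.
      my $sth = $dbh->prepare(
          'SELECT attribute.name, cross_ref.value
             FROM attribute, cross_ref
            WHERE cross_ref.entity = ?
              AND cross_ref.attribute = attribute.id' );
      $sth->execute($id);
      while ( my ($name, $value) = $sth->fetchrow_array ) {
          $self->{$name} = $value;
      }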

This is not to say that mapping classes to tables cannot be done or should never be done. Class::DBI and similar distributions provide a very good solution to a niche problem. Smaller applications, and certain subsystems of larger applications which do not have a performance component, can benefit greatly from the ease-of-development that CDBI provides. But, without a lot of customization, those kinds of tools do not scale well.3 If you're going to provide CDBI with that kind of customization, I don't see why you wouldn't take the time to design your schema with scalability in mind. While it is true that upgrading the hardware is almost always cheaper than upgrading the software, most managers who approve the purchase orders do not see things that way.4

But, I suspect this argument is going to go the way of optimizing C code with inlined ASM. As Moore's Law continues its inexorable march, higher and higher level tools that are more wasteful of CPU, RAM, and I/O will become the norm. And, there may be good reason to embrace this changing landscape. As software becomes more and more complex in its requirements, interfaces, and functioning, treating software as a brick wall to be built in components may be more cost-effective than squeezing an extra 30% from the CPU.

Footnotes:

  1. It is possible to simulate OO-like behavior with stored procedures. Some, like mpeppler, advocate this idea, and I cannot disagree with them. However, stored procedures, while within the database application, have very little to do with the core function of a database - data storage and retrieval. At best, they are syntactic sugar that can help maintain data integrity. But, that is a far cry from saying that databases have behaviors.
  2. This type of arrangement can potentially be slower, but I have also seen it be faster than the dozens of columns in one table. Plus, you get the added benefit of maintainability; working with a table that has 50+ columns is much harder than working with three tables with a total of fewer than 20 columns. The SQL is written in very few places and adding new attributes can be done on the fly. Storing the optional attributes as columns requires a database change for every new attribute.
  3. I have never used Class::DBI for anything that has been productionized. I have played with it, but would not consider myself an expert in its usage. I make this claim based on seeing the output from other tools, such as Hibernate (for Java), and comparing that with hand-crafted RDBMS accessor classes.
  4. And, they often have a point. The purchase cost of a server is often only 20%-30% of the actual cost of that server. There is the cost of the sysadmin to run the server, the licensing of the operating system, the power supply, rack space, licensing of any applications, backup space, and other considerations. Purchasing Oracle alone can cost 4x-10x the cost of the machine, especially since Oracle is almost always licensed on a per-CPU basis.

•Re: OO concepts and relational databases
by merlyn (Sage) on Aug 02, 2004 at 14:52 UTC
    Your Perl-Objects-mapped-to-attribute-table design has already been implemented as OOPS. I wrote about it recently.

    However, I think you underestimate the cost of decomposing table records into instance/attribute/value tuples. There's a reason that large databases have nice records... you tend to fetch the values in those records in clusters, and it makes sense for the database to be aware of that, instead of managing lots and lots of little individual values.

    Certainly, there's a programmer time savings to have a transparent Object backing store. But for a large heavy application, having a proper DBA design a proper RDBMS will still be a win with existing popular technology.

    There is research work going on about OO storage, but nothing that is well deployed or available, as far as I know. And it won't be about mapping that to an RDBMS. Proper paradigms are required.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.


    update: Ouch, and after reading another response in this thread, I realized I had completely forgotten about typing! A real DBA-designed record contains typing and constraints that ensure data integrity. An ad-hoc system that is merely providing backing store for Perl-side objects will not have any of that meta-information available, and must rely on the Perl side to manage any data integrity and typing rules.

      Out of curiosity, what did you think about the license of OOPS? While I can understand the author's intent, the license is a bit, er, cumbersome. Actually, let me be blunt: the license is bloody awful and there is simply no way I would ever use that software. For example, why are "Internet Journals" excluded from the license? Does that mean they don't have to abide by it or does it mean that the author doesn't want them using the code? Also, I can't use it if I've applied for a software patent, unless such application is a "defensive move." Well fine. I'll simply announce that all of my patents (if I had any) are "defensive moves."

      And while we're at it, what's this bit about "Entities that share a 5% or more common ownership interest with an excluded entity are also excluded from this license." So if I own a mutual fund and the fund manager temporarily bumps me over the threshold, I'm not allowed to use the code? Or does "excluded from the license" mean I don't have to abide by it?

      One of the core principles of economics (something that many economists seem to forget) is that there is a tradeoff between efficiency and fairness. Trying to go too far to either extreme tends to be a bad idea. In this case, while I laud the author's intent, the implementation of the license is terrible and, I'm sure, unenforceable. However, it's bizarre and restrictive enough that I simply can't use that code.

      Cheers,
      Ovid

      New address of my CGI Course.

        This bit:

        Should any part be deemed invalid or otherwise contradicted, the entire license is invalidated and no right to use, copy, modify or redistribute exists.

        Likely means that nobody has the right to do anything with OOPS software, since there are some pretty clear legal problems with this license.

        I also like the preamble:

        I wish to let people use this software for free as long as they don't abuse the free software community or Internet in particular ways.

        Both the Free Software movement (as defined by the FSF) and the Open Source movement (as defined by the OSI) would not like this license. Both groups explicitly forbid licenses from excluding specific groups, whereas this license spends sections 5-11 doing exactly that. The author seems to be giving lip service to Free Software while not understanding its fundamental points.

        ----
        send money to your kernel via the boot loader.. This and more wisdom available from Markov Hardburn.

        Personally, I find this part the most confusing/disturbing:

        IV. All useful modifications to this software must be licensed back to the original licensor such that the original licensor can relicense the modifications to third parties under terms of their own choosing with no restrictions whatsoever and no need to even acknowledge the origin of such modifications. There are two easy ways to accomplish this: put all such modifications into the public domain or assign the copyright to the original licensor.
        This is basically saying: if you do anything useful with his stuff, he wants it back, and wants to own it (at least the rights to it). Yet he is in no way offering to pay for (or even just give credit for) your time/effort. Why would I want to do that?

        From my (limited) understanding, putting something in the public domain is not the same as releasing it open source. When you release it open source (say, under the GPL), you are saying anyone can use it for free, just as long as I retain credit for it (after all, I did the work). But releasing into the public domain does not have the same "restrictions"; it's basically just releasing it for anyone to do anything with.

        I agree with you on this Ovid, I would be very hesitant to use this software, if at all.

        -stvn
        Out of curiosity, what did you think about the license of OOPS?

        It sucks. A demonstration of why lawyers are useful. Almost every clause wrong in so many interesting ways.

Re: OO concepts and relational databases
by stvn (Monsignor) on Aug 02, 2004 at 16:40 UTC

    Very nice Meditation! I very much agree with you on most points. It has long been discussed that there is a mismatch between Relational databases and OO programming. A quick aside to link to some of that information for those interested:

    I recently read an article on the Relational-Object impedance mismatch; it's pretty good, it talks about a lot of what you are saying, and also introduces the idea that part of this mismatch might be a cultural thing.

    This site as well has a lot of good information as well as links to even more good information on the topic. In particular, the section on "performance" has some interesting items about Denormalization (and links about it too).

    And there is always the c2.com wiki, here are some pages from that: Object Relational Psychological Mismatch and Are OO And Relational Orthogonal Discussion (I have not read all of these pages, my eyes glaze over after a while).

    And now back to your regularly scheduled meditation response :)

    1. Database decomposition is very different from class decomposition
    I can't argue with your first point, although I think that maybe what really needs to happen on both ends is a relaxing of the almost maniacal drive for "purity". I tend to agree with the author of the article mentioned above that some of this mismatch may be cultural, or at least that to solve it, the cultural aspects of it cannot be ignored.
    2. Well-written OO code is inefficient
    I cannot argue the details of this, but I do think that maybe it is not as true as it used to be. In the old days of LISP, the big argument was that LISP was slow since it was interpreted, so we should all use FORTRAN or COBOL or some other language-of-the-day. Over the years though LISP has gotten a lot faster, due not only to having a number of very smart people working on the problem over 30 or so years now, but also because compiler technology and optimization ideas have evolved with it. Take a look at Standard ML's compilation model to see where functional languages have gone.

    I think the same is happening with OO, although again, I cannot argue it in detail. But it seems to me that there is plenty of work on the ideas of improving the OO compilation model, and optimizing things like method dispatch, and such. After all OO is all about abstraction, so eventually the optimization of OO code will get abstracted away by the compiler.

    However, your meditation is not about OO compilation techniques :)

    Sounds simple, right? Well, each database hit might have actually been up to five (or more) queries.
    I would argue that if it takes up to 5 queries for each DB hit, you might want to rethink your approach to this.

    Personally for things like a Bank object, I like to use long-lived Singletons for this, which hold all but the more volatile data in their instance (Of course, this will not work in all scenarios, but for the sake of argument/example ...).

    I also have experimented with creating multiple "views" of Person-like objects, in which I retrieve certain sub-sets of the data. Obviously when I am looking for the checking account balance, I don't care much about personal information like the address of the Person. Of course this ends up being a compromise in the end, because I end up having to supply more information than normal about my "Person" needs. But, like I said, I am still playing with the idea, so we shall see.
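
    As a bare-bones sketch of the singleton idea (all names hypothetical): the static data is loaded once per process, while the volatile balance is always fetched fresh.

      package Bank;
      use strict;
      use warnings;

      my %instance;    # one long-lived object per bank id

      sub instance {
          my ($class, $dbh, $bank_id) = @_;
          return $instance{$bank_id} ||= do {
              # static data: loaded once, cached for the process lifetime
              my $row = $dbh->selectrow_hashref(
                  'SELECT * FROM bank WHERE id = ?', undef, $bank_id );
              bless { %$row, dbh => $dbh }, $class;
          };
      }

      sub balance_for {
          my ($self, $account_id) = @_;
          # volatile data: never cached, always one small query
          my ($balance) = $self->{dbh}->selectrow_array(
              'SELECT current_balance FROM account WHERE id = ?',
              undef, $account_id );
          return $balance;
      }

      1;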

    My point here is that if this is a non-trivial application, and you have chosen to go with an OO-Relational mapping paradigm, you would likely not implement it in a fashion that would be naive to your performance/scalability needs.

    Getting the most out of a database is HARD
    Of course, that's why you hire a DBA :-P

    Unless you're a math major, you may have never heard of set theory, let alone understand how it can affect you.
    Funny, I was a Fine-Arts Painting major (a.k.a. - studying to be a waiter), and I have heard of set-theory :).
    Quick aside:
    I have been told that this book: Naive Set Theory is a good introduction to it, although I have not yet read it myself. Any other set-theory book recommendations are appreciated.
    If you think about SQL as a way of using set theory, it's a lot easier to work with.
    This is very true, although it only works if you understand set-theory. I also think it helps to know about state machines and Finite State Automata to understand regular expressions as well. But I think for many people, knowing "theory" is more a barrier to entry than it is a helping hand.
    Why do I bring this up? Well, very few people think about objects as representing sets or set operations, even when they really are.
    In my (admittedly limited) experience, I have more often seen the relationship between object theory and set theory referred to when discussing inheritance, and other such "under-the-hood" things. So I am not sure exactly what you mean with this though.
    And, if you are looking at a database as an extension of your objects, you're going to mis-design your database schema.
    The inverse is true as well: if your objects are to be subservient to your data, then your class design will likely be bad. Again, it's about compromise IMO; you can't have the best of both, you need to pick and choose what best fits your application's requirements/needs.
    4. Classes are not tables and attributes are not columns
    Of course they aren't, if they were, things would be a whole lot easier, and you would have never needed to write this meditation. You present several good ideas here, the Entity attribute table in particular. I have seen similar things both praised and shunned by OO programmers and DBAs.

    In the design and development of a non-trivial application, it is naive to assume class-to-table and attribute-to-column relationships. I again will say that this comes down to compromise, and doing whats best for the problem at hand.

    But, I suspect this argument is going to go the way of optimizing C code with inlined ASM. As Moore's Law continues its inexorable march, higher and higher level tools that are more wasteful of CPU, RAM, and I/O will become the norm. And, there may be good reason to embrace this changing landscape. As software becomes more and more complex in its requirements, interfaces, and functioning, treating software as a brick wall to be built in components may be more cost-effective than squeezing an extra 30% from the CPU.
    I disagree with this. I think that saying this will, in the end, come down to an issue of optimization possibly oversimplifies the argument (and surely I too am simplifying your closing statement).

    I think that the problem with OO-Relational mapping (both the available tools and some of the current prevailing "wisdom" on the subject) is that they are trying to create a technological middle-man between two technologies/paradigms/ways-of-thinking that don't fit together easily. The logical and rational roads that lead to better, cleaner more elegant solutions in both tend to move in (somewhat) opposite directions from one another. To me this says that any and all attempts to keep this middle-man solution happy will require too much work to keep up with the changing landscapes on either side.

    IMO, what is required is not the middle-man approach, but instead a partial re-thinking of both. A compromise, which is not ad hoc, but a smart combining of the best of both worlds. Keep in mind that this is all still a relatively young field (as is computer programming in general), and we are just seeing the beginning of it now.

    Again, excellent meditation, one of my favorite topics of late. Thanks for the post.

    -stvn
      Off-topic response to your side-note.

      I have read Naive Set Theory, and may even have a copy somewhere. It is a good introduction to set theory for an advanced math major or a graduate student in math. It is only naive when compared to the treatment that a logician would give - it is basically "what mathematicians need to know about set theory in a nutshell". If you have a burning desire to understand how you get from the Axiom of Choice to Zorn's lemma to the Well-Ordering Principle and then back to the Axiom of Choice (thereby proving that all 3 are equivalent claims), this is your book. Otherwise I wouldn't recommend it.

      If you're just interested in getting a sense of what higher mathematics is like, I would highly recommend something like The Mathematical Experience instead.

      I have been told that this book: Naive Set Theory is a good introduction to it, although I have not yet read it myself. Any other set-theory book recommendations are appreciated.

      E. Kamke's Theory of Sets. There are other introductory set theory texts in the Dover catalog, as well.

      qq

Re: OO concepts and relational databases
by Solo (Deacon) on Aug 02, 2004 at 15:13 UTC
    A table cross-referencing each row from the class table with rows from the attributes table. This kind of table would contain three columns:
    Entity, Attribute, Value

    I've used this approach myself, but it comes with some drawbacks.

    By coercing all values into a single column, you lose datatype information. If the optional data (or sparse data, let's say, since optional data that is dense is probably better stored directly in the table anyway) is all of one type this isn't really a problem. But if numerics and dates are sprinkled in, the design gives up the ability to easily construct SQL statements involving conditions on those fields. Views can help with this to an extent.

    This structure is not convenient for storing both attributes that may have one value and attributes that may have many. You must choose between a unique key on the (entity, attribute) pair or no key at all. An additional column for attribute 'order' would allow a unique key in these circumstances, but is unnecessary for single-valued attributes. A bit of a data-integrity headache. I used two tables, one for sparse single-valued and one for sparse multiple-valued attributes; see the sketch below.
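
    A sketch of that split (names invented), plus a view of the kind mentioned above to recover a datatype:

      -- sparse single-valued attributes: the pair itself is the key
      CREATE TABLE attr_single (
          entity    INTEGER NOT NULL,
          attribute INTEGER NOT NULL,
          value     TEXT,
          PRIMARY KEY (entity, attribute)
      );

      -- sparse multiple-valued attributes: an extra position column
      -- is needed before the key can be unique
      CREATE TABLE attr_multi (
          entity    INTEGER NOT NULL,
          attribute INTEGER NOT NULL,
          position  INTEGER NOT NULL,
          value     TEXT,
          PRIMARY KEY (entity, attribute, position)
      );

      -- a view can restore type information for one known attribute
      CREATE VIEW person_age AS
          SELECT entity, CAST(value AS INTEGER) AS age
            FROM attr_single
           WHERE attribute = 7;    -- hypothetical id of the 'age' attribute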

    I don't have a ton of experience across opensource RDBMS, but what little I have tells me join speeds vary greatly. YMMV.

    ...adding new attributes can be done on the fly. Storing the optional attributes as columns requires a database change for every new attribute.

    Indeed, this is really the only way I've seen 'End-User' specified attributes handled in RDBMS without allowing schema changes. If anyone has seen different implementations, I'd be interested in hearing about or seeing them.

    --Solo
    --
    You said you wanted to be around when I made a mistake; well, this could be it, sweetheart.
Re: OO concepts and relational databases
by diotalevi (Canon) on Aug 02, 2004 at 16:08 UTC

    There's a separate idea dealing with object identity. In the database I've stored a fact( name=diotalevi, homepage=http://grenekatz.org ); in your OO world you may have multiple in-memory objects all dealing in the same fact. When you update an object representing a record, would that update all the other objects tied to that fact? Or what of transactions? Or what of comparing objects for identity? If I have two objects pointing to the same record, would I compare them for perl-object equality or record-equality? I would tend to think that things such as eval { $oa->{'homepage'} = '...'; ... die }; ... = $ob->{'homepage'} would properly note that updates to $oa only propagate to $ob if $oa's update transaction completed successfully. Or maybe $oa and $ob should be considered part of the same connection so uncommitted updates would be visible to both but not to a hypothetical $oc.

    This is something additional for you to think about anyway.
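
    One common answer to the identity half of this is an identity map: keep at most one live object per row, so perl-object equality and record-equality coincide. A minimal sketch (all names hypothetical; transaction visibility would still need per-connection bookkeeping):

      package Record;
      use strict;
      use warnings;

      my %live;    # identity map: "table:id" => the one live object

      sub fetch {
          my ($class, $dbh, $table, $id) = @_;
          return $live{"$table:$id"} ||= do {
              # don't interpolate untrusted table names in real code
              my $row = $dbh->selectrow_hashref(
                  "SELECT * FROM $table WHERE id = ?", undef, $id );
              bless { table => $table, data => $row }, $class;
          };
      }

      1;

    With this, $oa and $ob fetched for the same record are the same reference, so an update through one is trivially visible through the other.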

Re: OO concepts and relational databases
by jZed (Prior) on Aug 02, 2004 at 17:24 UTC
    You wrote:

    Entity Attribute Value

    While there are some cases in which this is a good idea, it can also lead to really confused things like this:

    Entity Attribute Value
    Elephant Color Gray
    Workstation Linux_distro Debian

    What happens to constraints and data types when each new row potentially introduces a new data type? What meaning does the concept "entity" have if there are no limits on the kinds of attributes it can take and the range and type of data the values for those attributes can have? I realize that you know what you're doing and probably only use this kind of a table in appropriate circumstances (and you did specify an example in which there were multiple optional attributes, which is not the same as the example I've shown), but this kind of design seems to be a favorite with beginners who don't realize that it can sabotage any notion of structure in the data.

    You wrote:

    Storing the optional attributes as columns requires a database change for every new attribute.

    The ability to change the structure of the database is an essential feature of relational databases. Sure, you want to avoid it when you can, but to label it a bad thing (tm) throws out the entire relational baby with the bathwater.

      The ability to change the structure of the database is an essential feature of relational databases.

      Very much so. And that is one (of several) good reasons why SQL should not be embedded in application programs. Leastwise, not object oriented ones. And application programmers should not be designing DB schemas.

      Applications (should) deal with application objects. Objects represent state, and flows of state, not data or relationships.

      That doesn't mean that RDBMSs can't be used to underpin application programs. It just means that there should be an interface layer between the application program and the SQL.

      The make-up of the schema (tables and columns) is driven by the relationships inherent in the data. Different applications have different requirements of each relation. Read the bank balance; credit it; debit it; erase the entire relationship.

      The most prevalent and effective mechanisms for these are writable views and stored procedures. Just as the Class decouples the user code from the internal representation, so views and/or stored procedures decouple the Class from the datastore schema.

      That allows the DBA to amend the schema to incorporate new requirements for new applications, and tune it for a broad range of existing applications. Each application can have different logical and performance requirements, as well as different privilege levels to any given table or column.

      The impedance between most development languages, procedural and object oriented, is at so many levels--security, data integrity, concurrency and latency--that using them to directly manipulate relations is like using oven mitts to do watchmaking.

      The only programming language I am aware of that really has the syntax and semantics for dealing with relational data natively is APL. Its set operators are very well matched to relational theory.

      SQL is not (in my book) a good interface between them. It is a handy language for relatively simple queries that interrogate the relationships and make set adjustments to them, but as an abstraction layer, it sucks.

      Think of the times that you knew what you wanted to do, knew that the schema was capable of doing it, but got hung up on trying to frame the requirement in a form that SQL would accept. The frequency with which you will see large volumes of data pulled from a DB only to be subsetted with the application code highlights one level of impedance.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: OO concepts and relational databases
by adrianh (Chancellor) on Aug 02, 2004 at 19:11 UTC
    But, as an implementation detail, it's ok to have instances of two different classes have some of the details of your checking account, if not the ability to modify it.

    I'm not sure that I'd agree with that myself. In my experience, duplicating application state in multiple classes leads to the same sort of problems that duplication in a DB causes.

    If I see duplicate state in a class hierarchy my first thought would be that there is a missing abstraction to be discovered, and my second thought would be that this would be a good area to look for bugs.

    Not that I disagree that "Database decomposition is very different from class decomposition", because it is ;-)

    Well-written OO code is inefficient

    I agree with what you're saying here, but it's nothing to do with well written OO code being inefficient. Quite the opposite.

    If an application needs to open four database connections for four queries, we have a bad design. Badly written OO code being inefficient isn't really surprising.

    (Interestingly enough I've seen similar problems from the opposite direction - with RDBMS people doing OO coding and creating lots of unnecessary "join" classes leading to hideously inefficient bloated code. Again this says nothing about the relative efficiency of OO or RDBMS models - just that they're different and you need to use the right one in the right place.)

    Getting the most out of a database is HARD

    I don't think it is actually. Don't scare people away!

    At least, it's not any harder than getting the most out of an OO decomposition, or getting the most out of a functional decomposition.

    It needs to be learnt, and more people need to learn it, but the basics are not that complicated.

    The hard bits about databases aren't really the general theory, but the nitty gritty specifics about individual implementations.

    Classes are not tables and attributes are not columns

    Amen. Twice. In bold. Underlined.

    Too many OO developers treat RDBMS as suboptimal object persistence layers, which is kind of like treating a train as a suboptimal car. True, but somewhat missing the point.

    Smaller applications and certain subsystems of larger applications which do not have a performance component can benefit greatly from the ease-of-development that CDBI provides. But, without a lot of customization, those kind of tools do not scale well.

    I'm not sure what you mean by "scale well" here. I've certainly used CDBI in large production systems without any problems.

    It's certainly not suitable to all systems (because it has that one-table-per-class concept built into its core), but where it is applicable it works well in my experience. Where it doesn't work well you can always hand roll your own, or use one of the more flexible OO-RDBMS mappers.

    However, stored procedures, while within the database application, have very little to do with the core function of a database - data storage and retrieval. At best, they are syntactic sugar that can help maintain data integrity. But, that is a far cry from saying that databases have behaviors.

    Maintaining data integrity can hardly be called "syntactic sugar"! (and what else are we to call stored procedures, triggers, etc. if not behaviour?)

      Maintaining data integrity can hardly be called "syntactic sugar"! (and what else are we to call stored procedures, triggers, etc. if not behaviour?)

      Data integrity is orthogonal to stored procedures and triggers. Data integrity is maintained by normalization and foreign keys1. Triggers and stored procedures are the ability to use a programming language that is supported by the database engine. They are not necessary for data storage or to maintain data integrity. In fact, the entire idea of stored procedures is marked as "implementation-specific" in the ANSI-92 standard.

      Now, triggers can be used to help maintain data integrity. The most obvious case is if you have a schema that has been deliberately denormalized for performance reasons. If you make a modification to table A, table B might need to be updated in an atomic fashion, to maintain the denormalization. However, that is outside the basic functions of an RDBMS.

      As for triggers and stored procedures being behaviors ... I still maintain they are not truly behaviors, in the OO sense. The table doesn't have a behavior that is associated with it. If anything, the behavior is associated with the action that triggers it, not the table the action was performed on.

      Another item to note is that certain things that require triggers and sequences in one RDBMS (Oracle) can be done with a set of keywords in another (MySQL). I'm speaking of the AUTO_INCREMENT keyword in MySQL that can only be duplicated by using a trigger and a sequence in Oracle. The keyword may be implemented as such under the hood, but it also may not. Regardless, it's still syntactic sugar.
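
      For the record, the two spellings look roughly like this (sketch only; table names invented):

        -- MySQL: one keyword
        CREATE TABLE widget (
            id   INT AUTO_INCREMENT PRIMARY KEY,
            name VARCHAR(64)
        );

        -- Oracle: a sequence plus a trigger to populate the key
        CREATE SEQUENCE widget_seq;

        CREATE OR REPLACE TRIGGER widget_id_trg
        BEFORE INSERT ON widget
        FOR EACH ROW
        BEGIN
            SELECT widget_seq.NEXTVAL INTO :NEW.id FROM dual;
        END;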

      1. Granted, most RDBMSes implement foreign keys as triggers, but that doesn't mean that the triggers are necessary, or even sufficient, for foreign keys. MySQL, for example, requires that the referenced column have an explicit index on it (in the InnoDB tabletype), while Oracle does not.

      ------
      We are the carpenters and bricklayers of the Information Age.

      Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

      I shouldn't have to say this, but any code, unless otherwise stated, is untested

        Data integrity is orthogonal to stored procedures and triggers

        Depends on how you define data integrity. If I have a banking application where it should be impossible to update the value of the balance column of the current_account table without there being an appropriate entry in the debit or credit tables I would call it a data integrity issue, and enforcing it with stored procedures and/or triggers would seem a sensible option.

        The table doesn't have a behavior that is associated with it. If anything, the behavior is associated with the action that triggers it, not the table the action was performed on.

        I think I'll just have to say to-may-to / to-mah-to on this one. As far as I'm concerned when I do something like:

        CREATE OR REPLACE TRIGGER enforce_business_rules
        AFTER UPDATE ON current_account
        ... etc ...

        I'm doing something to the current_account table. YMMV :-)
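
        Fleshed out, such a trigger might look something like this (an Oracle-flavoured sketch; the ledger table and columns are invented):

          CREATE OR REPLACE TRIGGER enforce_business_rules
          AFTER UPDATE OF balance ON current_account
          FOR EACH ROW
          DECLARE
              n NUMBER;
          BEGIN
              -- require a ledger entry matching the change in balance
              SELECT COUNT(*) INTO n
                FROM ledger
               WHERE account_id = :NEW.id
                 AND amount     = :NEW.balance - :OLD.balance;
              IF n = 0 THEN
                  RAISE_APPLICATION_ERROR(
                      -20001, 'balance changed without a ledger entry' );
              END IF;
          END;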

Re: OO concepts and relational databases
by dfaure (Chaplain) on Aug 02, 2004 at 18:00 UTC

    Getting back to the OO concepts (and btw answering the quiz I proposed here), you have 3 available ways of modeling your data relations:

    Aggregation: aka "is part of"
    The data are modeled as high-level objects containing lower-level sub-objects, on the divide-and-conquer principle.

    Association: aka "is linked to"
    This relation maintains the relationships between the different parts of the object model built from the data.

    Inheritance: aka "is a"
    The most famous concept of OO culture (perhaps because of its specificity), and certainly the most badly used. A very common mistake of OO beginners is believing that everything should inherit from something else, without first asking themselves whether this object really is a kind of that object or not.

    The only data relation really shared between the OO and relational data models is the association. The other two may be more or less emulated, with frustrating results and poor performance, as other monks have already pointed out in this thread.

    So, why not try an OODBMS instead of an RDBMS? In other terms, something more dedicated to the way data is described as objects. Although they are still less known and used than RDBMSes*, some engines are up and ready, waiting for you to try them, even with Perl (look for Pogo)!

    ____
    * Sorry to say it, but my signature's first favorite citation totally applies here... ,)

    ____
    HTH, Dominique
    My two favorites:
    If the only tool you have is a hammer, you will see every problem as a nail. --Abraham Maslow
    Bien faire, et le faire savoir...

      There are two reasons why I didn't address OODBMSes:
      1. MySQL, Oracle, SQLite, and PostgreSQL are all RDBMSes. As those are the most popular options mentioned here, I chose to discuss them. I actually haven't seen much, if any, discussion of OODBMSes on this (or any other) site I go to.
      2. I've never used one, so I don't have any experience to discuss them. I have, however, a lot of experience with Oracle and MySQL in a number of companies. :-)

      You may be perfectly correct in your assertion of OODBMSes. I don't know. I'd certainly be interested in learning more, though ...

      ------
      We are the carpenters and bricklayers of the Information Age.

      Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

      I shouldn't have to say this, but any code, unless otherwise stated, is untested

        There are two reasons why I didn't address OODBMSes

        And you're perfectly right. For now, OODBMSes are still more than ever small exotic things compared to mastodontic RDBMSes (commercial rules apply...).

        In fact, I have always found RDBMSes complex and difficult stuff, even a bit archaic (the gap from sequential-indexed files is not so wide).

        Here are some interesting references about OODBMSes:

        ____
        HTH, Dominique
        My two favorites:
        If the only tool you have is a hammer, you will see every problem as a nail. --Abraham Maslow
        Bien faire, et le faire savoir...

      The only real common data relation shared between OO and relationnal data model is the association. The two other one may be more or less emulated with frustrating results and poor performance as other monks have already pointed out in this thread.

      Relational models often include 'IS A' relationships for many of the same reasons OO designs do. Supposing you have a table Vehicle, you might need tables Truck and Motorcycle to hold information specific to only some of the rows in Vehicle. This is commonly modeled as a 1 to 0 or 1 relationship between the generic table and the specific table, but other implementations are possible. My results have never been frustrating, and performance is not generally a factor in this design choice.
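
      In SQL, the pattern looks roughly like this (a sketch using the example's names):

        CREATE TABLE vehicle (
            id   INTEGER PRIMARY KEY,
            vin  TEXT NOT NULL,
            type TEXT NOT NULL    -- 'truck', 'motorcycle', ...
        );

        -- each specific table is 1 to 0-or-1 with vehicle: its primary
        -- key is also a foreign key to the generic table
        CREATE TABLE truck (
            vehicle_id INTEGER PRIMARY KEY REFERENCES vehicle(id),
            payload_kg INTEGER
        );

        CREATE TABLE motorcycle (
            vehicle_id  INTEGER PRIMARY KEY REFERENCES vehicle(id),
            has_sidecar INTEGER
        );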

      --Solo
      --
      You said you wanted to be around when I made a mistake; well, this could be it, sweetheart.
        Supposing you have a table Vehicle, you might need tables Truck and Motorcycle to hold information specific to only some of the rows in Vehicle.

        You're describing here a design rule able to emulate inheritance, which is mostly a matter of methodology/DBA experience.

        With a (real) OODBMS, this kind of tricky relational modeling is not required.

        ____
        HTH, Dominique
        My two favorites:
        If the only tool you have is a hammer, you will see every problem as a nail. --Abraham Maslow
        Bien faire, et le faire savoir...

Re: OO concepts and relational databases
by ysth (Canon) on Aug 02, 2004 at 16:30 UTC
    How do I link those abstractions together with has-a relationships? (RDBMSes don't have is-a relationships.)
    But is-a can be thought of as just a special case of has-a. The database you are using right now uses is-a relationships extensively.
Re: OO concepts and relational databases
by BrowserUk (Patriarch) on Aug 04, 2004 at 00:30 UTC

    I've been following along on this thread and (mostly) reserving comment, but...

    What you are advocating, is to design your database schema to reflect the way your programs choose to model the underlying data.

    This is soo wrong.

    DB schemas should (can only) reflect the structure inherent in the data, not how one particular application chooses to model (some subset of) that data.

    What if a new application needs to model (a different subset of) that data in a different way?

    You then have two choices.

    1. Adapt the schema to the new application and re-write the first application to suit.

      This will inevitably leave the schema a compromise between the requirements of both applications, and throw away any DB (index) tuning that had been done for the first application.

      It can also lead to the situation where the needs of one application are irreconcilable with the other.

    2. Duplicate the data into two databases, each schema'd for the particular application.

      Hopefully this choice needs no annotation.

    The advantages of using stored procedures, writable views, and triggers to represent the business logic are:

    1. You can leave the database schema to the DBA. He can normalise the data and set up the referential integrity according to the structural needs of the data itself.

      He can then tune the DB (indexes etc.) over time in accordance with the actual loads placed upon the DB in use.

    2. The applications get access to their preferred model of the data through stored procedures and views.

      This places most of the heavy hitting directly in the DB itself, where it is best handled and best tuned.

      1. Each application can have what ever view of the data it requires.
      2. Multiple applications can share the same data, through different views, without compromising the integrity or performance of the underlying schema.
      3. The applications access to the data is controlled at the business level. Changes in business rules can be tested, proven and instigated at the DB level and (can be) invisible to the application code.
      4. As applications evolve, their view of the data, and the makeup of their subset of the total dataset, can change and be accommodated without necessarily modifying the underlying schema.
      5. New fields and relationships can be added to the schema transparently to existing applications.

        Achieving this level of isolation is hard.

        That's why schema design lies in the province of the specialist DBA, not the generalist application programmer.

        It's also why good DBAs are worth their weight in gold (and usually know it!), and bad ones can bring whole departments and companies to their knees.

    3. The business retains control over their data.

      The data for individual applications all lives in one large business database rather than lots of little databases.

      Even if those small databases all live within the same RDBMS, writing new applications that draw together relationships across the boundaries of those small DBs becomes impossible.

      Attempting to correct this by making applications use one or more tables within a single DB results in duplication of data, and/or compromising one application's needs for another.

    The solution, invented long ago by someone far cleverer than myself, is writable views and stored procedures (and triggers to ease the use of SPs).
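
    A tiny sketch of the idea (names invented): the application sees its preferred model, and the DBA stays free to rearrange what lies beneath it.

      -- the application's view of "a checking account"
      CREATE VIEW checking_summary AS
          SELECT c.name    AS customer
               , a.id      AS account_id
               , a.balance AS balance
            FROM customer c
            JOIN account  a ON a.customer_id = c.id
           WHERE a.type = 'checking';

      -- the tables behind the view can be renormalized or retuned
      -- without the application ever noticing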

    The only cogent argument I've seen against the use of SPs and views to implement business rules and provide the isolation layer between applications and the schema is the difficulty that arises if the RDBMS vendor is changed.

    This is only a major problem if:

    • The new RDBMS doesn't provide for the same functionality as the old.

      Moving from PgSQL to MySQL would be a problem currently.

    • If too great a reliance is made of the extended features provided by one RDBMS.

      Avoiding 'lock-in' by avoiding 'extended features' is a problem in all fields.

      Weighing the decision to use such facilities and risk lock-in is a difficult compromise to make and depends entirely upon the long term benefits of using such facilities.

      As much as any other reason, that is why such extended features are evolved & marketed, and why it takes so long for such extensions to be unified into standards.

    The truth is that changing DB vendor is always going to be traumatic if the facilities of the RDBMS are being used in a cross-application, joined-up way.

    There are several companies out there that will try to sell you on gaining independence from your RDBMS vendor by selling you a middleware layer in which to encapsulate your business rules.

    What they don't tell you is that the independence comes at the cost of restricting you to the lowest common denominator across all the RDBMSs that they support. Just as with all cross-platform libraries.

    And, you just moved your lock-in from the RDBMS vendor to the middleware vendor. In practice, the RDBMS vendors are generally better.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
      I believe that we are talking in the same direction, but in different ways. My OP was discussing the fact that many OO developers choose to use tools that design and use schemas that are very inefficient.

      I am saying the same things you are:

      • Schema design should be left in the hands of those who know how to do it
      • A well-designed schema is so because it's focused on how the data relates to itself, not how any given application might use the data
      • Access to a well-designed schema needs to be controlled, through any number of mechanisms. These might include:
        • Views
        • Stored Procedures / Triggers
        • Some middle-tier in code, such as XML-RPC and/or other web services
      • Updates to that schema should be controlled by those who designed it

      Now, I don't go as far as to say that one method is preferable to another. The primary goal is to ensure the integrity, structure, and safety of the data. How that happens can almost be viewed as a matter of personal taste.

      ------
      We are the carpenters and bricklayers of the Information Age.

      Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

      I shouldn't have to say this, but any code, unless otherwise stated, is untested

        I apologise. I lost sight of the purpose and content of your OP as the thread progressed over time.

        There are one or two sub-threads where you are counter-arguing assertions made by others, in which you appeared to be arguing the case for things like Class::DBI and Class::Tables etc., and condemning the use of SPs, triggers, and views.

        Re-reading the entire thread in sequence and context, I now realise that whilst you did argue the case against the latter, it was not in support of the former.

        My accusation was unfounded, based on my memory of what I thought you had said over several posts over several days, instead of what you actually said and the context in which you said it.

        For allowing my own prejudices on the subject to cause me to confuse the leaves of the thread for the trunk of the argument: again, I apologise.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: OO concepts and relational databases
by water (Deacon) on Aug 09, 2004 at 00:30 UTC
    A table is not a class, a row is not an instance, and a column is not an attribute.

    True.

    We've begun using Class::DBI as a layer over the database, but clearly the CDBI object is a "storage object", not an "idealized object" (I just made up those labels).

    Say we have a Zoo, full of Exhibits and Animals and Vets and so forth. We tend to divvy up our tables into logical groups across multiple databases (ease of backup, replication, etc), vs. having every table in the same database. So we might have tables of

    Animal.Animal, Animal.Medical, Animal.Feeds, Animal.MammalDetail
    And in CDBI, the objects corresponding to the tables would be  Animal::Animal, Animal::Medical, Animal::Feeds, Animal::MammalDetail, etc.

    So we might have a class Animal (subclassed into Mammal, Reptile whatever), which represents the OO concept of an animal, with corresponding data and methods

      $a = Animal->new(%attribs);
      $a->feed(%feedinfo);
      $a->medical_checkup(%careinfo);
    And there's also the "storage object" class  Animal::Animal, which is just a nice CDBI wrapper around the table Animal.Animal.

    We don't force methods onto the "storage objects" (except database-esque methods), and we allow the "idealized objects" to create, hold, and use "storage objects" internally to manage their persistence and interaction with the greater system.

    So yes, a class isn't a table and a row isn't an instance, but in CDBI-world they are, which is useful and fine. Just don't try to cram your application-logic into that tiny space, for it is too small and won't fit. Let CDBI or whatever encapsulate the database (highly recommended), then use that encapsulation within a higher "idealized object" that models something meaningful.
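
    A bare-bones sketch of that split (the Animal::Animal columns and accessors here are hypothetical):

      package Zoo::Animal;              # the "idealized object"
      use strict;
      use warnings;
      use Animal::Animal;               # the CDBI "storage object"

      sub new {
          my ($class, %attribs) = @_;
          my $self = bless {}, $class;
          # hold the storage object internally; it handles persistence only
          $self->{row} = Animal::Animal->create( \%attribs );
          return $self;
      }

      sub feed {
          my ($self, %feedinfo) = @_;
          # application logic lives here, not in the storage object
          $self->{row}->last_fed( $feedinfo{time} );
          $self->{row}->update;
      }

      1;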

    Perhaps this ramble makes sense to someone else besides myself... Great thread, dragonchild. Thanks.

    water

Re: OO concepts and relational databases
by exussum0 (Vicar) on Aug 03, 2004 at 11:02 UTC
    How do I link those abstractions together with has-a relationships? (RDBMSes don't have is-a relationships.)
    That's what boolean and number types are useful for.

    Simple example: a boolean column male on a person. male = false means it's a woman. It IS an is-a type relationship. I could create Male and Female abstract types with hard-coded gender() methods, but that's a little inefficient.

    Furthermore, you can do the same with numbers. For instance, when you have an address-type table, if you normalized the country out to another table and used ids, you have an is-a type relationship that would otherwise be accomplished by creating hundreds of tag classes: package America, package Canada, package Barbados, package St Vincent. But that's just OOP hell. It's a little outlandish, and saying an address is of this country's type is a poorer representation than saying an address has-a country.

    The latter example isn't the BEST one. Let's give another. Take books. They are essentially all of the same ilk. Some are text, some are novels, some are reference. I doubt someone is going to create reference, text, and novel supertypes when a field describing it is more direct.
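
    For instance (a sketch, columns invented), one type field instead of three subtype tables:

      CREATE TABLE book (
          id    INTEGER PRIMARY KEY,
          title TEXT NOT NULL,
          kind  TEXT NOT NULL CHECK (kind IN ('text', 'novel', 'reference'))
      );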

    Classes are not tables and attributes are not columns
    In other words, there is (almost) never a 1-1 relationship between them.
    Never use never. Someone will have an example where a class->table relationship works well for them. It may not be complex, but that doesn't prevent it from existing. :)

    Bart: God, Schmod. I want my monkey-man.

      You're saying that columns can have an is-a relationship with abstract types. That's equivalent, in my mind, to saying that the attributes of an instance have an is-a relationship with the native datatypes of the language I'm working in.

      IS-A and HAS-A are more often discussed in reference to the object, not the attributes. That would, in my mind, correspond to the row in the table or the table definition itself. So, tables don't have IS-A relationships. Rows in tables don't have IS-A relationships. A given row in a table has a HAS-A relationship with every row that it is the N side of a 1:N relationship. (The confusion might be arising in that OO theory gives the container the HAS-A control while DB theory gives the contained the HAS-A control.)

      That said, you can approximate an IS-A relationship between tables, but it's cumbersome and prone to mistakes. (It's kinda like doing OO in AppleBASIC or Fortran.)

      ------
      We are the carpenters and bricklayers of the Information Age.

      Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

      I shouldn't have to say this, but any code, unless otherwise stated, is untested

        Data is data in the long run. And you are right, "That said, you can approximate an IS-A relationship between table..." It's just a perspective that can hold true. Not that anyone should design with it in mind.

        Long run: given a row, is-a relationships can be derived by single tables sometimes. :)

        Bart: God, Schmod. I want my monkey-man.

        That said, you can approximate an IS-A relationship between tables, but it's cumbersome and prone to mistakes.

        This statement seems contradictory to your idea about a separate table for optional attributes. That is merely a further generalization of the approximate IS-A relationship, and is equally cumbersome and prone to mistakes, if not moreso.

        How is it more cumbersome to have the tables

        Animal
            .id
            .name
            .type
        Mammal
            .animal_id
            .hair_color
        Reptile
            .animal_id
            .num_of_limbs

        Than to have the tables you are proposing?

        Animal
            .id
            .name
            .type
        AnimalAttribute
            .animal_id
            .attribute_id
            .value
        Attribute
            .id
            .name

        --Solo
        --
        You said you wanted to be around when I made a mistake; well, this could be it, sweetheart.