http://www.perlmonks.org?node_id=379296

You're working on a non-trivial application and you have a set of classes that correspond to how your application moves through its data. Since most non-trivial applications need some sort of permanent datastore, you need to put those data structures somewhere. You've been using Storable, but that isn't searchable on disk. Everyone's been talking about these database things, so you decide to throw MySQL / PostgreSQL / SQLite / whatever onto your box and use it to reduce your RAM requirements. You map each class to a table, map each attribute to a column in that table, and represent each has-a relationship with a foreign key. You add a bunch of indices and create a lazy-accessor mechanism to reduce both your RAM requirements and your requests to disk. You've done everything correctly, right?

Well, you might have. If everything works for you, then there's no problem. But, if you need more performance out of your existing system(s), I have a few tips for you.

  1. Database decomposition is very different from class decomposition

    When you decompose your domain into a set of classes, you are looking for both behavior and data. There's a tradeoff between the two, and the most common compromise I have seen is to fudge a little on decomposing for data when doing so makes behavior easier to work with. Since objects are really little bundles of behavior, you want to make sure that behaviors are orthogonal across your classes. It makes no sense for instances of two different classes to be able to modify the balance of your checking account. But, as an implementation detail, it's fine for instances of two different classes to hold some of the details of your checking account, so long as they don't have the ability to modify it.

    Databases, especially RDBMSes, are different. A database does not contain both behavior and data. A database is solely data storage with a number of tricks to get exactly the data you want very very quickly. It doesn't have behaviors.1 When you decompose your domain into a set of tables and columns, you are looking for three things:

    1. What are the abstractions of my data?
    2. How can I keep from repeating information?
    3. How do I link those abstractions together with has-a relationships? (RDBMSes don't have is-a relationships.)

    These three concepts are known as normalization. There are hundreds of articles on the web describing the myriad ways to achieve it. Basically, you need to represent each piece of information exactly once, and you need to be able to look each piece of information up quickly using a key.
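
    For instance, here is a minimal sketch of what that looks like in practice, using DBI with DBD::SQLite. The table and column names are invented for this example:

      use DBI;

      my $dbh = DBI->connect( 'dbi:SQLite:dbname=example.db', '', '',
          { RaiseError => 1 } );

      # Unnormalized, the bank's name and address would be repeated on
      # every single account row:
      #
      #   CREATE TABLE account (id, balance, bank_name, bank_address)
      #
      # Normalized, each bank is stored exactly once, and every account
      # reaches its bank through a foreign key.
      $dbh->do(q{
          CREATE TABLE bank (
              id      INTEGER PRIMARY KEY,
              name    TEXT NOT NULL,
              address TEXT NOT NULL
          )
      });

      $dbh->do(q{
          CREATE TABLE account (
              id      INTEGER PRIMARY KEY,
              bank_id INTEGER NOT NULL REFERENCES bank(id),
              balance NUMERIC NOT NULL
          )
      });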

    The reasons you want your schema to be at least mostly normalized are:

    • the RDBMS engine can usually find your data faster
    • your data is much less likely to be corrupted
    • it is much easier for someone else to come in and understand what's going on

    As with most things, you can overdo it. Plus, there are times you will want to deliberately de-normalize your schema. I currently do this with a reporting database, to minimize the number of table-joins I have to do, given that my data is static while the reports are running.
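
    As a sketch of that kind of deliberate de-normalization (assuming the invented bank/account tables from the earlier example), you might rebuild a join-free reporting table between report runs:

      use DBI;
      my $dbh = DBI->connect( 'dbi:SQLite:dbname=example.db', '', '',
          { RaiseError => 1 } );

      # Rebuild a denormalized reporting table from the normalized schema.
      # Safe here only because the data is static while the reports run.
      $dbh->do('DROP TABLE IF EXISTS report_account');
      $dbh->do(q{
          CREATE TABLE report_account AS
          SELECT account.id      AS account_id
                ,account.balance AS balance
                ,bank.name       AS bank_name
                ,bank.address    AS bank_address
            FROM account
            JOIN bank ON bank.id = account.bank_id
      });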

  2. Well-written OO code is inefficient

    It is commonly accepted that the number of function calls in OO code comes with a certain amount of overhead. Normally, that overhead is acceptable, given the gains in maintainability and programmer productivity. However, unless a lot of care is taken, you can easily overwhelm your database with a lot of small queries, most of which are hidden under layers upon layers of classes.

    A simple example would be asking an OO hierarchy how much is in your checking account.

    1. An instance of Person is created (database hit)
    2. An instance of Bank is created (database hit)
    3. An instance of Account is created (database hit)
    4. A query is made from Person to Bank to Account for current_balance

    Sounds simple, right? Well, each database hit might actually have been up to five (or more) queries. Each one is small, less than a hundredth of a second on average. But, just to find out how much is in your checking account, you used about a tenth of a second to execute your queries. That doesn't sound like much until you realize that there were up to three connections to the database. On Oracle, that's around 1.2M of RAM taken up. MySQL is better, but it's still almost 200K of RAM just for the right to talk to the database. This amount doesn't count the actual RAM used for the resultsets of those queries. (One application I've worked with created a new connection to an Oracle database for every single query, routinely having over a thousand connections to the server at a time. It didn't always close those connections when the query was done. Talk about a memory-leak-at-a-distance!)

    Furthermore, each query has to be composed by the application, sent to the database, parsed, optimized, then executed by the database. After all that, the data has to be sent back to the application which parses the resultset and builds and populates the appropriate object. Depending on the load of the various machines involved and the network(s) in between, that query that took a hundredth of a second to execute might have taken upwards of a tenth of a second from composition to final object population. So, your roughly ten queries can take up to a second to fully process through.

    You don't notice this total time because this isn't happening in serial - it's happening in parallel. (Hence, the three connections to the database instead of just one.) But, instead of one simple query taking a hundredth of a second, you have had to build all the scaffolding of objects, taking over a second of total processing power, just to ask one question. That 100:1 (and worse) ratio is something I've seen when using those OO/SQL tools, and the major reason why many developers refuse to use them.
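
    To make the contrast concrete, here is a sketch of asking the checking-account question in one round trip instead of three object loads. The person/account/bank schema and column names are invented for this example:

      use DBI;
      my $dbh = DBI->connect( 'dbi:SQLite:dbname=example.db', '', '',
          { RaiseError => 1 } );

      my $person_id = 42;    # whichever person we are asking about

      # One round trip instead of three object loads (Person, Bank,
      # Account), each of which may hide several queries of its own.
      my ($balance) = $dbh->selectrow_array( q{
          SELECT account.balance
            FROM person
            JOIN account ON account.person_id = person.id
            JOIN bank    ON bank.id           = account.bank_id
           WHERE person.id = ?
      }, undef, $person_id );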

  3. Getting the most out of a database is HARD

    RDBMSes are designed around the idea of a set of data. Unless you're a math major, you may never have heard of set theory, let alone understood how it can affect you. And, because "Set Theory" is a scary phrase, no one ever tells you what this means.

    Basically, a set is a group of stuff. You can add stuff to the set, remove stuff, and ask "Is this stuff part of the set?". You can also, given more than one set, ask for everything in either set (union), everything in both sets (intersection), or everything in set 1 that isn't in set 2 (usually implemented by some MINUS operator).

    All SQL does is provide you with a way of creating a set of data, then restricting it on certain criteria. If you think about SQL as a way of using set theory, it's a lot easier to work with.
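
    For example, the three set operations map directly onto SQL. The table names here are invented, and EXCEPT is the SQLite/PostgreSQL spelling of what Oracle calls MINUS:

      use DBI;
      my $dbh = DBI->connect( 'dbi:SQLite:dbname=example.db', '', '',
          { RaiseError => 1 } );

      # Everything in either set: people with a checking OR savings account.
      my $union = $dbh->selectcol_arrayref(q{
          SELECT person_id FROM checking_account
          UNION
          SELECT person_id FROM savings_account
      });

      # Everything in both sets: people with checking AND savings.
      my $intersection = $dbh->selectcol_arrayref(q{
          SELECT person_id FROM checking_account
          INTERSECT
          SELECT person_id FROM savings_account
      });

      # Set 1 minus set 2: checking customers with no savings account.
      my $difference = $dbh->selectcol_arrayref(q{
          SELECT person_id FROM checking_account
          EXCEPT
          SELECT person_id FROM savings_account
      });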

    Why do I bring this up? Well, very few people think about objects as representing sets or set operations, even when they really are. And, if you are looking at a database as an extension of your objects, you're going to mis-design your database schema.

  4. Classes are not tables and attributes are not columns

    In other words, there is (almost) never a 1-1 relationship between them. A very good example of this would be a class that has a large number of optionally-set attributes. A naive implementation of this situation would be to create a table for the class and have potentially dozens of columns, most of which are set to NULL for any given row. A better implementation is to split the class across three tables.2

    • A table for the class and the required attributes
    • A table listing the optional attributes
    • A table cross-referencing each row from the class table with rows from the attributes table. This kind of table would contain three columns: Entity, Attribute, and Value. (A sketch of the layout follows this list.)
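
    A minimal sketch of that three-table layout might look like this, with names invented to match the query below:

      use DBI;
      my $dbh = DBI->connect( 'dbi:SQLite:dbname=example.db', '', '',
          { RaiseError => 1 } );

      # The class and its required attributes.
      $dbh->do(q{
          CREATE TABLE entity (
              id   INTEGER PRIMARY KEY,
              name TEXT NOT NULL      -- a required attribute
          )
      });

      # The catalog of optional attributes.
      $dbh->do(q{
          CREATE TABLE attribute (
              id   INTEGER PRIMARY KEY,
              name TEXT NOT NULL
          )
      });

      # The cross-reference: one row per (entity, attribute, value) triple.
      $dbh->do(q{
          CREATE TABLE cross_ref (
              entity    INTEGER NOT NULL REFERENCES entity(id),
              attribute INTEGER NOT NULL REFERENCES attribute(id),
              value     TEXT,
              PRIMARY KEY (entity, attribute)
          )
      });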

    You would use two queries to populate an instance of this class.

    1. The query to the class table, populating the required attributes.
    2. The query to the cross-reference table, populating the optional attributes. This query would look something like:
      SELECT attribute.name
            ,cross_ref.value
        FROM attribute
            ,cross_ref
       WHERE cross_ref.entity    = ?
         AND cross_ref.attribute = attribute.id
           ;
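
    Putting the two queries together, a hand-rolled population routine might look something like the sketch below. A plain hashref stands in for whatever object system you actually use, and the entity/attribute/cross_ref names match the sketch above:

      use DBI;
      my $dbh = DBI->connect( 'dbi:SQLite:dbname=example.db', '', '',
          { RaiseError => 1 } );

      my $entity_id = 42;    # whichever entity we are loading

      # Query 1: the class table, for the required attributes.
      my $self = $dbh->selectrow_hashref(
          'SELECT id, name FROM entity WHERE id = ?',
          undef, $entity_id,
      );

      # Query 2: the cross-reference table, for the optional attributes.
      my $sth = $dbh->prepare(q{
          SELECT attribute.name
                ,cross_ref.value
            FROM attribute
                ,cross_ref
           WHERE cross_ref.entity    = ?
             AND cross_ref.attribute = attribute.id
      });
      $sth->execute($entity_id);
      while ( my ($name, $value) = $sth->fetchrow_array ) {
          $self->{$name} = $value;
      }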

This is not to say that mapping classes to tables cannot be done or should never be done. Class::DBI and similar distributions provide a very good solution to a niche problem. Smaller applications, and those subsystems of larger applications which do not have a performance component, can benefit greatly from the ease-of-development that CDBI provides. But, without a lot of customization, those kinds of tools do not scale well.3 And if you're going to provide CDBI with that kind of customization, I don't see why you wouldn't take the time to design your schema with scalability in mind. While it is true that upgrading the hardware is almost always cheaper than upgrading the software, most managers who approve the purchase orders do not see things that way.4

But, I suspect this argument is going to go the way of optimizing C code with inlined ASM. As Moore's Law continues its inexorable march, higher and higher level tools that are more wasteful of CPU, RAM, and I/O will become the norm. And, there may be good reason to embrace this changing landscape. As software becomes more and more complex in its requirements, interfaces, and functioning, treating software as a brick wall to be built in components may be more cost-effective than squeezing an extra 30% from the CPU.

Footnotes:

  1. It is possible to simulate OO-like behavior with stored procedures. Some, like mpeppler, advocate this idea, and I cannot disagree with them. However, stored procedures, while they live within the database application, have very little to do with the core function of a database - data storage and retrieval. At best, they are syntactic sugar that can help maintain data integrity. But, that is a far cry from saying that databases have behaviors.
  2. This type of arrangement can potentially be slower, but I have also seen it be faster than the dozens of columns in one table. Plus, you get the added benefit of maintainability; working with a table that has 50+ columns is much harder than working with three tables with a total of fewer than 20 columns. The SQL is written in very few places, and new attributes can be added on the fly. Storing the optional attributes as columns requires a database change for every new attribute.
  3. I have never used Class::DBI for anything that has been productionized. I have played with it, but would not consider myself an expert in its usage. I make this claim based on seeing the output from other tools, such as Hibernate (for Java), and comparing that with hand-crafted RDBMS accessor classes.
  4. And, they often have a point. The purchase price of a server is often only 20%-30% of the actual cost of that server. There is the cost of the sysadmin to run the server, the licensing of the operating system, the power supply, rack space, licensing of any applications, backup space, and other considerations. Purchasing Oracle alone can cost 4x-10x the cost of the machine, especially since Oracle is almost always licensed on a per-CPU basis.