|Perl: the Markov chain saw|
Database programming blues
Dear fellow monks,
Every now and then, somebody asks for advice about a problem involving database programming. Sometimes, it is a clean cut question, with the right balance between programming language (Perl) and database issues and a clearly defined example. In these cases, the question gets answered, and everybody is back home happy.
Sometimes, however, the poster doesn't have the foggiest idea of what is wrong, and submits an example with a mix of basic language, DBI, CGI and SQL problems. Then, many people answer these questions, giving specific advice on the part they found wrong, and the basic problem stays unsolved. Not only, when the right answer comes (the one that identifies the inner problem and gives the soundest piece of advice) it may happen that it is lost into a sea of comments that advise on how to correct some macroscopic but unrelated stylistic blunders.
In such cases, I realize that the basic problem afflicting the poster - and some of the willing helpers - is a misunderstanding of database principles.
This article is for both of them, for the ones who want to ask for advice and the ones who are offering their help. An outline of database programming, with particular emphasis on Perl programming, is necessary for both asking the right question and recognizing the right answer.
This article is neither a database nor a Perl tutorial. It's just a categorization of the problems we may face when developing database applications. Although some thoughts can be generalized to any high level language, most of this meditation is specific to Perl.
Before you go on
I assume that you know at least what the DBI is. If you don't, please have a look at the tutorials on database programming, whose contents I don't want to duplicate here.
The soundest approach to programming I have ever come across was the one that explained computer programs as being composed of three parts:
The best organization we could have in our programs is to keep those three elements separated as much as we can. This way, if our input device (or protocol, interface, or whatever) changes, we have to modify only one element of the program. Moreover, if something is wrong, we can more easily identify which piece is faulty.
We can follow this approach no matter which technology we are using. In either structural or Object Oriented programming, there are resources and idioms to isolate those elements.
When dealing with databases, each of these parts could be replaced by a database interaction. Sometimes even all three. Here's an example. The first input is from a database table with the official holidays for the organization. If today's holiday, then the second input is the list of employees working today, for whom we'll calculate the appropriate wages, depending of their job and seniority (which we also get from the database). The program's output is a database update of their payroll records.
The above example is not a standalone program, but it's part of a more complex application to deal with human resources management. Depending on the business rules that small program could be just a periodic job with no human intervention, or it could be part of a stricter work-flow, where each input and output requires a supervisor's approval. Either way, having the three component separated will help to integrate this program into the rest of the application.
Source code intricacy
One additional burden is due to different source codes that you might need to use at once for your program. A typical case is a Perl script getting information from a database using SQL queries, and sending output to a web server, thus generating markup language such as HTML or XML.
There are several ways to avoid cluttering your script with many types of code at once. Query generators can hide the SQL code. Templates and wrappers can generate markup language for you. In these cases, you trade code simplicity for debugging complexity.
This problem is especially felt when you are creating SQL statements through some wrappers that replace the SQL code with some other "simpler" instructions. The more you trust a tool (especially if you don't fully understand what it is doing), the more you are exposed to side effects that you might be unable to investigate properly when hell breaks loose.
About wrappers, there are two articles in www.perl.com, the first advocating the the virtues of a wrapper to simplify your coding even at the price of a huge efficiency hit; the other, brilliantly expressed by chromatic, shows how the DBI, assisted by some useful Perl idiom, can deal with any database problem.
In this subject, I am biased toward using the most powerful tool (the DBI), even at the price of some additional typing. It is true that, when you are a beginner, you feel intimidated by the DBI, and you'd like some simplifying solution that will save you some typing and will hide those hideous database details. However, you don't want to be a beginner forever. Or do you?
Some reason for complexity in database programmingThe different languages in your script, which we have seen above, are only one symptom of the complexity involving database programming. The real complexity lies in some often overlooked properties of databases.
Modern databases work with some sort of client/server protocol. The Perl DBI accepts this behaviour as a common ground for all databases and builds all its structure around such concept.
It is important to grasp the general idea of client/server - and even better would be to understand how your specific database is implementing it.
A database server is a program waiting for a request, which could arrive through the network, via an internal socket, via a file or whatever. The basic principle is its waiting. Then the client sends a request.
The DBI supports the idea of a request being prepared, i.e. analyzed for correctness and possibly optimized, then reused several times with either the same or different parameters. Anyway, if your database doesn't support the preparation, it is conveniently simulated by the DBI.
After a request has been successfully prepared, the client can ask for its execution. Notice that nothing is assumed by the database server about your intentions. Whatever you want, it's your call. It means that the server doesn't go on unless you ask.
So you have asked the server to execute the request. What does that mean? It means that the server will execute your query, and save the results into a temporary buffer of memory, waiting for more requests.
Yes. This is the part that's difficult to come to terms with if you are used to desktop database applications, where your request is the last word, meaning that you have triggered an irrevocable chain of events that will result into a list of records on your screen. A client/server database will wait for your request before doing anything, and after each step it will wait for your next request.
So, you have asked for the query to be executed. The server complies, and now it's waiting for your next wish. For how long it's going to wait? It depends on the database, but it's a safe assumption that its patience will not exceed the life of your connection thread. When you disconnect, or your thread is timed out due to your inactivity, all the temporary space allocated to your request is freed.
It's time for you to get the results from the server. You may ask to have all your results at once or one by one, depending on your estimate of their size and on the availability of such services in your database. The methods $sth->fetchrow_array, $sth->fetchrow_arrayref and $sth->fetchrow_hashref will get you the next available record from your temporary buffer in the server. When the last record has being claimed (or when you issue a $sth->finish method) the server considers your request satisfied and frees the memory occupied by your records. It will still retain information about your query preparation, until you disconnect.
You should bear in mind these steps when planning a database program, especially if you know that many clients could request the same service at the same time. Then, if you have a performance problem, you might guess where the problem lies.
I'll give you one example where a misunderstanding of the client/server principles creates a waste of memory and thus a reduction in general performance.
This script will keep the results locked in the database buffer until the customer has entered some input about all the records. If many clients use the same approach, it will result in the database draining out its memory very soon.
The right approach would be to save all records into an array and then asking for the customer's input. Even better, this situation calls for using one of the fetchall_* methods, asking the server to send you all the results at once.
If I had to name a single item where database programming is creating the most confusion for the beginner, I would say that the data structure is the first candidate.
When dealing with a database, your program has an internal data representation that usually doesn't reflect the external data organization. Most databases store data according to the relational model, where information is organized by tables, without redundancy. The links between data in different tables are called relations and are implemented by using key values.
A typical - albeit simplified - database representation of a sales system includes information about the customers and the products. Each customer has a unique field, also known as "primary key", which identifies the record. The same is true for the products.
Thus the order records, instead of repeating all the values from customers and products, store just a reference to them, their identifier. When used in such context, a field referring to a primary key in another table, the identifiers are known as "foreign keys".
This is a rather simplified view, which you should integrate with some more complete explanation of the relational model. (See some references at this address.) Here I just want to point to the shock effect of having this data structure in your program.
If I want to get the above data into my program, I have to decide how to represent the records. Should I blindly copy the same data structure?
These arrays are a mirror image of what I can get from the database. Why it is not a good idea to do such a thing? For at least two reasons:
The above data could be represented in your program by a hash of lists:
Transforming the data from the database into this common Perl structure is fairly simple, thanks to a well known idiom:
Another cause for distress is for the programmer who understands some of the problems coming from the relational model, but fails to understand that each result from the database is just one table, which we can see as a list of lists or a list of hashes, but it is a single entity. Whether it is a query from one table or the joined result of multiple tables, the result from the database is always one bi-dimensional array.
Who is used to desktop database applications, maybe the ones with a colorful GUI, has been exposed to the idea that tables can be "opened" and "edited", and then each record seen on the screen is the actual value in the database. Not so with client/server databases. Even if you are using an application that says "open table", what you get on your screen is the result of a query, just a copy of the actual data, and to modify it you have to issue an SQL command.
The beginner may suffer from the apparent dichotomy of having a permanent external data structure that often does not match with the one used for internal purposes. However, bearing in mind that the external data structure is not our business, since we can ask the database engine to return the most suitable result, we can concentrate on the internal part, considering the database interaction just a method to fill in our variables.
Databases can hold huge amounts of records. It should not come as a surprise to us. However, we sometimes forget that also database queries can give back a large number of records.
This fact could be neglected when we build an application and we use a test database that is likely to be empty or to hold just a handful of records. What could seem to work flawlessly in a database with just 100 records and three users might become disastrously inefficient when loaded with ten million records and thirty thousand users.
Database engines can filter your requests and give you an answer surprisingly quickly, provided that you asked the right thing.
Unless you have a very good reason, asking for one million records at once is unlikely to be "the right thing".
More database trouble
SQL is an abstract concept. You can't learn practical SQL without associating it with a specific database. And then you realize that what you are using is a database specific dialect, which looks like the mythical ANSI SQL, but it is very rarely implemented as such. It is something like learning English. There is no "standard English" anywhere, only dominant dialects, spread throughout the world by the force of Hollywood, the BBC and the CNN. Depending on which stream you were exposed, you will say "pavement" or "sidewalk" and you'll wear a "vest" under or over your shirt.
Similarly, if you have learned the Oracle syntax for a left join, you might look suspiciously at the MySQL equivalent.
The standard ANSI SQL provides many common chunks of the language, making you cope with most of the day-by-day cases. Knowing which parts of SQL implemented in your database of choice will comply with the standard is not only academic knowledge. It is a good investment towards portability and simplicity. If you use only standard constructs, you are more likely to find an answer when you get into trouble. If you are using only database-specific idioms, then you either are a wizard or you have to find one when things go wrong.
The DBI is the official interface between Perl and databases. It comes on the assumption that you know what you want, and you also know the underlying database, its design and how to deal with it. It may seem harsh, but such is life. Database programming does not have shortcuts. You may find some module that will help you to save some keystrokes. Some modules will handle many aspects of the database complexity for you. Some will ever promise to make database programming easier to understand, or to free you from the burden of database design. Don't fall into that trap. A DBI wrapper could be useful to speed up your development process or to help you handle the complexity of the database schemas. But none of them will replace your understanding of the subject. If you use a wrapper that promise you to deal with INNER, LEFT and RIGH JOINs, which you can't fully understand, then you are trusting an unknown piece of code to do something of which you have a limited command.
There are way too many database modules in the CPAN. Some of them are extremely valuable, provided that you understand what they are doing. Some are just a smart engine to save you some typing at the price of crippling the performance of your application and your flexibility in choosing the the most appropriate solution.
Like many other technological gadgets, DBI wrappers could be of help for the beginner, but they will keep the beginner in the lower end of the professional growth. Many people say that the DBI has a steep learning curve. That is not true. If you are not familiar with any relational client/server database, then the DBI will be as difficult as learning your first foreign language. However, if you know how a relational database works, especially if you know how a client/server engine reacts to your requests, then you can learn the DBI by evaluating similarities and differences with the general concepts that you already have in your bag of tricks.
The DBI is a tool. You can use it well if you know your trade. If you lack general database understanding, using the DBI will be like driving in the dark with your headlights off, and no cute module can protect you from an incoming disaster.
Note: Don't get me wrong. I am not against DBI wrappers. Actually, I appreciate some of them very much, like DBIx::DWIW. I just prefer to be in control. Others may feel differently. I don't impose my view on anybody and I keep an open mind about which wrappers could be useful for my needs.
Who's the client?
I said that client/server database applications have their specific annoyances, compared to stand-alone or desktop databases. One more source of confusion arises when we are dealing with a database server and a web server. Who is the client here?
Although the ultimate client of the web server is the user sitting in front of a browser, the client for the database is always a web server process. This is true even when the web client had to provide her username and password for database access. The web server gets the access information, does the talking with the database and returns an answer to the client.
It's important to remember this fact whenever there is a connection problem with the database. Unless username or password are wrong, the problem doesn't lie with the ultimate client (the web client), but with the web server. Testing the database connection with the appropriate user information outside the web server will usually tell you if you were providing your script the right parameters.
It's said that the only good advice you can get comes from your own experience, especially your own mistakes. While it is true that I cannot magically transfer to you my database programming experience, I can at least give you my summary of what I feel is its positive outcome. You may decide if it is useful and eventually give it a try.
Enhance your skills
If you consider all the elements involving database programming, the best piece of advice I can give you is to test such elements separately. When you face a problem, you should at least be able to attribute the problem to the appropriate layer: the script body, the web server interface, the DBI interaction, or whatever other module you may be using for your purpose.
Database related advice
The soundest advice I can give you is "Design before using". While this is true for all computer programming, it has a special meaning for database, because it implies some knowledge of the relational model.1
In database programming, you can shoot yourself in the foot in several ways. Even taking into consideration all the above wisdom and applying all the advice, your application can collapse by means of some silly overlooking. I will list here the most common ones, because I have seen them in too many SOPW requests.
That's all, folks!
That was all I could do without crossing the boundary between article and book.
I hope I managed to shed some light on a few dark, neglected aspects of database programming.
The bottom line is:
To make a better application, plan in advance, know your tools, and let the database engine do its job.
Any further contribution will be welcome.
update 4-May-2002. Fixed some typos.
_ _ _ _ (_|| | |(_|>< _|