|Perl: the Markov chain saw|
Speeding up the DBIby gmax (Abbot)
|on Jul 14, 2003 at 12:02 UTC||Need Help??|
Those of us who have learned the DBI from the book, or from the official perldoc some versions ago, may not have noticed that the DBI, in recent versions, has added several new features and improved existing ones.
This is my case, anyway. I started working with the DBI from version 1.13, and only recently I opened my eyes to a whole new set of improvements.
This article discusses some of such features, with particular emphasis to the ones aiming at the application efficiency.
Let's start with the easy stuff.
This is not really an improvement in recent versions, but I believe it is important to mention it here, since there could be some misunderstanding about this issue.When issuing queries that do not return any result, the official way is to use the $dbh->do() method. Nothing wrong with that in principle, provided that we don't use it inside a critical loop.
What's wrong with this code? Nothing, except the do method is quietly creating and destroying a statement handler at each loop.
It's better to rewrite the loop this way.
The performance gain is achieved even for those databases (like MySQL) that don't support a prepare feature. Avoiding the creation and destruction of a statement handler in large loops can speed up things a lot.
Even better, if your database supports it, use the multiple INSERT statement.It requires more work on the client side, but you can insert thousands of records at once.
The above code works with MySQL. You should adjust to your database SQL syntax if different. See below a sample script with a more complete example of this idiom.
You know that fetchrow_arrayref is faster than fetchrow_array(). And you also know that fetchall_arrayref and selectall_arrayref are faster than fetchrow_arrayref. The question is how much faster? And, more important, is are they always faster?
Using a test MySQL table of 100_000 records with 20 columns of 9 chars each, Tim says that fetchrow_array is 12% slower than fetchrow_arrayref.
Notice first of all that fetchall_arrayref is slower than fetchrow_arrayref when fetching a relatively large amount of data. Why? Because the gain in speed is balanced by the amount of time needed to allocate memory for the results. Actually, if you call fetchall_arrayref without giving any size limit, you may very well get into trouble, either exhausting your memory or slowing your application response time unbearably.
But, wait a minute! How did he come up with such figures? How can you tell how many fetches per second your application has got?
You may think that, since he is the DBI creator, he must have a special test suite that he uses to amaze listeners at the Open Source conventions. Nothing like that. You can get the same results with the help of DBI::Profile.
But before coming to that, let's first approach the subject of dealing with large results.
We have seen that fetchall_arrayref is the fastest method you can use for fetching rows. The problem is that, when you have a large dataset that can even become larger as time goes, you are faced with the problem of memory allocation. Fetching 100,000 rows with 20 columns of 9 chars each is 17 MB, but Perl is not going to allocate simply that. Due to Perl's dynamic memory allocation for arrays, your 17 MB are likely to become 20 or more. And surely there are cases where you don't want that much memory locked up. But anyway, even if Perl were allocating just the memory it exactly takes to store your data and the records were 1 million or 10 million, then you would definetely avoid fetchall_arrayref.
So, unless you really need to have all your dataset nicely stored in your client memory all at once, you should either use fetchrow_arrayref or use another particular idiom that Tim Bunce recommends.
The truth is, despite the name, you don't have to fetch all the records fetchall_arrayref. You can pass a $max_rows argument, which will make fetchall_arrayref behave like a fetchrow_arrayref on steroids.
Passing $max_rows to the method we give a limit of the number of rows that we want returned to our array reference at any given time. So instead of having an array of 1,000,000 rows, you are going to have just a 5,000 rows array, which will be refreshed 200 times until all rows are processed.
If you want to process the rows one by one and still take advantage of fetchall_arrayref speed, you can use this handy idiom.
$aref gets the next available row from $rowcache. When it is empty, another call to fetchall_arrayref will replenish the cache, and so on till the end.
Before profiling, we need to create a suitable data set. Here are two scripts to create a 100,000 records table with MySQL and SQLite.
The following script creates a table "testdbi" in the "test" database. You can change the number of wanted records, columns, sizes, and the number of rows per insert query.
This scripts gives you also an example of how to build a large INSERT query with multiple lines. The default value is 10,000 rows per query. If your system has a small max_allowed_packet (the default is 1MB) you can either increase its size or reduce the number of rows.
The table created is the same seen for the MySQL example. The main difference is that SQLite does not support multiple lines INSERTs. The point here is the transaction size. If you COMMIT too often, the insertion takes forever. Depending on your system memory, a transaction size ($max_commit) between 1,000 and 5,000 should allow for a creation in a reasonably short time (20 seconds in my P4 1.7Ghz).
To improve the efficiency of a program we need to find its bottlenecks, the parts where the application spends most of the time, so that we can concentrate on the hot spots.
Benchmarking is something we do when we want to compare two or more different approaches, to find the most efficient one.
Profiling, on the other hand, is what we do when we don't know where to look, and we want to know which parts of our application are occupying most of the time.
Perl script profiling is something you can do using the general purpose module Devel::Profile. This is a module that runs your script in debug mode and tells you how long does it take to run each routine in it. However, it tells you the time spent in each sub looking from outside. You may get a clue of where the bottleneck is, but when dealing with the DBI, where efficiency is paramount, you may want to go further, and get more detail, which should only come from the DBI innards.
Benchmarking is a possibility. You compare a few very intensive routines that use similar DBI methods, to see how they can affect the whole application. Trouble is, for benchmarking a fetch method you should either include in the sub also the execute call or compare the method using two different handlers, thus risking that the general performance of the database may influence your results one way or another.
I am not saying that benchmark are useless. Depending on your needs, they could fit the bill just fine, and actually I use them quite often. I just want to add a word of caution about examining carefully what you are including in your benchmarks, either implicitly or explicitly.
If you don't like benchmarks, or they don't suit your immediate purpose, or simply you want to know what yoy need to benchmark, then the other road to take is DBI::Profile. Unlike Devel::Profile, it works in direct co-operation with the DBI, using appropriate handles to give you accurate measurements of what your DBI methods are doing. It will also profile itself, so you know how much overhead you should consider when evaluating its results.
The easiest way of enabling profiling is by setting an environment variable.
Given this test script,
And the output should be something like this.
Not really enlightening, eh? Let's try with some other level. Using DBI_PROFILE=4 we get more info:
Now you can see that every DBI method used in the script is separately profiled. Let's look at the entry for fetchrow_arrayref. The first number is the actual time spent (0.000041 seconds), which, divided by the number of calls (2) gives the average (0.000021 seconds) time for each execution.
Looking at the script again, you see that there is only a "SELECT COUNT(*)" query without "GROUP BY" clause, so it should return only one row. Why is DBI saying that fetchrow_arrayref was called 2 times? I leave this as an exercise for the reader. (Highlight the text below with your mouse to show the answer)
The environment variable method can be useful for a quick and dirty solution. But the DBI::Profile is more than that, and it provides you several more methods to fine tuning your application.
Before I go into details, let me tell you that, for all the profiling methods, you can redirect the output to a file, format the output with DBI::ProfileDumper and read the report using an ad hoc application (dbiprof).
Read more about these features:
Coming back to our fine tuning needs, let's see how we can get a profile of just the pieces we need.
Instead of firing up profile support for everything, you can enable it for just one handler, either for the whole program or for a limited time only.
Level 4 is what we have seen in the examples so far. The level 8, in addition, gives you the origin of the methods called. Applying these changes to our previous example, and calling it without setting the environment variable, we get this output:
Notice now that the first four entries are about $dbh and are whown in level 4 format, while the rest of the report pertains to $sth and show, in addition, which method in the DBD driver was actually called. If you are wondering how and when the report gets printed, it happens when the handler gets DESTROYed. You need to remember this, since the profile status remains enabled until the handler gets out of scope or it is manually deactivated.
To disable a handler's profile status, set it to 0.
This will prevent the profile report from kicking out at DESTROY time. If you have done this, though, you need to print or export the profile data somehow, or else profiling would have been pointless. There are many ways for this task. My favorite one is creating an instance of a DBI::Profile object and using its methods directly.
There is more (a lot more) to say about DBI profiling, but I was only aiming at getting you staarted with this subject, which to the best of my knowledge has never been discussed in PM so far (It was just briefly mentioned in two nodes).
For more info, then, you should resort to
As a last issue, I present you a profiling and benchmarking script. It uses the test table created above and tests the fetch of 1, 10, and 20 columns from 100,000 records.
My intention with this script was to replicate Tim Bunce's benchmarks. After running it in a couple of machines, though, I saw that its results are heavily influenced (as you can imagine) by hardware and software configuration. Especially with MySQL, when available RAM and processor speed change substantially, the results may range hugely. In a well optimized server, I calculated a number of fetches 10 times faster than the ones reported by Tim Bunce.
But the biggest surprise was that, in a fast server, the difference between fetchrow_array and fetchrow_arrayref is sometimes leveled. Notice in the results that fetchrow_array seems to be faster than fetchrow_arrayref when fetching one column only. With a slower computer, this was not the case. I am trying to pinpoint the reason for that, and every help in that direction will be appreciated.
You can run the script for MySQL of SQLite. Just add the driver name in the command line. For other drivers, it should be enough to add an appropriate connection statement. What follows is a sample output for MySQL (4.0.13) under Linux.
This node ran for longer than I thought when I started thinking about it. Actually, I had something more to say on the whole subject, but I guess that for one reading session this should be more than enough.
What have I learned while digging for this presentation? Mostly that I should take the general instructions with caution. I know that method x is faster than method y, which is in turn faster than method z. All beautiful, except sometimes, depending on the design, the network load, the number of records, method z could be faster than method x.
What to do then? Design carefully, test, benchmark and profile. Then re-read your design. If it makes sense together with the numbers you've crunched in your tests, go ahead. Else, spend some more time thinking about the matter. It usually helps.
Update (1) Thanks to dws, who spotted an inconsistency in one of my first examples. Now fixed.
Update (2) Thanks to jeffa, for spotting some unclosed tags. Fixed.
_ _ _ _ (_|| | |(_|>< _|