http://www.perlmonks.org?node_id=118081

Kozz has asked for the wisdom of the Perl Monks concerning the following question:

This may be more of a database question than a Perl question, but I thought the most wise monks could answer it either way.

I need to write a perl script using the DBI module (mysql driver), and I will need to iterate over each and every record in a given table to perform some calculations. I thought perhaps I should just go ahead and do something like

SELECT * FROM tablename
and then use $sth->fetchrow_arrayref() to deal with each row of results one at a time. However, there are thousands of records in the table right now, and when it's finally implemented, likely hundreds of thousands of records. Is this massive list of results stored in a temporary table someplace in the MySQL system, or is it stored in memory in the DBI object?
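
Roughly the pattern I have in mind (the DSN, credentials, and table name are placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Placeholder connection details.
    my $dbh = DBI->connect('dbi:mysql:database=mydb', 'user', 'pass',
                           { RaiseError => 1 });

    my $sth = $dbh->prepare('SELECT * FROM tablename');
    $sth->execute;

    # Deal with one row of results at a time.
    while (my $row = $sth->fetchrow_arrayref) {
        # ... perform calculations on @$row ...
    }

    $sth->finish;
    $dbh->disconnect;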

My goal is to be able to iterate over each and every record without the entire result set being shoved into memory, possibly consuming most of the system's overall memory. It's not important that this script run very fast, but I don't want it to slow the entire system to a crawl.

Again, forgive me, but I'm not sure if this is a perl question or database question. Probably both.

Humbly yours,
Kozz

Danger! Danger!
by perrin (Chancellor) on Oct 11, 2001 at 01:47 UTC
    If you use fetchrow_* with DBD::mysql, it will actually fetch EVERY row into the client's local memory! I know it sounds strange, but that's the way it works by default for that particular DBD driver. If you want to change this behavior, you need to look at the mysql_store_result option. See this thread on the mod_perl list for more.
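
    A minimal sketch of the streaming alternative; DBD::mysql exposes this through the mysql_use_result attribute (untested, so check your driver version's docs for the exact spelling):

        # Ask DBD::mysql to stream rows from the server instead of
        # buffering the whole result set in client memory.
        my $sth = $dbh->prepare('SELECT * FROM tablename',
                                { mysql_use_result => 1 });
        $sth->execute;
        while (my $row = $sth->fetchrow_arrayref) {
            # only one row is held on the client at a time
        }

    The trade-off: in this mode you must fetch every row before issuing another query on that connection, and the server may hold locks for the duration.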
Re: Memory usage with DBI
by higle (Chaplain) on Oct 10, 2001 at 23:06 UTC
    The fetchrow_array and fetchrow_arrayref methods retrieve data from the data source one row at a time, so the system the script runs on only has to hold one row's worth of data at any moment.

    The query results may be queued up in a buffer on the database side, but the database is optimized for that sort of thing.

    Overall, the DBI module is quite memory efficient. I've got a similar script running on my webserver that queries about a hundred thousand rows of data, generates HTML pages from the data, and copies the whole thing over to another machine. The batch takes about 10 to 15 minutes to run, even with all the disk I/O and querying going on, and has no noticeable effect on the other processes on that machine.

    I wouldn't worry too much about it! :)

    higle
      I wouldn't worry too much about it! :)

      actually, i'd worry plenty about memory usage and caching result sets if i were Kozz. as was remarked, the fetchrow_ methods do precisely that: fetch a single result record. however the fetchall_ methods cache entire result sets locally; that can get pretty massive if you're querying against credit bureau data or some 120 GB instance.

      it's usually a good idea to let the database server do as much work as possible, including summarization and grouping operations. sometimes you do need all query results before you can proceed, but more often than not, you can just process the results iteratively. and to reiterate Higle's comment, the dbms is usually much more conscientious about memory optimization than you'll be on the client side.
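
      for instance, rather than dragging every row across the wire just to total a column, ask the server to do it (table and column names invented):

          # one aggregated row comes back instead of thousands.
          my ($total) = $dbh->selectrow_array(
              'SELECT SUM(amount) FROM tablename'
          );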

      cheers.

Re: Memory usage with DBI
by CubicSpline (Friar) on Oct 10, 2001 at 23:24 UTC
    I agree with higle that you don't have to worry about the number of rows that will be returned, but I would warn you about doing "SELECT *". There are probably a number of discourses about this on the site already; it's generally a good idea to select only the fields you need. If you're pumping those fields into an array, lots of columns makes things messy, and anyone reading your code later has to puzzle out why you want $a[0], $a[3], and $a[17] from the result array. Selecting fewer columns also speeds up the query a bit, since the database doesn't have to assemble every column for each row. In general, it's just better practice not to SELECT *.
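
    For example (column names invented), naming the columns and binding lexicals to them keeps the code self-documenting:

        # Select only the fields you need and bind variables to them,
        # so readers see names instead of $a[0], $a[3], $a[17].
        my $sth = $dbh->prepare('SELECT id, name, price FROM tablename');
        $sth->execute;
        $sth->bind_columns(\my ($id, $name, $price));
        while ($sth->fetch) {
            print "$id: $name costs $price\n";
        }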
Re: Memory usage with DBI
by guidomortonski (Sexton) on Oct 11, 2001 at 15:25 UTC
    As someone else noted, select * is a bad idea. Even if you *want* all the fields in the table, it's better form (and easier to manage in the long run) to name the fields you want to select. If you don't, and someone changes the table, then all of a sudden your array is out of whack and everything breaks. The same goes for inserts.

    If performance and speed are not as big an issue as nice, readable code, you could also consider using fetchrow_hashref to fetch each row as a hash of field name/value pairs.
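
    Something along these lines (field names invented):

        my $sth = $dbh->prepare('SELECT id, name FROM tablename');
        $sth->execute;
        while (my $row = $sth->fetchrow_hashref) {
            # fields are accessed by name, not array position
            print "$row->{id}: $row->{name}\n";
        }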

    Guy

Re: Memory usage with DBI
by tadman (Prior) on Oct 11, 2001 at 19:01 UTC
    The "list of massive results" is stored locally in the driver, and is fetched through whatever method you prefer. This means that if you have a large table, you will need a large buffer. This bit me too, in DBI + 'SELECT *' - Memory Use Galore?. The two solutions were as follows:

    1. Use the 'LIMIT ?,?' clause to retrieve the information incrementally, in blocks of a manageable size: for example, 'LIMIT 0,10', then 'LIMIT 10,10', and so on (a rough sketch follows this list). There is a risk, though, of duplicated or missing results if the table is modified between fetches.
    2. Use the low-level MySQL 'mysql_use_result' feature on a separate DBI handle (the connection is tied up until every row has been read), which switches to a non-buffered fetch. There is no real risk of duplicate or missing results.
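
    A rough sketch of the first approach (chunk size, table, and column names are arbitrary; the limit values are interpolated as integers rather than bound, since some drivers mishandle placeholders in LIMIT clauses):

        my $chunk  = 1000;
        my $offset = 0;
        while (1) {
            my $sql = sprintf('SELECT id, value FROM tablename LIMIT %d, %d',
                              $offset, $chunk);
            my $rows = $dbh->selectall_arrayref($sql);
            last unless @$rows;     # no rows left: we're done
            for my $row (@$rows) {
                # ... process $row->[0] and $row->[1] ...
            }
            $offset += $chunk;
        }
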
    You might expect that fetching the data row-by-row means the client buffers only one row at a time, but this is not the case. Every time you call "execute", whether explicitly or implicitly (such as through "do"), the entire result is buffered in the client. This can consume an awful lot of memory, especially on tables with many rows.

    You can reduce memory usage by selecting only a few key columns. 'SELECT *' will download everything, and if you have several long text fields, this can be very expensive. 'SELECT some_key_field', by comparison, is much more compact.