How to count rows using DBI & mysql

Background

This is rather long, but, I hope helpful.
As part of developing a (DBI/MySQL) database to store info about modules (more @ Module Dependencies), I needed to repeatedly check if a certain row already exists in the Database in order to avoid putting another identical row in.

Each row is a tuple showing a "this module uses that module" relationship.

I wanted to do some performance testing to find out the fastest way of discovering if there is a matching row already in the DB.

Aside: I can't find the equivalent of an efficient "INSERT ... IF NOT EXISTS" in MySQL.

The code below shows the different ways I found to get the result.

Conclusions

The fastest way to do it is :

   my $sql = "SELECT uses FROM depend WHERE 
      usedby='Depend::Module' AND
      uses='DBI'";
   my $rows = $dbh->do($sql);
[download]

This may not seem surprising !

using WHERE A='X' AND B='Y' is faster than WHERE A='X'.
using COUNT(*) is very slow
do() is more than twice as fast as prepare()/execute()
Doesn't matter which column SELECT returns to do counting
SELECT col1 is slightly faster than SELECT *

I found the first 3 of these surprising. In particular, I thought that do() was just a wrapper round prepare/execute, so the time difference is significant.

Other optimisations and comments

Store IDs, rather than names in the DB. (Not tried - Would prob. be quicker, but at the expense of more lookup code and complication.)
Cache results in PERL. (Not tried - Likely to be quicker, but less "pure" and some complexity in forward references.)
Store "uses" relationship as a joined string in the module table and use PERLto manage. (Tried - was quicker, but again, not DB pure)
Not tried on other DBD drivers
The DB was small enough to fit in memory, and machine had no load, no no disk accesses during tests.
I don't know if there are any internal optimisations happening (i.e. reusing cached results)
I was less interested in the fact that some of the tests returned the 'wrong' result than the time taken, clearly in a full solution you need the correct answer :-)
Not sure whether the do() is portable. The DBI code seems to suggest that some drivers return -1 untill all rows are fetched. Not sure if this would propagate back through the do() result.

Test Code

#! /usr/local/bin/perl -w

#use strict; # because of the sub references
use File::Find;
use DBI();
use Depend::Module;
use Date::Manip;
use Benchmark;

### Table created with
##  CREATE TABLE depend (\
##     usedby VARCHAR(80) NOT NULL ,\
##     uses   VARCHAR(80) NOT NULL \
##  )

my $dbh = DBI->connect("DBI:mysql:database=modules;host=localhost",
      'root', 'blah', { 'RaiseError' => 1 });

for ('Way1A','Way2A','Way2B','Way3A','Way4A','Way4B','Way5A','Way5B','
+Way6A')
{
   my $ret = &$_;
   print "$_ returns [$ret]\n";
}

my $res = timethese(10000,
   {
   'Way1A'=>\&Way1A,
   'Way2A'=>\&Way2A,
   'Way3A'=>\&Way3A,
   'Way4A'=>\&Way4A,
   'Way4B'=>\&Way4B,
   'Way5A'=>\&Way5A,
   'Way6A'=>\&Way6A,
   'Way2B'=>\&Way2B,
   'Way5B'=>\&Way5B,
   });

Benchmark::cmpthese( $res) ;

exit;

######## Subs
sub Way1A
{
   # Incorrect result, Only check one field
   # Doesn't count second field in this routine
   # uses prepare and execute

   my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module'";
   my $sth = $dbh->prepare($sql);
   my $ret = $sth->execute;
   return $sth->rows;

}

sub Way2A
{
   # Incorrect result, Only check one field
   # Doesn't count second field in this routine
   # uses do

   my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module'";
   return $dbh->do($sql);
}

sub Way2B
{
   # Incorrect result, Only check one field
   # Doesn't count second field in this routine
   # uses do
   # SELECT returns * == both columns

   my $sql = "SELECT * FROM depend WHERE usedby='Depend::Module'";
   return $dbh->do($sql);
}

sub Way3A
{
   # Generates correct result
   # uses prepare and execute

   my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module' AN
+D uses='DBI'";
   my $sth = $dbh->prepare($sql);
   my $ret = $sth->execute;
   return $sth->rows;
}
sub Way4A
{
   # Generates correct result
   # uses do

   my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module' AN
+D uses='DBI'";
   my $ret = $dbh->do($sql);
   return $ret;
}

sub Way4B
{
   # Generates correct result
   # uses do
   # counts other column

   my $sql = "SELECT usedby FROM depend WHERE usedby='Depend::Module' 
+AND uses='DBI'";
   my $ret = $dbh->do($sql);
   return $ret;
}

sub Way5A
{
   # Fetch each row manually and do our own count
   # Incorrect result, only checks first field

   my $count = 0;
   my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module'";
   my $sth = $dbh->prepare($sql);
   my $ret = $sth->execute;
   while ($ret = $sth->fetch()) { $count++};
   return $count;
}

sub Way5B
{
   # Fetch each row manually and do our own count
   # Correct result

   my $count = 0;
   my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module' AN
+D uses='DBI'";
   my $sth = $dbh->prepare($sql);
   my $ret = $sth->execute;
   while ($ret = $sth->fetch()) { $count++};
   return $count;
}
sub Way6A
{
   # use COUNT(*) to do counting
   # Correct result

   my $sql = "SELECT COUNT(*) FROM depend WHERE usedby='Depend::Module
+' AND uses='DBI'";
   my $sth = $dbh->prepare($sql);
   my $rst = $sth->execute;
   my @count = $sth->fetchrow_array();
   return $count[0];
}

__DATA__

Results using 10,000 iterations

        Rate Way5A Way6A Way1A Way5B Way3A Way2B Way2A Way4B Way4A
Way5A 1085/s    --   -9%  -10%  -13%  -13%  -52%  -53%  -65%  -65%
Way6A 1190/s   10%    --   -1%   -5%   -5%  -48%  -48%  -61%  -61%
Way1A 1208/s   11%    1%    --   -3%   -4%  -47%  -48%  -61%  -61%
Way5B 1250/s   15%    5%    4%    --   -0%  -45%  -46%  -59%  -59%
Way3A 1252/s   15%    5%    4%    0%    --  -45%  -46%  -59%  -59%
Way2B 2268/s  109%   90%   88%   81%   81%    --   -2%  -26%  -26%
Way2A 2304/s  112%   94%   91%   84%   84%    2%    --  -25%  -25%
Way4B 3077/s  184%  158%  155%  146%  146%   36%   34%    --    0%
Way4A 3077/s  184%  158%  155%  146%  146%   36%   34%    0%    --
[download]

--
Brovnik

Comment on How to count rows using DBI & mysql - Performance results Select or Download Code

Replies are listed 'Best First'.
Re: How to count rows using DBI & mysql - Performance results by busunsl (Vicar) on Jun 11, 2001 at 14:59 UTC
Some things that might have an influence on the benchmarks, depending on the RDBMS: If you just test for existence, you can use the `exists` function in a bit of sql-code: `if not exists (select * from depend where usedby='Depend::Module' AND +uses='DBI') begin insert .... end` [download] An index on the columns in where clause might have a huge impact if you use `count(*)`.	[reply] [d/l] [select]
Re: Re: How to count rows using DBI & mysql - Performance results by Brovnik (Hermit) on Jun 11, 2001 at 15:37 UTC
Useful comment. I hadn't thought of getting SQL to do the logic. An index on the columns in where clause might have a huge impact if you use count().* I believe that MySQL creates an index on the leftmost subset of columns anyway, due to the way it stores the data. It should be intelligent enough to only compare the second row if the first row succeeds. In the test case, only there are only 4 rows, so testing the remainder of the AND should be quick. Have just tested the suggestion. In fact, not having indexes doesn't seem to make any difference (1230/sec vs. 1174/sec), so I think I am correct that there is an index by default on the leftmost column. -- Brovnik	[reply]
Re: Re: Re: How to count rows using DBI & mysql - Performance results by busunsl (Vicar) on Jun 11, 2001 at 18:09 UTC
Most RDBMS won't even consider using an index on a table with only 4 rows due to the overhead in I/O and processing. I created the table in a Sybase database, filled it with about 2500 rows with my db-stress tool and tested against it (with only 1000 iterations). Read more... (15 kB)	[reply] [d/l] [select]
Re: Re: Re: Re: How to count rows using DBI & mysql - Performance results by elwarren (Priest) on Jun 11, 2001 at 22:45 UTC
Re: How to count rows using DBI & mysql - Performance results by DrZaius (Monk) on Jun 11, 2001 at 19:16 UTC
First off, if you want your code at all portable, do not use `REPLACE`. It is mysql specific (as far as I know). Next, indexes are great. Try `EXPLAIN`ing your queries. That will give you a clue on what is going down. Finally, make a function to check if the id exists and memioze it. This is essentially creating a perl cache, but it is transparent to you.	[reply] [d/l] [select]
Re: How to count rows using DBI & mysql - Performance results by Zaxo (Archbishop) on Jun 11, 2001 at 17:43 UTC
Declare your unique index to be a primary key. $sth->execute will fail if you try to insert a duplicated primary key. mysql> create table trial (id varchar(5) primary key, perl_mod varchar +(30), summary varchar(255)); Query OK, 0 rows affected (0.06 sec) mysql> insert into trial set id='aaaaa', > perl_mod='CGI', > summary='Essential'; Query OK, 1 row affected (0.04 sec) mysql> insert into trial set id='aaaab', > perl_mod='DBI', > summary='Essential'; Query OK, 1 row affected (0.00 sec) mysql> insert into trial set id='aaaab', > perl_mod='Lingua__Romana__Perligata', > summary='Fun'; ERROR 1062: Duplicate entry 'aaaab' for key 1 mysql> [download] After Compline, Zaxo	[reply] [d/l]
Re: Re: How to count rows using DBI & mysql - Performance results by jepri (Parson) on Jun 11, 2001 at 20:20 UTC
This trick has caused me problems. At one point I had Tie::DBI crashing when it got an error from the POSTGRESQL database saying the the database couldn't insert a row. I don't know why, but I had to work around by doing the SQL myself :( Also, talking coding style, finding out if you can do something by seeing if it throws an error is probably a worse option than checking for it cleanly in the first place. ____________________ Jeremy I didn't believe in evil until I dated it.	[reply]
Re: Re: Re: How to count rows using DBI & mysql - Performance results by Zaxo (Archbishop) on Jun 11, 2001 at 23:11 UTC
My point was to demonstrate the property of unique indexes which prevent duplication: a safety measure. I never said to use the error for program control. In fact I didn't exhibit any DBI code at all. In practice, with the index in place: `my $known_dep=$dbi->prepare_cached("select 1 from depends where usedby +=? and uses=?"); sub known_dep ($$) {#prototype because we have a fixed number of place +holders $known_dep->execute(@_); $known_dep->rows; # 1 or 0 for unique index. }` [download] For frequent operations, we can improve performance with the sth->bind_params() method. We search on an index, using a pre-prepared statement handle with placeholders. That will be fast. I didn't mention before that NULL in multiindexes is only available in MyISAM tables. After Compline, Zaxo	[reply] [d/l]
Re: Re: Re: How to count rows using DBI & mysql - Performance results by elwarren (Priest) on Jun 11, 2001 at 23:14 UTC
It is definitly a coding style question, because we do it all over the place with PL/SQL in Oracle. Though this would not generate an error, just an exception. We use the same technique to see if a query a row from cache, if it returns an exception then we hit the table instead.	[reply]
Re: Re: How to count rows using DBI & mysql - Performance results by hding (Chaplain) on Jun 11, 2001 at 17:51 UTC
If adopting this strategy, why not simply make it a unique index, as it may in fact not be the primary key of the table?	[reply]
Re: Re: How to count rows using DBI & mysql - Performance results by Brovnik (Hermit) on Jun 11, 2001 at 18:45 UTC
But I want to be able to have multiple rows with the same key/column1, Unique keys prevent that. -- Brovnik	[reply]
(Unique Index) Re x 3: How to count rows using DBI & mysql - Performance results by Zaxo (Archbishop) on Jun 11, 2001 at 19:25 UTC
hding is right. In that case, declare a multicolumn unique index. That does precisely what you are asking for. MySQL will look at the first column first, so it should be the more unique. [OB] A table needs a primary key. `create table depends (uses varchar(80) not null, usedby varchar(80) not null, unique deps (usedby,uses)); # naming the index is unnecessary, but may be a convenience` [download] Btw, why not represent a leaf node by usedby=NULL? That would give a faster select for leaf/branch. After Compline, Zaxo	[reply] [d/l]
Re: Re: Re: How to count rows using DBI & mysql - Performance results by hding (Chaplain) on Jun 11, 2001 at 19:07 UTC
Then why do you need to do an "INSERT ... IF NOT EXISTS"? (Note that if you're trying to enforce the uniqueness of some combination of columns, you can make the index on all of the columns; maybe this is what you mean.)	[reply]
Re: Re: Re: How to count rows using DBI & mysql - Performance results by elwarren (Priest) on Jun 11, 2001 at 22:38 UTC
In this case you should have a one to many foreign key relationship to a second table. This table would contain a unique column with all of your modules in it. Insert into this table, catch the exception if it happens.	[reply]
Re: How to count rows using DBI & mysql - Performance results by elwarren (Priest) on Jun 11, 2001 at 23:06 UTC
First let me say that this is not flame bait. I can only comment on Oracle, but these results should not be considered useful against any other database besides MySQL. I'll save the gritty details unless anyone wants to hear about them. One issue that I've not seen mentioned here is whether the database lives on the same machine as the code you're executing. I'm assuming it does, or you have a very fast network, or you have a small set of data. This is because if you were running on another machine your query would have to return the result set over the network, which could take a long time depending on the size of your data. One million rows * 1kB is much larger than a single count of the rows. Another optimization you may want to test out would be to try doing a SELECT COUNT(1) FROM TABLE WHERE BLAH this will return the same result as COUNT() but may run faster depending on what kind of optimizations are done. This can be faster because we're not asking the database to return rows from the table, just the count. An old oracle trick until they internally optimized COUNT() HTH	[reply]
Re: How to count rows using DBI & mysql - Performance results by bart (Canon) on Apr 30, 2004 at 13:13 UTC
Not sure whether the do() is portable. The DBI code seems to suggest that some drivers return -1 untill all rows are fetched. Not sure if this would propagate back through the do() result. I can confirm that it indeed does return -1, with a recent version of DBD::ODBC, and MS Access. It's a trick you should only use very cautiously.	[reply]
Re: How to count rows using DBI & mysql - Performance results by tune (Curate) on Jun 11, 2001 at 18:08 UTC
Just a short tip: The REPLACE function from MYSQL. It checks if the row is existing, and doing an INSERT (if not), or an UPDATE (if it is). However if you do not want the UPDATE feature, it does not help you. -- tune	[reply]
Re: Re: How to count rows using DBI & mysql - Performance results by Brovnik (Hermit) on Jun 11, 2001 at 18:51 UTC
I tried that, but REPLACE only works on a unique index. I need to be able to have e.g. `COL1 COL2 A B A C A D B C C D B D` [download] So neither column is unique. -- Brovnik	[reply] [d/l]


Come for the quick hacks, stay for the epiphanies.
	PerlMonks

How to count rows using DBI & mysql - Performance results