Background
This is rather long, but, I hope helpful.
As part of developing a (DBI/MySQL) database to store info
about modules (more @ Module Dependencies), I needed to
repeatedly check if a certain row already exists in the Database
in order to avoid putting another identical row in.
Each row is a tuple showing a "this module uses that module" relationship.
I wanted to do some performance testing to find out the fastest way of discovering if there is a matching
row already in the DB.
Aside: I can't find the equivalent of an efficient "INSERT ... IF NOT EXISTS" in MySQL.
The code below shows the different ways I found to get the result.
Conclusions
The fastest way to do it is :
my $sql = "SELECT uses FROM depend WHERE
usedby='Depend::Module' AND
uses='DBI'";
my $rows = $dbh->do($sql);
This may not seem surprising !
- using WHERE A='X' AND B='Y' is faster than WHERE A='X'.
- using COUNT(*) is very slow
- do() is more than twice as fast as prepare()/execute()
- Doesn't matter which column SELECT returns to do counting
- SELECT col1 is slightly faster than SELECT *
I found the first 3 of these surprising. In particular, I thought that
do() was just a wrapper round prepare/execute, so the time difference is
significant.
Other optimisations and comments
- Store IDs, rather than names in the DB. (Not tried - Would prob. be quicker, but at the expense of more lookup code and complication.)
- Cache results in PERL. (Not tried - Likely to be quicker, but less "pure" and some complexity in forward references.)
- Store "uses" relationship as a joined string in the module table and use PERLto manage. (Tried - was quicker, but again, not DB pure)
- Not tried on other DBD drivers
- The DB was small enough to fit in memory, and machine had no load, no no disk accesses during tests.
- I don't know if there are any internal optimisations happening (i.e. reusing cached results)
- I was less interested in the fact that some of the tests returned the 'wrong' result than the time taken, clearly in a
full solution you need the correct answer :-)
- Not sure whether the do() is portable. The DBI code seems
to suggest that some drivers return -1 untill all rows are fetched. Not sure if
this would propagate back through the do() result.
Test Code
#! /usr/local/bin/perl -w
#use strict; # because of the sub references
use File::Find;
use DBI();
use Depend::Module;
use Date::Manip;
use Benchmark;
### Table created with
## CREATE TABLE depend (\
## usedby VARCHAR(80) NOT NULL ,\
## uses VARCHAR(80) NOT NULL \
## )
my $dbh = DBI->connect("DBI:mysql:database=modules;host=localhost",
'root', 'blah', { 'RaiseError' => 1 });
for ('Way1A','Way2A','Way2B','Way3A','Way4A','Way4B','Way5A','Way5B','
+Way6A')
{
my $ret = &$_;
print "$_ returns [$ret]\n";
}
my $res = timethese(10000,
{
'Way1A'=>\&Way1A,
'Way2A'=>\&Way2A,
'Way3A'=>\&Way3A,
'Way4A'=>\&Way4A,
'Way4B'=>\&Way4B,
'Way5A'=>\&Way5A,
'Way6A'=>\&Way6A,
'Way2B'=>\&Way2B,
'Way5B'=>\&Way5B,
});
Benchmark::cmpthese( $res) ;
exit;
######## Subs
sub Way1A
{
# Incorrect result, Only check one field
# Doesn't count second field in this routine
# uses prepare and execute
my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module'";
my $sth = $dbh->prepare($sql);
my $ret = $sth->execute;
return $sth->rows;
}
sub Way2A
{
# Incorrect result, Only check one field
# Doesn't count second field in this routine
# uses do
my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module'";
return $dbh->do($sql);
}
sub Way2B
{
# Incorrect result, Only check one field
# Doesn't count second field in this routine
# uses do
# SELECT returns * == both columns
my $sql = "SELECT * FROM depend WHERE usedby='Depend::Module'";
return $dbh->do($sql);
}
sub Way3A
{
# Generates correct result
# uses prepare and execute
my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module' AN
+D uses='DBI'";
my $sth = $dbh->prepare($sql);
my $ret = $sth->execute;
return $sth->rows;
}
sub Way4A
{
# Generates correct result
# uses do
my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module' AN
+D uses='DBI'";
my $ret = $dbh->do($sql);
return $ret;
}
sub Way4B
{
# Generates correct result
# uses do
# counts other column
my $sql = "SELECT usedby FROM depend WHERE usedby='Depend::Module'
+AND uses='DBI'";
my $ret = $dbh->do($sql);
return $ret;
}
sub Way5A
{
# Fetch each row manually and do our own count
# Incorrect result, only checks first field
my $count = 0;
my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module'";
my $sth = $dbh->prepare($sql);
my $ret = $sth->execute;
while ($ret = $sth->fetch()) { $count++};
return $count;
}
sub Way5B
{
# Fetch each row manually and do our own count
# Correct result
my $count = 0;
my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module' AN
+D uses='DBI'";
my $sth = $dbh->prepare($sql);
my $ret = $sth->execute;
while ($ret = $sth->fetch()) { $count++};
return $count;
}
sub Way6A
{
# use COUNT(*) to do counting
# Correct result
my $sql = "SELECT COUNT(*) FROM depend WHERE usedby='Depend::Module
+' AND uses='DBI'";
my $sth = $dbh->prepare($sql);
my $rst = $sth->execute;
my @count = $sth->fetchrow_array();
return $count[0];
}
__DATA__
Results using 10,000 iterations
Rate Way5A Way6A Way1A Way5B Way3A Way2B Way2A Way4B Way4A
Way5A 1085/s -- -9% -10% -13% -13% -52% -53% -65% -65%
Way6A 1190/s 10% -- -1% -5% -5% -48% -48% -61% -61%
Way1A 1208/s 11% 1% -- -3% -4% -47% -48% -61% -61%
Way5B 1250/s 15% 5% 4% -- -0% -45% -46% -59% -59%
Way3A 1252/s 15% 5% 4% 0% -- -45% -46% -59% -59%
Way2B 2268/s 109% 90% 88% 81% 81% -- -2% -26% -26%
Way2A 2304/s 112% 94% 91% 84% 84% 2% -- -25% -25%
Way4B 3077/s 184% 158% 155% 146% 146% 36% 34% -- 0%
Way4A 3077/s 184% 158% 155% 146% 146% 36% 34% 0% --
-- Brovnik
Re: How to count rows using DBI & mysql - Performance results
by busunsl (Vicar) on Jun 11, 2001 at 14:59 UTC
|
Some things that might have an influence on the benchmarks, depending on the RDBMS:
If you just test for existence, you can use the exists function in a bit of sql-code:
if not exists (select * from depend where usedby='Depend::Module' AND
+uses='DBI')
begin
insert ....
end
An index on the columns in where clause might have a huge impact if you use count(*).
| [reply] [d/l] [select] |
|
Useful comment. I hadn't thought of getting SQL to do the logic.
An index on the columns in where clause might have a huge impact if you use count(*).
I believe that MySQL creates an index on the leftmost subset of columns anyway, due to the way it stores
the data. It should be intelligent enough to only compare the second row if the first row succeeds. In the test case,
only there are only 4 rows, so testing the remainder of the AND should be quick.
Have just tested the suggestion. In fact, not having indexes doesn't seem to make any difference (1230/sec vs. 1174/sec),
so I think I am correct that there is an index by default on the leftmost column.
-- Brovnik
| [reply] |
|
| [reply] [d/l] [select] |
|
Re: How to count rows using DBI & mysql - Performance results
by DrZaius (Monk) on Jun 11, 2001 at 19:16 UTC
|
First off, if you want your code at all portable, do not use REPLACE. It is mysql specific (as far as I know).
Next, indexes are great. Try EXPLAINing your queries. That will give you a clue on what is going down.
Finally, make a function to check if the id exists and memioze it.
This is essentially creating a perl cache, but it is transparent to you.
| [reply] [d/l] [select] |
Re: How to count rows using DBI & mysql - Performance results
by Zaxo (Archbishop) on Jun 11, 2001 at 17:43 UTC
|
Declare your unique index to be a primary key. $sth->execute will fail if you try to insert a duplicated primary key.
mysql> create table trial (id varchar(5) primary key, perl_mod varchar
+(30), summary varchar(255));
Query OK, 0 rows affected (0.06 sec)
mysql> insert into trial set id='aaaaa',
> perl_mod='CGI',
> summary='Essential';
Query OK, 1 row affected (0.04 sec)
mysql> insert into trial set id='aaaab',
> perl_mod='DBI',
> summary='Essential';
Query OK, 1 row affected (0.00 sec)
mysql> insert into trial set id='aaaab',
> perl_mod='Lingua__Romana__Perligata',
> summary='Fun';
ERROR 1062: Duplicate entry 'aaaab' for key 1
mysql>
After Compline, Zaxo | [reply] [d/l] |
|
This trick has caused me problems. At one point I had Tie::DBI crashing when it got an error from the POSTGRESQL database saying the the database couldn't insert a row. I don't know why, but I had to work around by doing the SQL myself :(
Also, talking coding style, finding out if you can do something by seeing if it throws an error is probably a worse option than checking for it cleanly in the first place. ____________________
Jeremy
I didn't believe in evil until I dated it.
| [reply] |
|
My point was to demonstrate the property of unique indexes which prevent duplication: a safety measure. I never said to use the error for program control. In fact I didn't exhibit any DBI code at all.
In practice, with the index in place:
my $known_dep=$dbi->prepare_cached("select 1 from depends where usedby
+=? and uses=?");
sub known_dep ($$) {#prototype because we have a fixed number of place
+holders
$known_dep->execute(@_);
$known_dep->rows; # 1 or 0 for unique index.
}
For frequent operations, we can improve performance with the sth->bind_params() method.
We search on an index, using a pre-prepared statement handle with placeholders. That will be fast.
I didn't mention before that NULL in multiindexes is only available in MyISAM tables.
After Compline, Zaxo
| [reply] [d/l] |
|
It is definitly a coding style question, because we do it
all over the place with PL/SQL in Oracle. Though this would
not generate an error, just an exception. We use the same
technique to see if a query a row from cache, if it returns
an exception then we hit the table instead.
| [reply] |
|
| [reply] |
|
But I want to be able to have multiple rows with the same key/column1,
Unique keys prevent that.
-- Brovnik
| [reply] |
|
hding is right. In that case, declare a multicolumn unique index. That does precisely what you are asking for. MySQL will look at the first column first, so it should be the more unique. [OB] A table needs a primary key.
create table depends
(uses varchar(80) not null,
usedby varchar(80) not null,
unique deps (usedby,uses));
# naming the index is unnecessary, but may be a convenience
Btw, why not represent a leaf node by usedby=NULL? That would give a faster select for leaf/branch.
After Compline, Zaxo
| [reply] [d/l] |
|
| [reply] |
|
In this case you should have a one to many foreign key
relationship to a second table. This table would contain
a unique column with all of your modules in it. Insert
into this table, catch the exception if it happens.
| [reply] |
Re: How to count rows using DBI & mysql - Performance results
by elwarren (Priest) on Jun 11, 2001 at 23:06 UTC
|
First let me say that this is not flame bait. I can only
comment on Oracle, but these results should not be
considered useful against any other database besides MySQL.
I'll save the gritty details unless anyone wants to hear
about them.
One issue that I've not seen mentioned here is whether the
database lives on the same machine as the code you're
executing. I'm assuming it does, or you have a very fast
network, or you have a small set of data. This is because
if you were running on another machine your query would
have to return the result set over the network, which could
take a long time depending on the size of your data. One
million rows * 1kB is much larger than a single count of
the rows.
Another optimization you may want to test out would be to
try doing a SELECT COUNT(1) FROM TABLE WHERE BLAH this will
return the same result as COUNT(*) but may run faster
depending on what kind of optimizations are done. This can
be faster because we're not asking the database to return
rows from the table, just the count. An old oracle trick
until they internally optimized COUNT(*)
HTH | [reply] |
Re: How to count rows using DBI & mysql - Performance results
by bart (Canon) on Apr 30, 2004 at 13:13 UTC
|
Not sure whether the do() is portable. The DBI code seems to suggest that some drivers return -1 untill all rows are fetched. Not sure if this would propagate back through the do() result.
I can confirm that it indeed does return -1, with a recent version of DBD::ODBC, and MS Access. It's a trick you should only use very cautiously.
| [reply] |
Re: How to count rows using DBI & mysql - Performance results
by tune (Curate) on Jun 11, 2001 at 18:08 UTC
|
Just a short tip:
The REPLACE function from MYSQL. It checks if the row is existing, and doing an INSERT (if not), or an UPDATE (if it is). However if you do not want the UPDATE feature, it does not help you.
--
tune | [reply] |
|
I tried that, but REPLACE only works on a unique index. I need to be able to have e.g.
COL1 COL2
A B
A C
A D
B C
C D
B D
So neither column is unique.
-- Brovnik | [reply] [d/l] |
|
|