http://www.perlmonks.org?node_id=87439

Background

This is rather long, but, I hope helpful.
As part of developing a (DBI/MySQL) database to store info about modules (more @ Module Dependencies), I needed to repeatedly check if a certain row already exists in the Database in order to avoid putting another identical row in.

Each row is a tuple showing a "this module uses that module" relationship.

I wanted to do some performance testing to find out the fastest way of discovering if there is a matching row already in the DB.

Aside: I can't find the equivalent of an efficient "INSERT ... IF NOT EXISTS" in MySQL.

The code below shows the different ways I found to get the result.

Conclusions

The fastest way to do it is :

my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module' AND uses='DBI'"; my $rows = $dbh->do($sql);
This may not seem surprising !
  1. using WHERE A='X' AND B='Y' is faster than WHERE A='X'.
  2. using COUNT(*) is very slow
  3. do() is more than twice as fast as prepare()/execute()
  4. Doesn't matter which column SELECT returns to do counting
  5. SELECT col1 is slightly faster than SELECT *
I found the first 3 of these surprising. In particular, I thought that do() was just a wrapper round prepare/execute, so the time difference is significant.

Other optimisations and comments

  1. Store IDs, rather than names in the DB. (Not tried - Would prob. be quicker, but at the expense of more lookup code and complication.)
  2. Cache results in PERL. (Not tried - Likely to be quicker, but less "pure" and some complexity in forward references.)
  3. Store "uses" relationship as a joined string in the module table and use PERLto manage. (Tried - was quicker, but again, not DB pure)
  4. Not tried on other DBD drivers
  5. The DB was small enough to fit in memory, and machine had no load, no no disk accesses during tests.
  6. I don't know if there are any internal optimisations happening (i.e. reusing cached results)
  7. I was less interested in the fact that some of the tests returned the 'wrong' result than the time taken, clearly in a full solution you need the correct answer :-)
  8. Not sure whether the do() is portable. The DBI code seems to suggest that some drivers return -1 untill all rows are fetched. Not sure if this would propagate back through the do() result.

Test Code

#! /usr/local/bin/perl -w #use strict; # because of the sub references use File::Find; use DBI(); use Depend::Module; use Date::Manip; use Benchmark; ### Table created with ## CREATE TABLE depend (\ ## usedby VARCHAR(80) NOT NULL ,\ ## uses VARCHAR(80) NOT NULL \ ## ) my $dbh = DBI->connect("DBI:mysql:database=modules;host=localhost", 'root', 'blah', { 'RaiseError' => 1 }); for ('Way1A','Way2A','Way2B','Way3A','Way4A','Way4B','Way5A','Way5B',' +Way6A') { my $ret = &$_; print "$_ returns [$ret]\n"; } my $res = timethese(10000, { 'Way1A'=>\&Way1A, 'Way2A'=>\&Way2A, 'Way3A'=>\&Way3A, 'Way4A'=>\&Way4A, 'Way4B'=>\&Way4B, 'Way5A'=>\&Way5A, 'Way6A'=>\&Way6A, 'Way2B'=>\&Way2B, 'Way5B'=>\&Way5B, }); Benchmark::cmpthese( $res) ; exit; ######## Subs sub Way1A { # Incorrect result, Only check one field # Doesn't count second field in this routine # uses prepare and execute my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module'"; my $sth = $dbh->prepare($sql); my $ret = $sth->execute; return $sth->rows; } sub Way2A { # Incorrect result, Only check one field # Doesn't count second field in this routine # uses do my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module'"; return $dbh->do($sql); } sub Way2B { # Incorrect result, Only check one field # Doesn't count second field in this routine # uses do # SELECT returns * == both columns my $sql = "SELECT * FROM depend WHERE usedby='Depend::Module'"; return $dbh->do($sql); } sub Way3A { # Generates correct result # uses prepare and execute my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module' AN +D uses='DBI'"; my $sth = $dbh->prepare($sql); my $ret = $sth->execute; return $sth->rows; } sub Way4A { # Generates correct result # uses do my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module' AN +D uses='DBI'"; my $ret = $dbh->do($sql); return $ret; } sub Way4B { # Generates correct result # uses do # counts other column my $sql = "SELECT usedby FROM depend WHERE usedby='Depend::Module' +AND uses='DBI'"; my $ret = $dbh->do($sql); return $ret; } sub Way5A { # Fetch each row manually and do our own count # Incorrect result, only checks first field my $count = 0; my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module'"; my $sth = $dbh->prepare($sql); my $ret = $sth->execute; while ($ret = $sth->fetch()) { $count++}; return $count; } sub Way5B { # Fetch each row manually and do our own count # Correct result my $count = 0; my $sql = "SELECT uses FROM depend WHERE usedby='Depend::Module' AN +D uses='DBI'"; my $sth = $dbh->prepare($sql); my $ret = $sth->execute; while ($ret = $sth->fetch()) { $count++}; return $count; } sub Way6A { # use COUNT(*) to do counting # Correct result my $sql = "SELECT COUNT(*) FROM depend WHERE usedby='Depend::Module +' AND uses='DBI'"; my $sth = $dbh->prepare($sql); my $rst = $sth->execute; my @count = $sth->fetchrow_array(); return $count[0]; } __DATA__ Results using 10,000 iterations Rate Way5A Way6A Way1A Way5B Way3A Way2B Way2A Way4B Way4A Way5A 1085/s -- -9% -10% -13% -13% -52% -53% -65% -65% Way6A 1190/s 10% -- -1% -5% -5% -48% -48% -61% -61% Way1A 1208/s 11% 1% -- -3% -4% -47% -48% -61% -61% Way5B 1250/s 15% 5% 4% -- -0% -45% -46% -59% -59% Way3A 1252/s 15% 5% 4% 0% -- -45% -46% -59% -59% Way2B 2268/s 109% 90% 88% 81% 81% -- -2% -26% -26% Way2A 2304/s 112% 94% 91% 84% 84% 2% -- -25% -25% Way4B 3077/s 184% 158% 155% 146% 146% 36% 34% -- 0% Way4A 3077/s 184% 158% 155% 146% 146% 36% 34% 0% --

--
Brovnik

Replies are listed 'Best First'.
Re: How to count rows using DBI & mysql - Performance results
by busunsl (Vicar) on Jun 11, 2001 at 14:59 UTC
    Some things that might have an influence on the benchmarks, depending on the RDBMS:

    If you just test for existence, you can use the exists function in a bit of sql-code:

    if not exists (select * from depend where usedby='Depend::Module' AND +uses='DBI') begin insert .... end

    An index on the columns in where clause might have a huge impact if you use count(*).

      Useful comment. I hadn't thought of getting SQL to do the logic.

      An index on the columns in where clause might have a huge impact if you use count(*).

      I believe that MySQL creates an index on the leftmost subset of columns anyway, due to the way it stores the data. It should be intelligent enough to only compare the second row if the first row succeeds. In the test case, only there are only 4 rows, so testing the remainder of the AND should be quick.

      Have just tested the suggestion. In fact, not having indexes doesn't seem to make any difference (1230/sec vs. 1174/sec), so I think I am correct that there is an index by default on the leftmost column.
      --
      Brovnik

        Most RDBMS won't even consider using an index on a table with only 4 rows due to the overhead in I/O and processing.

        I created the table in a Sybase database, filled it with about 2500 rows with my db-stress tool and tested against it (with only 1000 iterations).

Re: How to count rows using DBI & mysql - Performance results
by DrZaius (Monk) on Jun 11, 2001 at 19:16 UTC
    First off, if you want your code at all portable, do not use REPLACE. It is mysql specific (as far as I know).

    Next, indexes are great. Try EXPLAINing your queries. That will give you a clue on what is going down.

    Finally, make a function to check if the id exists and memioze it. This is essentially creating a perl cache, but it is transparent to you.

Re: How to count rows using DBI & mysql - Performance results
by Zaxo (Archbishop) on Jun 11, 2001 at 17:43 UTC

    Declare your unique index to be a primary key. $sth->execute will fail if you try to insert a duplicated primary key.

    mysql> create table trial (id varchar(5) primary key, perl_mod varchar +(30), summary varchar(255)); Query OK, 0 rows affected (0.06 sec) mysql> insert into trial set id='aaaaa', > perl_mod='CGI', > summary='Essential'; Query OK, 1 row affected (0.04 sec) mysql> insert into trial set id='aaaab', > perl_mod='DBI', > summary='Essential'; Query OK, 1 row affected (0.00 sec) mysql> insert into trial set id='aaaab', > perl_mod='Lingua__Romana__Perligata', > summary='Fun'; ERROR 1062: Duplicate entry 'aaaab' for key 1 mysql>

    After Compline,
    Zaxo

      This trick has caused me problems. At one point I had Tie::DBI crashing when it got an error from the POSTGRESQL database saying the the database couldn't insert a row. I don't know why, but I had to work around by doing the SQL myself :(

      Also, talking coding style, finding out if you can do something by seeing if it throws an error is probably a worse option than checking for it cleanly in the first place.

      ____________________
      Jeremy
      I didn't believe in evil until I dated it.

        My point was to demonstrate the property of unique indexes which prevent duplication: a safety measure. I never said to use the error for program control. In fact I didn't exhibit any DBI code at all.

        In practice, with the index in place:

        my $known_dep=$dbi->prepare_cached("select 1 from depends where usedby +=? and uses=?"); sub known_dep ($$) {#prototype because we have a fixed number of place +holders $known_dep->execute(@_); $known_dep->rows; # 1 or 0 for unique index. }

        For frequent operations, we can improve performance with the sth->bind_params() method.

        We search on an index, using a pre-prepared statement handle with placeholders. That will be fast.

        I didn't mention before that NULL in multiindexes is only available in MyISAM tables.

        After Compline,
        Zaxo

        It is definitly a coding style question, because we do it all over the place with PL/SQL in Oracle. Though this would not generate an error, just an exception. We use the same technique to see if a query a row from cache, if it returns an exception then we hit the table instead.

      If adopting this strategy, why not simply make it a unique index, as it may in fact not be the primary key of the table?

      But I want to be able to have multiple rows with the same key/column1,
      Unique keys prevent that.
      --
      Brovnik

        hding is right. In that case, declare a multicolumn unique index. That does precisely what you are asking for. MySQL will look at the first column first, so it should be the more unique. [OB] A table needs a primary key.

        create table depends (uses varchar(80) not null, usedby varchar(80) not null, unique deps (usedby,uses)); # naming the index is unnecessary, but may be a convenience

        Btw, why not represent a leaf node by usedby=NULL? That would give a faster select for leaf/branch.

        After Compline,
        Zaxo

        Then why do you need to do an "INSERT ... IF NOT EXISTS"? (Note that if you're trying to enforce the uniqueness of some combination of columns, you can make the index on all of the columns; maybe this is what you mean.)

        In this case you should have a one to many foreign key relationship to a second table. This table would contain a unique column with all of your modules in it. Insert into this table, catch the exception if it happens.
Re: How to count rows using DBI & mysql - Performance results
by elwarren (Priest) on Jun 11, 2001 at 23:06 UTC
    First let me say that this is not flame bait. I can only comment on Oracle, but these results should not be considered useful against any other database besides MySQL. I'll save the gritty details unless anyone wants to hear about them.

    One issue that I've not seen mentioned here is whether the database lives on the same machine as the code you're executing. I'm assuming it does, or you have a very fast network, or you have a small set of data. This is because if you were running on another machine your query would have to return the result set over the network, which could take a long time depending on the size of your data. One million rows * 1kB is much larger than a single count of the rows.

    Another optimization you may want to test out would be to try doing a SELECT COUNT(1) FROM TABLE WHERE BLAH this will return the same result as COUNT(*) but may run faster depending on what kind of optimizations are done. This can be faster because we're not asking the database to return rows from the table, just the count. An old oracle trick until they internally optimized COUNT(*)

    HTH
Re: How to count rows using DBI & mysql - Performance results
by bart (Canon) on Apr 30, 2004 at 13:13 UTC
    Not sure whether the do() is portable. The DBI code seems to suggest that some drivers return -1 untill all rows are fetched. Not sure if this would propagate back through the do() result.
    I can confirm that it indeed does return -1, with a recent version of DBD::ODBC, and MS Access. It's a trick you should only use very cautiously.
Re: How to count rows using DBI & mysql - Performance results
by tune (Curate) on Jun 11, 2001 at 18:08 UTC
    Just a short tip:
    The REPLACE function from MYSQL. It checks if the row is existing, and doing an INSERT (if not), or an UPDATE (if it is). However if you do not want the UPDATE feature, it does not help you.

    --
    tune

      I tried that, but REPLACE only works on a unique index. I need to be able to have e.g.
      COL1 COL2 A B A C A D B C C D B D
      So neither column is unique.
      --
      Brovnik