Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number

DBD::SQLite bulk insert woes

by dwalin (Monk)
on Jun 18, 2011 at 12:11 UTC ( #910323=perlquestion: print w/ replies, xml ) Need Help??
dwalin has asked for the wisdom of the Perl Monks concerning the following question:

While I was trying to squeeze the most in means of speed/CPU utilization out of my code, I have stumbled upon a peculiar DBD::SQLite behavior with regards to bulk data insert. I'd like to illustrate it with the following snippet:
#!/usr/bin/perl use strict; use warnings; use 5.14.0; use Benchmark; use DBI; my $dbh = DBI->connect('dbi:SQLite::memory:', '','', { AutoCommit => 1, RaiseError => 1, PrintError +=> 0, ChopBlanks => 1 }) or die DBI->errstr; $dbh->do("CREATE TABLE test_bulk_insert (\n" . (join ",\n", map { "field$_ int" } (1..$ARGV[0])) . ");\n") o +r die $dbh->errstr; my $sth = $dbh->prepare("INSERT INTO test_bulk_insert (\n" . (join ",\n", map { "field$_" } (1..$ARGV[0])) +. ') VALUES (' . (join ",\n", map { '?' } (1..$ARGV[0])) . ')') + or die $dbh->errstr; my @newdata; push @newdata, [] for (1..$ARGV[0]); $sth->bind_param_array($_ + 1, $newdata[$_]) for keys @newdata; my $data; $data .= sprintf(("%-4d"x$ARGV[0])."\n", map { int rand 9999 } (1..$ARGV[0])) for (1..$ARGV[1]); my $template = "A4"x$ARGV[0]; timethese($ARGV[2], { 'new' => sub { my $buf = $data; my ($curr, $prev); while ($buf =~ /\n/g) { $prev = defined $curr ? $curr : 0; $curr = pos $buf; my $_ = substr $buf, $prev, $curr - $prev; chomp; my @cols = unpack $template; push $_, shift @cols for (@newdata); }; eval { $dbh->do("BEGIN IMMEDIATE TRANSACTION"); $dbh->do("DELETE FROM test_bulk_insert"); $sth->execute_array({}, @newdata); $dbh->commit; } or die $@; $#{$newdata[$_]} = -1 for keys @newdata; }, 'old' => sub { my $buf = $data; my ($curr, $prev, @batch); while ($buf =~ /\n/g) { $prev = defined $curr ? $curr : 0; $curr = pos $buf; my $_ = substr $buf, $prev, $curr - $prev; chomp; push @batch, [unpack $template]; }; eval { $dbh->do("BEGIN IMMEDIATE TRANSACTION"); $dbh->do("DELETE FROM test_bulk_insert"); $sth->execute(@$_) for @batch; $dbh->commit; } or die $@; }, }); $dbh->disconnect;
The 'new' routine tries to take advantage of DBI execute_array method but somehow it fails miserably: instead of being faster it's actually twice slower:
dwalin@sol:/tmp$ perl 100 1000 100 Benchmark: timing 100 iterations of new, old... new: 22 wallclock secs (22.10 usr + 0.28 sys = 22.38 CPU) @ 4 +.47/s (n=100) old: 10 wallclock secs ( 9.87 usr + 0.08 sys = 9.95 CPU) @ 10 +.05/s (n=100)
And the cause of slowness is not my code, it's DBD::SQLite. Under Devel::NYTProf it looks most peculiar:

'old' line-by-line insert:

$sth->execute(@$_) for @batch; # spent 10.0ms making 10000 calls to DBI::st::execute, avg 1Ás/call

'new' bulk insert:

$sth->execute_array({}, @newdata); # spent 3.17s making 10 calls to DBI::st::execute_array, avg 317ms/ca +ll # spent 3.17s making 10 calls to DBD::_::st::execute_array, avg 317ms +/call

Something in Devel::NYTProf is visibly affecting execution; 'new' code becomes ten times slower than 'old' but the picture is the same - it's slower. Why is that? I tried to look inside DBD::SQLite but I'm not advanced enough to dig deep in DBI and its drivers.

I tested it with perl-5.14.1 under Solaris 9 x86.


Comment on DBD::SQLite bulk insert woes
Select or Download Code
Re: DBD::SQLite bulk insert woes
by danb (Friar) on Jun 18, 2011 at 19:13 UTC

    I modified your test slightly to run against several other databases: PostgreSQL and MySQL in addition to SQLite. What I found is the same result for all of them: your "new" is slower than "old":

    ./ 100 1000 100 Benchmark: timing 100 iterations of SQLite_new, SQLite_old... SQLite_new: 15 wallclock secs (15.27 usr + 0.07 sys = 15.34 CPU) @ 6 +.52/s (n=100) SQLite_old: 5 wallclock secs ( 5.91 usr + 0.03 sys = 5.94 CPU) @ 16 +.84/s (n=100) Benchmark: timing 100 iterations of PostgreSQL_new, PostgreSQL_old... PostgreSQL_new: 66 wallclock secs (26.55 usr + 1.75 sys = 28.30 CPU) +@ 3.53/s (n=100) PostgreSQL_old: 37 wallclock secs ( 7.67 usr + 1.14 sys = 8.81 CPU) +@ 11.35/s (n=100) Benchmark: timing 100 iterations of MySQL_new, MySQL_old... MySQL_new: 58 wallclock secs (26.45 usr + 1.43 sys = 27.88 CPU) @ 3 +.59/s (n=100) MySQL_old: 30 wallclock secs ( 8.09 usr + 0.85 sys = 8.94 CPU) @ 11 +.19/s (n=100)

    Maybe it's time to check our assumptions. Is it really given that execute_array() is supposed to be faster? If the driver provides an optimized implementation, then I would say yes. The DBI docs say "Drivers may provide more optimized implementations using whatever bulk operation support the database API provides." (emphasis mine). But what about these three databases/drivers, none of which appear to provide such an optimization?

    When execute_array just has to call execute() behind the scenes, is it expected to be 3X slower? As far as I can tell, the only difference *should* be that execute_array() has to copy a large array (of arrayrefs), but that doesn't seem like enough to cause that much slowdown.

    I think more investigation is necessary.

    Here is your test with the modifications I made:


      Well to be frank, I didn't really know what to expect. I just tried something new (to me, at least) in hope that it will be faster - and found that it wasn't. I don't have a clue why passing a bunch of references and running supposedly the same DBI::st::execute in supposedly the same loop runs twice as slow.

      If you can give any tips on improving DBD::SQLite bulk insert performance by other means, I'd really appreciate that. With any other SQL engine I would go with SQL optimizations but SQLite doesn't really give many options for that.


        I also don't know why there is the difference in benchmarks. After converting to Perl 5.10 code for my machine (push $aref.. and keys @array aren't valid until 5.14), I got similar results.

        If you are using a real disk drive and database is not "online", then
            $dbh->do('PRAGMA synchronous = 0');
        sped your 'old' benchmark up from 47 secs to 32 secs on my older model machine. The SQLite app that I'm working on now takes in this range for a 300K row table with mainly varchar's to be created. So I think you are in the "zone of reason" in terms of performance. Since you are mainly using ints, might be interesting to try with numeric values instead of text in your benchmark.

        There is a bulk import facility for SQLite that works with a csv format. I have never used it but, it might be quicker. I don't know where your actual data comes from.

Re: DBD::SQLite bulk insert woes
by Anonymous Monk on Jun 21, 2011 at 03:36 UTC
    When you use SQLite, you M*U*S*T use transactions. If you don't, then SQLite will verify that every disk write actually succeeded before returning. This slows everything to a crawl even on the fastest machines.
      What do you think this does?
      eval { $dbh->do("BEGIN IMMEDIATE TRANSACTION"); $dbh->do("DELETE FROM test_bulk_insert"); $sth->execute(@$_) for @batch; $dbh->commit; } or die $@;
      No it doesn't. You're confusing disk writes with commits which are actually quite different things. I would suggest reading this, this and this articles to gain better understanding of how SQLite transactions and disk writes work.

      In this case it doesn't matter anyway because the database is held in memory and there is no delay due to disk. It's a test focused on DBD::SQLite performance rather than SQLite itself.


Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://910323]
Approved by derby
Front-paged by derby
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (6)
As of 2014-07-26 02:28 GMT
Find Nodes?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:

    Results (175 votes), past polls