Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris

Re: Database input speed question

by dga (Hermit)
on Jul 31, 2003 at 17:10 UTC ( #279664=note: print w/replies, xml ) Need Help??

in reply to Database input speed question

Another thing that I didn't see mentioned is that if you can't use a bulk loader (computed fields etc.), then if you can use a transaction and do the inserts then commit. If you have to use DBI to insert lots of data this will be a big time savings since the statement preparation only happens one time instead of once per record.

#... script starting stuff up here my $dbh=DBI->connect("DBI:Pg(RaiseError=>1, AutoCommit=>0):dbname=$dat +abase"); my $stmt="INSERT INTO table ( field1, field2 ) VALUES ( ?, ? )"; eval { $dbh->rollback; my $sth=$dbh->prepare($stmt); while(<INPUT>) { my($field1, $field2)=split; $sth->execute($field1, $field2); } $dbh->commit; }; if($@) { print STDERR "Data did not insert correctly: $@"; $dbh->rollback; }

This does 2 things: The inserts go a lot faster. The inserts go in as a group or not at all which makes cleaning up after a failure a lot easier.

The :Pg loads up a PostgreSQL connection which I used in this example since that's the database I use most.

The RaiseError=>1 turns failures in DBI calls into fatal errors which the eval traps and reports on.

The AutoCommit=>0 tells DBI not to commit records until you call commit();

Replies are listed 'Best First'.
Re: Re: Database input speed question
by mpeppler (Vicar) on Jul 31, 2003 at 18:16 UTC
    I was going to suggest something similar, but there are some issues with this as well.

    First, one must make sure to still commit at regular intervals. The original poster mentioned 10M rows (I'm guessing 10 million) - you'd have to have a rather large transaction log if you're going to batch all of these rows in a single transaction.

    Second, batching may not necessarily gain you all that much - certainly with Sybase the total amount of work that the database needs to do whether you batch transactions or not is the same - the only real difference is that rows are only committed to the "real" table when the server gets the COMMIT TRAN command instead of placing each row there individually (i.e. the total amount of disk IO is the same).


      When in doubt, Benchmark

      I made up the following benchmark to see the effect of AutoCommit on inserting rows in bulk.

      use strict; use warnings; use Benchmark qw( cmpthese ); use DBI; my $dbc=DBI->connect('DBI:Pg(RaiseError=>1,AutoCommit=>0):dbname=...') +; my $dba=DBI->connect('DBI:Pg(RaiseError=>1,AutoCommit=>1):dbname=...') +; cmpthese ( 10, { 'ac' => sub { &inserts($dba, 'auto', 1000, 0 ) }, 'mc' => sub { &inserts($dbc, 'manual', 1000, 1 ) }, }); sub inserts { my($dbh, $table, $rows, $commit)=@_; my $stmt=qq[ INSERT INTO $table ( id, val ) VAlUES ( ?, ? ) ]; eval { $dbh->rollback if($commit); my $sth=$dbh->prepare($stmt); foreach my $row ( 1..$rows ) { $sth->execute( $row, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdef +ghijklmnopqrstuvwxyz/' ); } $dbh->commit if($commit); }; if($@) { print STDERR "Insert Failed: $stmt: $@"; $dbh->rollback; die "Program Terminated"; } return; } END { $dbc->disconnect; $dba->disconnect };

      Here are some results.

      #run 1 Benchmark: timing 10 iterations of ac, mc... ac: 33 wallclock secs ( 0.33 usr + 0.56 sys = 0.89 CPU) @ 11.24/s ( +n=10) mc: 3 wallclock secs ( 0.38 usr + 0.35 sys = 0.73 CPU) @ 13.70/s ( +n=10) #run 2 Benchmark: timing 10 iterations of ac, mc... ac: 37 wallclock secs ( 0.41 usr + 0.81 sys = 1.22 CPU) @ 8.20/s ( +n=10) mc: 4 wallclock secs ( 0.37 usr + 0.50 sys = 0.87 CPU) @ 11.49/s ( +n=10) #run 3 Benchmark: timing 10 iterations of ac, mc... ac: 38 wallclock secs ( 0.48 usr + 0.60 sys = 1.08 CPU) @ 9.26/s ( +n=10) mc: 4 wallclock secs ( 0.38 usr + 0.40 sys = 0.78 CPU) @ 12.82/s ( +n=10)

      Note that I am comparing wall clock time since the perl code has very little to do. I ran 3 runs so that a representative sample could be obtained. This is running against PostgreSQL as the backend on the local host so there is minimal communication overhead.

      Committing after each 1000 rows in this test consistantly yields a 10 fold increase in speed over using AutoCommit. As usual YMMV and will certainly vary if you use a different database engine. Also note that using the bulk data importer from a text file containing the same data takes less than 1 second to complete while running 1 insert with 1 commit for 10000 rows takes about 3 seconds.

      The data set size in this test is only 663k of data. I am estimating that a significant portion of the time difference is that when commit returns, the database pledges that the data has been written to durable media. So for the manual commits this happens 10 times whereas for the AutoCommit this occurs 10000 times. If that were all the variability then manual commit would be 1000 times faster instead of 10 times so the actual writing of the data constitutes a big portion of the time and that, as mentioned, is the same for any approach.

        When in doubt, Benchmark
        Good point.

        I'll try to run that next week with Sybase ASE.

        Note that as with all things of this type a lot depends on the actual data being loaded, whether there are any triggers, indexes, etc...


        A quick run of that code on my laptop (RH 7.3, Athlon 1.something GHz, Sybase ASE shows that I was wrong, and that batched commits almost certainly improve the performance for large inserts:
        ac: 129 wallclock secs ( 2.77 usr + 0.40 sys = 3.17 CPU) @ 3.15/s ( +n=10) mc: 14 wallclock secs ( 2.05 usr + 0.40 sys = 2.45 CPU) @ 4.08/s (n +=10)
        This is with no indexes on the tables.

        Still, it's an interesting problem - I'll have to try this on some more serious iron, and with more data, and, ideally, with raw devices rather than disk devices that have O_DSYNC turned on.


Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://279664]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (2)
As of 2021-06-19 09:27 GMT
Find Nodes?
    Voting Booth?
    What does the "s" stand for in "perls"? (Whence perls)

    Results (91 votes). Check out past polls.