DBI: How to update 150K records efficiently

by lihao (Monk)
on Mar 31, 2008 at 13:51 UTC
lihao has asked for the wisdom of the Perl Monks concerning the following question:

Hi, monks

In my Perl code, I am using DBIx::Simple to connect to my database. To update a bunch of records saved in a file, I just loop through each line of the file and update the records one by one.

..[skip]..

while (my $record = <$input>) {
    chomp $record;
    my ($val1, $val2, $val3) = (split(/\|/, $record))[1,3,5];
    my %data = (
        va => $val1,
        vb => $val2,
    );
    my %where = (
        vc => $val3,
    );
    $dbh->update('myTable', \%data, \%where);
}

This runs quite slowly. Is there a better way, in DBIx::Simple or any other easy-to-use DBI approach, to update this many records quickly (i.e. 150K records, in a table of 200K records total)? Thanks.

lihao

Re: DBI: How to update 150K records efficiently
by moritz (Cardinal) on Mar 31, 2008 at 14:10 UTC
    If you haven't benchmarked it yet, try this:
    my $sth = $dbh->prepare('UPDATE myTable SET va = ?, vb = ? WHERE vc = ?');
    while (my $record = <$input>) {
        chomp $record;   # so the trailing newline doesn't end up in the last field
        my @values = (split(/\|/, $record))[1,3,5];
        $sth->execute(@values);
    }
    $sth->finish();

    At least it avoids generating hashes for everything, and it uses a cached "prepared statement".

      Oh, but the OP's code also uses a cached prepared statement. DBIx::Simple handles this internally. The problem is that before it can access the cache, it has to generate the query string all over again :)

      Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

Re: DBI: How to update 150K records efficiently
by Juerd (Abbot) on Mar 31, 2008 at 14:15 UTC

    A likely significant slowdown is the SQL abstraction done by SQL::Abstract. By including this in the loop, you make SQL::Abstract do its work once for each record.

    The solution would be to get back to basics by writing the SQL yourself.

    $db->query("UPDATE myTable SET va = ?, vb = ? WHERE vc = ?", $val1, $val2, $val3);

    By the way, you should not call your DBIx::Simple object $dbh. That is the common name for DBI objects, and for objects of classes that inherit from DBI. Using the name $dbh for DBIx::Simple objects is kind of confusing. I suggest calling it $db instead.

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

      Thanks all for the helpful replies. :-) I've adjusted my code to:

      my $sql = qq[
          UPDATE `myTable`
          SET    `va` = ?,
                 `vb` = ?
          WHERE  `vc` = ?
      ];
      while (my $record = <$input>) {
          chomp $record;
          my ($val1, $val2, $val3) = (split(/\|/, $record))[1,3,5];
          $db->query($sql, $val1, $val2, $val3);
      }

      My question is: can we get the same 'prepare' effect that moritz mentioned in his post, given that his 'prepare' is done before the while loop? :)

      BTW, thank you for your module; I've used it in most of my applications.

      Best,
      lihao

        My question is: can we get the same 'prepare' effect that moritz mentioned in his post, given that his 'prepare' is done before the while loop? :)

        That is done automatically. DBIx::Simple recognises that the query is the same each time, and re-uses the old statement handle that represents the already compiled query.
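
        The size of that cache is tunable, by the way. A minimal sketch, assuming the documented keep_statements attribute (connection details and the value 32 are only illustrative):

            use strict;
            use warnings;
            use DBIx::Simple;

            # Connection details are placeholders.
            my $db = DBIx::Simple->connect('dbi:mysql:mydb', 'user', 'pass')
                or die DBIx::Simple->error;

            # DBIx::Simple caches statement handles keyed on the query string;
            # keep_statements sets how many handles are kept (default 16).
            $db->keep_statements = 32;

            my $sql = 'UPDATE myTable SET va = ?, vb = ? WHERE vc = ?';

            # The query string is identical on every call, so the statement is
            # prepared once and the cached handle is re-used afterwards.
            $db->query($sql, @$_) for ['a1', 'b1', 'c1'], ['a2', 'b2', 'c2'];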

        Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

Re: DBI: How to update 150K records efficiently
by herveus (Parson) on Mar 31, 2008 at 14:22 UTC
    Howdy!

    Doing transactions that contain multiple updates will buy you some speed, in my experience. You should do some testing to find out how much benefit you get from various batch sizes.

    I'm not immediately able to sort out whether that call to update uses a prepared and cached statement. A prepared statement with placeholders will probably be faster than preparing the statement each time.

    Are you always updating the one table, or is the code simply an example?

    Assuming it's not just illustrative, I'd do:

    my $batchsize = 20;   # tune this value
    my $count     = 0;
    my $sth = $dbh->prepare('update myTable set va = ?, vb = ? where vc = ?');
    $dbh->begin_work;   # or however you start a transaction; don't take my word on this line
    while (my $record = <$input>) {
        chomp $record;
        my ($va, $vb, $vc) = (split(/\|/, $record))[1, 3, 5];
        $sth->execute($va, $vb, $vc);
        $count += 1;
        if ($count % $batchsize == 0) {
            $dbh->commit;   # doublecheck the commit statement too
            $dbh->begin_work;
        }
    }
    $dbh->commit;   # don't forget the final partial batch
    Now you can experiment with different batch sizes to see how it speeds up.

    yours,
    Michael
Re: DBI: How to update 150K records efficiently
by jhourcle (Prior) on Mar 31, 2008 at 15:22 UTC

    You might want to look into the documentation for your database, as there's likely a bulk-loading ability to handle this sort of replacement.

    In Oracle, it's the command 'sqlldr'. For mysql, use the 'LOAD DATA LOCAL INFILE' command with the 'REPLACE' keyword. (Note: removing the LOCAL keyword will likely result in a permissions failure.)

    There are also tricks that you can use when handling a table of this size -- I copy the table, drop all indexes, load the data, then recreate the indexes and replace the original table. This saves the indexes from being updated on each insertion. With Oracle, you can defer constraints, so they're checked en masse (but if you screw up, you have to roll back the entire transaction).

    For comparison ... I updated a mysql table this morning (6.2 million records), using 'LOAD DATA', in under 7.5 minutes.

    In your example, if your primary key is 'vc', you should be able to load it as a pipe-delimited file.
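
    A sketch of how that load could look from DBI (the file name and connection details are placeholders, the field layout mirrors the OP's (split ...)[1,3,5], and mysql_local_infile is, as far as I know, the DBD::mysql flag that permits LOCAL loads):

        use strict;
        use warnings;
        use DBI;

        # mysql_local_infile=1 lets DBD::mysql send a client-side file
        # to the server; connection details are placeholders.
        my $dbh = DBI->connect(
            'dbi:mysql:database=mydb;mysql_local_infile=1',
            'user', 'pass', { RaiseError => 1 },
        );

        # REPLACE deletes and re-inserts any row whose unique key (vc)
        # already exists, so columns not present in the file are reset
        # to their defaults -- fine only if the file holds complete
        # replacement rows.  The @skip variable discards the unused
        # fields of each pipe-delimited line.
        $dbh->do(q{
            LOAD DATA LOCAL INFILE 'records.txt'
            REPLACE INTO TABLE myTable
            FIELDS TERMINATED BY '|'
            (@skip, va, @skip, vb, @skip, vc)
        });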

      If I'm not mistaken, the use of LOAD DATA INFILE (or oracle "sqlldr") is only suitable for doing insertions into a table -- it doesn't do updates. (But maybe I'm mistaken?)

      In any case, since the OP is clearly talking about UPDATE, I think the "REPLACE" keyword would definitely not be good, unless the input to the process was in fact a complete set of replacement records for the table. And even then it might still be a really bad idea, if the table involves autoincrement primary keys that are involved in foreign key relations elsewhere. (Because reloading the whole table is apt to assign a completely different set of autoincrement keys.) Again, if I am wrong, I'd love to know...

        Although the primary use is insertion, they can be used for updates ... mysql directly, oracle in a more round-about way.

        The 'REPLACE' keyword in mysql's 'LOAD DATA' command is record-by-record, based on the primary key of the table. If you're using autoincrement for your primary key, this isn't going to work for you. In the example given, however, there's a WHERE clause using the field 'vc', which I assumed was the primary key, or at least a unique index.

        You're right that sqlldr doesn't handle this case. (Its REPLACE keyword will remove the entire table, as will TRUNCATE; the only other option for tables w/ existing data is 'APPEND'.) However, you can use sqlldr to get the data into a temp table, and then use a PL/SQL command to replace the records as needed (or insert those rows from the temp table that don't already have a corresponding record). You'd have to benchmark it to see what's the best method for your specific situation.
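
        A rough sketch of that round-about route (all names are assumptions: a staging table myTable_stage already filled by sqlldr, folded into the live table here with a MERGE, which is one way to do the replace-or-insert step):

            use strict;
            use warnings;
            use DBI;

            # Connection details are placeholders; AutoCommit is off so the
            # merge and the commit form a single transaction.
            my $dbh = DBI->connect('dbi:Oracle:mydb', 'user', 'pass',
                                   { RaiseError => 1, AutoCommit => 0 });

            # Assumes myTable_stage has the same va/vb/vc columns as myTable.
            # MERGE updates the rows whose key (vc) already exists and inserts
            # the rest, all in one server-side statement.
            $dbh->do(q{
                MERGE INTO myTable t
                USING myTable_stage s
                ON (t.vc = s.vc)
                WHEN MATCHED THEN
                    UPDATE SET t.va = s.va, t.vb = s.vb
                WHEN NOT MATCHED THEN
                    INSERT (va, vb, vc) VALUES (s.va, s.vb, s.vc)
            });
            $dbh->commit;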

        If you're going to be designing a table that requires regular bulk updates, I'd highly recommend finding a suitable key (even if it's a composite key) and not using a sequence or autoincrement.

Re: DBI: How to update 150K records efficiently
by sundialsvc4 (Abbot) on Mar 31, 2008 at 20:46 UTC

    I'd summarize my thoughts this way (most are redundant):

    1. Sometimes the abstractions provided by a class are good and handy; sometimes “desperate times call for desperate measures” and this is a good example of that. What you have done looks fine.
    2. Using a prepared query is definitely helpful.
    3. It is wise to code any such update-routine defensively, even at the expense of speed. You are, after all, about to update 3/4 of the entire table!
    4. If your database-manager supports it, transactions can be very helpful: start a transaction, do a few hundred updates, then commit, then start a new transaction. (Don't forget to commit the last one.) Study “transaction isolation levels.” Consider how much concurrent table-activity you can... or should... tolerate.
    5. Since you are, indeed, updating such a large percentage of the table, and such a large table, what if you did things very differently?
      • Upload the information in-quantity to a different table by whatever means.
      • Then, do a transaction on the server which orders it to empty the first table, insert into it from the new table, then commit.
      Bang! Now all the updates happen, when they happen, entirely on the server. Or maybe you can actually DROP the old table and rename the temporary one so that it becomes permanent; a sketch of that swap follows below. “Desperate times...”
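
      A minimal sketch of that swap (assuming MySQL, where RENAME TABLE performs both renames atomically; table names are illustrative):

          use strict;
          use warnings;
          use DBI;

          # Connection details are placeholders.
          my $dbh = DBI->connect('dbi:mysql:mydb', 'user', 'pass',
                                 { RaiseError => 1 });

          # Build a work table with the same definition as the live one,
          # then bulk-load the replacement rows into it by whatever means.
          $dbh->do('CREATE TABLE myTable_new LIKE myTable');
          # ... load myTable_new here (LOAD DATA, batched INSERTs, etc.) ...

          # Swap the tables in one atomic statement, so readers never see
          # an empty table; then discard the old data.
          $dbh->do('RENAME TABLE myTable TO myTable_old, myTable_new TO myTable');
          $dbh->do('DROP TABLE myTable_old');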

    No matter how you choose to do it, this is going to be a “10,000 pound elephant” operation. It's going to be a whole lot slower than a lot of the things you're doing, and when it sits down or goes anywhere you're going to feel the floorboards shake.

Re: DBI: How to update 150K records efficiently
by graff (Chancellor) on Apr 01, 2008 at 05:01 UTC
    You may want to check into controlling the AutoCommit attribute of the DBI connection. (This relates to the notion of transaction handling mentioned above.) I know from personal experience on a mirrored MySQL server that having AutoCommit off, and issuing a commit(), say, every thousand updates or so instead of on every update, can make a major difference in speed.
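
    A minimal sketch of that pattern with plain DBI (the file name and the batch size of 1000 are purely illustrative):

        use strict;
        use warnings;
        use DBI;

        # Connection details are placeholders; AutoCommit => 0 makes DBI
        # keep a transaction open until we commit explicitly.
        my $dbh = DBI->connect('dbi:mysql:mydb', 'user', 'pass',
                               { RaiseError => 1, AutoCommit => 0 });

        my $sth = $dbh->prepare('UPDATE myTable SET va = ?, vb = ? WHERE vc = ?');

        open my $input, '<', 'records.txt' or die $!;
        my $count = 0;
        while (my $record = <$input>) {
            chomp $record;
            $sth->execute( (split /\|/, $record)[1, 3, 5] );
            # Commit every thousand rows instead of once per row.
            $dbh->commit unless ++$count % 1000;
        }
        $dbh->commit;   # commit the final partial batch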
