http://www.perlmonks.org?node_id=1054929

Endless has asked for the wisdom of the Perl Monks concerning the following question:

It was suggested to me that I use a database to help prune out duplicate entries (based on 12-digit id numbers) in data I'm processing. I had previously been using a hash, but it starts to choke by 55 million entries. So I fired up DBI::SQLite to give it a try. And I was shocked by what I found.

I simply must have something wrong, because inserting into an sqlite database seems to be occurring at a rate of only 5 per second for me; just writing raw to hashes was going between 11,000 and 20,000 per second. This is unbelievably, unusably slow. What is going on?

#!/usr/bin/perl
use strict;
use warnings;
use v5.16.0;
use lib 'lib';
use DBI;
use Carp 'croak';

my $dbh = db_handle('vals.db');

my $sql_table = <<"SQL";
CREATE TABLE IF NOT EXISTS data (
    klout INTEGER
);
SQL
$dbh->do($sql_table);

my $sql_insert_statement = "INSERT INTO data (klout) VALUES (?)";
my $sth = $dbh->prepare($sql_insert_statement);

my $start = time;
foreach (1..1000) {
    $sth->execute($_);
}
say "Total time: ", (time - $start);    # 180 seconds

sub db_handle {
    my $db_file = shift or croak "db_handle() requires a database name";
    no warnings 'once';
    return DBI->connect(
        "dbi:SQLite:dbname=$db_file",
        "",    # no username
        "",    # no password
        { RaiseError => 1, PrintError => 0, AutoCommit => 1 },
    ) or die $DBI::errstr;    # note: $DBI::errstr, not $DBH::errstr
}

Replies are listed 'Best First'.
Re: DBI::SQLite slowness
by Cristoforo (Curate) on Sep 20, 2013 at 02:19 UTC
    It is slow because you have AutoCommit set to 1. It is committing for every insert. Just change that to 0 and $dbh->commit; after the foreach loop.
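    A minimal sketch of that change, using the database, table and statement from the original script:

    use strict;
    use warnings;
    use DBI;

    # Connect with AutoCommit disabled so all the inserts land in one transaction.
    my $dbh = DBI->connect(
        "dbi:SQLite:dbname=vals.db", "", "",
        { RaiseError => 1, PrintError => 0, AutoCommit => 0 },
    ) or die $DBI::errstr;

    my $sth = $dbh->prepare("INSERT INTO data (klout) VALUES (?)");
    foreach (1..1000) {
        $sth->execute($_);
    }
    $dbh->commit;    # one commit for the whole batch instead of one per insert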
      Brilliant! With that little fix, my speed is up to 2022 per second; that's almost workable, and I understand what was happening. Now time to start looking through the other suggestions.
        This is what I get on an Atom Eee PC (1.6 GHz), after I removed
        use v5.16.0;
        and changed
        say "Total time: ", (time - $start); # 180 seconds
        to
        print "Total time: ", (time - $start); # 180 seconds
        time perl db.pl
        Total time: 5

        real    0m5.348s
        user    0m0.360s
        sys     0m0.820s
        marica.fr: Management of contracts, litigation files and insurance claims

        Well, 200 million records at a rate of 2,000 per second is still 100,000 seconds, or almost 28 hours. That's still pretty long, isn't it? Having said that, you may be able to live with it; a full day of processing is still manageable in a number of cases. Beware, though, that the rate might slow down as your database grows larger.

        If you are really only looking for filtering out duplicates, the ideas discussed by BrowserUk are probably much better than using a database.

Re: DBI::SQLite slowness
by BrowserUk (Patriarch) on Sep 20, 2013 at 03:43 UTC

    Assumption: this data you are de-duping is downloaded fresh, daily from TwitFace.

    The idea of loading 180 million records into a db on disk in order to de-dup it is ridiculous if you are in any way concerned with speed.

    The following shows a 10-line perl script de-duping a 200-million line, 2.8 GB file of 12-digit numbers in a little over 2 1/2 minutes, using less than 30 MB of ram to do so:

    C:\test>dir 1054929.dat
    20/09/2013  04:22     2,800,000,000 1054929.dat

    C:\test>wc -l 1054929.dat
    200000000 1054929.dat

    C:\test>head 1054929.dat
    100112321443
    100135127486
    100110839892
    100098464584
    100098900542
    100048844759
    100090430059
    100018238859
    100132791659
    100027638968

    C:\test>1054929 1054929.dat | wc -l
    1379647642.87527
    1379647855.6311
    113526721

    That's processing the 12-digit numbers at a rate of just under 1 million per second.

    You cannot even load the data into the DB at 1/100th of that rate, never mind get the de-duped data back out.
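    The actual 10-line script isn't shown here, but for the curious, here is a minimal sketch of one bit-vector approach that fits those speed and memory figures. It assumes (my assumption, based on the sample above) that the ids all fall in a bounded range of roughly 100,000,000,000 to 100,200,000,000, so one bit per possible id costs about 25 MB:

    #!/usr/bin/perl
    # Hypothetical dedup sketch: one bit per possible id in an assumed bounded range.
    use strict;
    use warnings;

    my $BASE  = 100_000_000_000;        # assumed lower bound of the id range
    my $RANGE = 200_000_000;            # assumed span of the range

    my $seen = "\0" x ( $RANGE / 8 );   # ~25 MB of zero bits

    while ( my $id = <> ) {
        chomp $id;
        my $slot = $id - $BASE;
        next if $slot < 0 or $slot >= $RANGE;    # outside the assumed range
        next if vec( $seen, $slot, 1 );          # already seen: a duplicate
        vec( $seen, $slot, 1 ) = 1;
        print "$id\n";                           # emit first occurrence only
    }

    Run it against the data file and redirect the output to get the de-duped list.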


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Hi BrowserUk,

      probably it's too early in the morning, but I can't see the script doing the dedup. There seems to be a magic program/script called 1054929. What am I missing?

      Best regards
      McA

        What am I missing?

        Nothing. I didn't post the actual script; just demonstrated that this wasn't an idle boast.


Re: DBI::SQLite slowness
by Jenda (Abbot) on Sep 20, 2013 at 09:17 UTC

    When a hash based solution starts to choke because of the amount of data, the easiest solution is to use an on-disk hash. See BerkeleyDB or DB_File. The only change necessary will be to load the module and tie the hash. The rest of the script will stay the same.
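    For what it's worth, a minimal sketch of such a tie with DB_File (the file name is just an example; the dedup logic is the usual seen-hash idiom):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_CREAT O_RDWR);

    # Tie the hash to an on-disk B-tree instead of keeping it in memory.
    tie my %seen, 'DB_File', 'seen_ids.db', O_CREAT | O_RDWR, 0644, $DB_BTREE
        or die "Cannot tie seen_ids.db: $!";

    while ( my $id = <> ) {
        chomp $id;
        next if $seen{$id};    # duplicate: skip it
        $seen{$id} = 1;
        print "$id\n";         # first occurrence only
    }

    untie %seen;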

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      I tried that a couple of times; the performance on inserting new records just became really bad. It may be OK when the DB file already exists and is more or less static, but apparently not when you are constantly adding new records. It may also be that I did not do it the right way, but I tried various things and I don't see what I might have done wrong.

        It will always be quite a bit slower than when the hash fits in memory, but still much quicker than if the hash grows so big that the computer has to swap memory pages onto disk.


Re: DBI::SQLite slowness (DBD::SQLite bulk insert woes)
by Anonymous Monk on Sep 20, 2013 at 02:32 UTC
        I can attest to efficiency improvements due to at least one of the notes in SQLite Optimization FAQ. Used it at a customer site once, too long ago to remember the numbers but I recall the difference was dramatic. Sample of the commands needed:
        C:\Steve>sqlite test.db
        C:\Steve>call sqlite-STEVEM2 test.db
        C:\Steve>"C:\App\SQLite\sqlite3.exe" test.db
        SQLite version 3.3.5
        Enter ".help" for instructions
        sqlite> pragma default_cache_size;
        2000
        sqlite> pragma default_cache_size=200000;
        sqlite> pragma default_cache_size;
        200000
        sqlite> .exit

        C:\Steve>sqlite test.db
        C:\Steve>call sqlite-STEVEM2 test.db
        C:\Steve>"C:\App\SQLite\sqlite3.exe" test.db
        SQLite version 3.3.5
        Enter ".help" for instructions
        sqlite> pragma default_cache_size;
        200000
        sqlite> .quit

        C:\Steve>
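        The same pragma can be issued from DBI before the bulk insert starts; a sketch, with the cache size value taken from the session above (tune it for your machine):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( "dbi:SQLite:dbname=vals.db", "", "",
            { RaiseError => 1, PrintError => 0 } );

        # Enlarge the page cache before loading; 200000 is the value used in the session above.
        $dbh->do("PRAGMA default_cache_size = 200000");
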
Re: DBI::SQLite slowness
by sam_bakki (Pilgrim) on Sep 22, 2013 at 06:18 UTC

      Hi

      It looks from that page that DELETE is the default ("The DELETE journaling mode is the normal behavior.").

      Using WAL (Write-Ahead Log) adds another write step (i.e. it is slower), so wouldn't it make more sense to switch journaling off:

      $dbHandle->do("PRAGMA journal_mode = OFF");

      That would make the transaction(s) less reliable in case of a crash, but in this particular case that seems unimportant (initial data load, ephemeral data). I would expect journal_mode = OFF to be faster than either the default DELETE mode or WAL.

      (Then again, I don't use SQLite so I may be completely wrong. If anyone does the tests, I'd be interested to see the speed differences.)
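      If anyone wants to try it from Perl, here is a sketch of setting and checking the journal mode through DBI (this pragma returns the mode actually in effect, so the return value is worth looking at):

      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect( "dbi:SQLite:dbname=vals.db", "", "",
          { RaiseError => 1 } );

      # PRAGMA journal_mode reports the mode now in effect; read it back to confirm.
      my ($mode) = $dbh->selectrow_array("PRAGMA journal_mode = OFF");
      print "journal_mode is now: $mode\n";    # expect "off"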

      Thanks for the suggestions. I've tried it with both WAL and OFF without a change in the results on my test-dataset of 83,000 entries. I'll keep it in mind when I try on my real datasets too, though.
Re: DBI::SQLite slowness
by Anonymous Monk on Sep 20, 2013 at 12:47 UTC
    You must perform SQLite operations in a transaction, because if you do not, it will re-read and re-verify e-v-e-r-y I/O operation, by design.
      If I simply have "$dbh->commit" following the For loop, is that effectively the same as explicitly using transactions?
        If I simply have "$dbh->commit" following the For loop, is that effectively the same as explicitly using transactions?

        No; BEGIN WORK (or BEGIN TRANSACTION, or just plain BEGIN) starts a transaction in SQL. Then (normally after one or more inserts/deletes/updates) COMMIT commits the transaction (or, when an error occurs, ROLLBACK rolls the transaction back to the state just before the BEGIN).

        See DBI's begin_work

        (I don't use SQLite; in the above I am assuming SQLite does this the same way that other (RDBMS) databases do)
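        For completeness, a sketch of the same thing through DBI's begin_work, written against the OP's table (and assuming DBD::SQLite behaves like other drivers here):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( "dbi:SQLite:dbname=vals.db", "", "",
            { RaiseError => 1, PrintError => 0, AutoCommit => 1 } );

        my $sth = $dbh->prepare("INSERT INTO data (klout) VALUES (?)");

        $dbh->begin_work;    # suspends AutoCommit until commit or rollback
        eval {
            $sth->execute($_) for 1 .. 1000;
            $dbh->commit;    # one commit for the whole batch
            1;
        } or do {
            my $err = $@ || 'unknown error';
            warn "Insert failed, rolling back: $err";
            $dbh->rollback;
        };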