http://www.perlmonks.org?node_id=1054929

Endless has asked for the wisdom of the Perl Monks concerning the following question:

It was suggested to me that I use a database to help prune out duplicate entries (based on 12-digit id numbers) in data I'm processing. I had previously been using a hash, but it starts to choke by 55 million entries. So I fired up DBI::SQLite to give it a try. And I was shocked by what I found.

I simply must have something wrong, because inserting into an sqlite database seems to be occurring at a rate of only 5 per second for me; just writing raw to hashes was going between 11,000 and 20,000 per second. This is unbelievably, unusably slow. What is going on?

#!/usr/bin/perl
use strict;
use warnings;
use v5.16.0;
use lib 'lib';
use DBI;
use Carp 'croak';

my $dbh = db_handle('vals.db');

my $sql_table = <<"SQL";
CREATE TABLE IF NOT EXISTS data (
    klout INTEGER
);
SQL
$dbh->do($sql_table);

my $sql_insert_statement = "INSERT INTO data (klout) VALUES (?)";
my $sth = $dbh->prepare($sql_insert_statement);

my $start = time;
foreach (1..1000) {
    $sth->execute($_);
}
say "Total time: ", (time - $start);    # 180 seconds

sub db_handle {
    my $db_file = shift or croak "db_handle() requires a database name";
    no warnings 'once';
    return DBI->connect(
        "dbi:SQLite:dbname=$db_file",
        "",    # no username
        "",    # no password
        { RaiseError => 1, PrintError => 0, AutoCommit => 1 },
    ) or die $DBI::errstr;    # note: $DBI::errstr, not $DBH::errstr
}

Replies are listed 'Best First'.
Re: DBI::SQLite slowness
by Cristoforo (Curate) on Sep 20, 2013 at 02:19 UTC
    It is slow because you have AutoCommit set to 1. It is committing for every insert. Just change that to 0 and $dbh->commit; after the foreach loop.
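    A minimal sketch of that change, using the database, table and statement from the original script:

    use strict;
    use warnings;
    use DBI;

    # Connect with AutoCommit disabled so all the inserts land in one transaction.
    my $dbh = DBI->connect(
        "dbi:SQLite:dbname=vals.db", "", "",
        { RaiseError => 1, PrintError => 0, AutoCommit => 0 },
    ) or die $DBI::errstr;

    my $sth = $dbh->prepare("INSERT INTO data (klout) VALUES (?)");
    foreach (1..1000) {
        $sth->execute($_);
    }
    $dbh->commit;    # one commit for the whole batch instead of one per insert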
      Brilliant! With that little fix, my speed is up to 2022 per second; that's almost workable, and I understand what was happening. Now time to start looking through the other suggestions.
        This is what I get on an Atom Eee PC (1.6 GHz), after I removed
        use v5.16.0;
        and changed
        say "Total time: ", (time - $start); # 180 seconds
        to
        print "Total time: ", (time - $start); # 180 seconds
        time perl db.pl
        Total time: 5

        real    0m5.348s
        user    0m0.360s
        sys     0m0.820s
        marica.fr: Management of contracts, litigation files and insurance claims

        Well, 200 million records at a rate of 2,000 per second is still 100,000 seconds, or almost 28 hours. That's still pretty long, isn't it? Having said that, you may be able to live with it; a full day of processing is still manageable in a number of cases. Beware, though, that the rate might slow down as your database grows larger.

        If you are really only looking for filtering out duplicates, the ideas discussed by BrowserUk are probably much better than using a database.

Re: DBI::SQLite slowness
by BrowserUk (Patriarch) on Sep 20, 2013 at 03:43 UTC

    Assumption: this data you are de-duping is downloaded fresh, daily from TwitFace.

    The idea of loading 180 million records into a db on disk in order to de-dup it is ridiculous if you are in any way concerned with speed.

    The following shows a 10-line perl script de-duping a 200-million line, 2.8 GB file of 12-digit numbers in a little over 2 1/2 minutes, using less than 30 MB of ram to do so:

    C:\test>dir 1054929.dat
    20/09/2013  04:22     2,800,000,000 1054929.dat

    C:\test>wc -l 1054929.dat
    200000000 1054929.dat

    C:\test>head 1054929.dat
    100112321443
    100135127486
    100110839892
    100098464584
    100098900542
    100048844759
    100090430059
    100018238859
    100132791659
    100027638968

    C:\test>1054929 1054929.dat | wc -l
    1379647642.87527
    1379647855.6311
    113526721

    That's processing the 12-digit numbers at a rate of just under 1 million per second.

    You cannot even load the data into the DB at 1/100th of that rate, never mind get the de-duped data back out.
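    The actual 10-line script isn't shown here, but for the curious, here is a minimal sketch of one bit-vector approach that fits those speed and memory figures. It assumes (my assumption, based on the sample above) that the ids all fall in a bounded range of roughly 100,000,000,000 to 100,200,000,000, so one bit per possible id costs about 25 MB:

    #!/usr/bin/perl
    # Hypothetical dedup sketch: one bit per possible id in an assumed bounded range.
    use strict;
    use warnings;

    my $BASE  = 100_000_000_000;        # assumed lower bound of the id range
    my $RANGE = 200_000_000;            # assumed span of the range

    my $seen = "\0" x ( $RANGE / 8 );   # ~25 MB of zero bits

    while ( my $id = <> ) {
        chomp $id;
        my $slot = $id - $BASE;
        next if $slot < 0 or $slot >= $RANGE;    # outside the assumed range
        next if vec( $seen, $slot, 1 );          # already seen: a duplicate
        vec( $seen, $slot, 1 ) = 1;
        print "$id\n";                           # emit first occurrence only
    }

    Run it against the data file and redirect the output to get the de-duped list.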


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Hi BrowserUk,

      probably it's too early in the morning, but I can't see the script doing the dedup. There seems to be a magic program/script called 1054929. What am I missing?

      Best regards
      McA

        What am I missing?

        Nothing. I didn't post the actual script; just demonstrated that this wasn't an idle boast.


Re: DBI::SQLite slowness
by Jenda (Abbot) on Sep 20, 2013 at 09:17 UTC

    When a hash based solution starts to choke because of the amount of data, the easiest solution is to use an on-disk hash. See BerkeleyDB or DB_File. The only change necessary will be to load the module and tie the hash. The rest of the script will stay the same.
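    For what it's worth, a minimal sketch of such a tie with DB_File (the file name is just an example; the dedup logic is the usual seen-hash idiom):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_CREAT O_RDWR);

    # Tie the hash to an on-disk B-tree instead of keeping it in memory.
    tie my %seen, 'DB_File', 'seen_ids.db', O_CREAT | O_RDWR, 0644, $DB_BTREE
        or die "Cannot tie seen_ids.db: $!";

    while ( my $id = <> ) {
        chomp $id;
        next if $seen{$id};    # duplicate: skip it
        $seen{$id} = 1;
        print "$id\n";         # first occurrence only
    }

    untie %seen;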

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      I tried that a couple of times; the performance on inserting new records just became really bad. It may be OK when the DB file already exists and is more or less static, but apparently not when you are constantly adding new records. It may also be that I did not do it the right way, but I tried various things and I don't see what I might have done wrong.

        It will always be quite a bit slower than when the hash fits in memory, but still much quicker than if the hash grows so big that the computer has to swap memory pages onto disk.


Re: DBI::SQLite slowness (DBD::SQLite bulk insert woes)
by Anonymous Monk on Sep 20, 2013 at 02:32 UTC
        I can attest to efficiency improvements due to at least one of the notes in SQLite Optimization FAQ. Used it at a customer site once, too long ago to remember the numbers but I recall the difference was dramatic. Sample of the commands needed:
        C:\Steve>sqlite test.db
        C:\Steve>call sqlite-STEVEM2 test.db
        C:\Steve>"C:\App\SQLite\sqlite3.exe" test.db
        SQLite version 3.3.5
        Enter ".help" for instructions
        sqlite> pragma default_cache_size;
        2000
        sqlite> pragma default_cache_size=200000;
        sqlite> pragma default_cache_size;
        200000
        sqlite> .exit

        C:\Steve>sqlite test.db
        C:\Steve>call sqlite-STEVEM2 test.db
        C:\Steve>"C:\App\SQLite\sqlite3.exe" test.db
        SQLite version 3.3.5
        Enter ".help" for instructions
        sqlite> pragma default_cache_size;
        200000
        sqlite> .quit

        C:\Steve>
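        The same pragma can be issued from DBI before the bulk insert starts; a sketch, with the cache size value taken from the session above (tune it for your machine):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( "dbi:SQLite:dbname=vals.db", "", "",
            { RaiseError => 1, PrintError => 0 } );

        # Enlarge the page cache before loading; 200000 is the value used in the session above.
        $dbh->do("PRAGMA default_cache_size = 200000");
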
Re: DBI::SQLite slowness
by sam_bakki (Pilgrim) on Sep 22, 2013 at 06:18 UTC

      Hi

      It looks from that page that DELETE is the default ("The DELETE journaling mode is the normal behavior.").

      Using WAL (Write-Ahead Log) adds another write step (i.e. it is slower), so wouldn't it make more sense to switch journaling off:

      $dbHandle->do("PRAGMA journal_mode = OFF");

      That would make the transaction(s) less reliable in case of a crash, but in this particular case that seems unimportant (initial data load, ephemeral data). I would expect journal_mode = OFF to be faster than either the default DELETE mode or WAL.

      (Then again, I don't use SQLite so I may be completely wrong. If anyone does the tests, I'd be interested to see the speed differences.)
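      If anyone wants to try it from Perl, here is a sketch of setting and checking the journal mode through DBI (this pragma returns the mode actually in effect, so the return value is worth looking at):

      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect( "dbi:SQLite:dbname=vals.db", "", "",
          { RaiseError => 1 } );

      # PRAGMA journal_mode reports the mode now in effect; read it back to confirm.
      my ($mode) = $dbh->selectrow_array("PRAGMA journal_mode = OFF");
      print "journal_mode is now: $mode\n";    # expect "off"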

      Thanks for the suggestions. I've tried it with both WAL and OFF without a change in the results on my test-dataset of 83,000 entries. I'll keep it in mind when I try on my real datasets too, though.
Re: DBI::SQLite slowness
by Anonymous Monk on Sep 20, 2013 at 12:47 UTC
    You must perform SQLite operations in a transaction, because if you do not, it will re-read and re-verify e-v-e-r-y I/O operation, by design.
      If I simply have "$dbh->commit" following the For loop, is that effectively the same as explicitly using transactions?
        If I simply have "$dbh->commit" following the For loop, is that effectively the same as explicitly using transactions?

        No; BEGIN WORK (or BEGIN TRANSACTION, or just plain BEGIN) starts a transaction in SQL. Then (normally after one or more inserts/deletes/updates) COMMIT commits the transaction (or, when an error occurs, ROLLBACK rolls the transaction back to the state just before the BEGIN).

        See DBI's begin_work

        (I don't use SQLite; in the above I am assuming SQLite does this the same way that other (RDBMS) databases do)
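        For completeness, a sketch of the same thing through DBI's begin_work, written against the OP's table (and assuming DBD::SQLite behaves like other drivers here):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( "dbi:SQLite:dbname=vals.db", "", "",
            { RaiseError => 1, PrintError => 0, AutoCommit => 1 } );

        my $sth = $dbh->prepare("INSERT INTO data (klout) VALUES (?)");

        $dbh->begin_work;    # suspends AutoCommit until commit or rollback
        eval {
            $sth->execute($_) for 1 .. 1000;
            $dbh->commit;    # one commit for the whole batch
            1;
        } or do {
            my $err = $@ || 'unknown error';
            warn "Insert failed, rolling back: $err";
            $dbh->rollback;
        };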