Re: DBI::SQLite slowness
by Cristoforo (Curate) on Sep 20, 2013 at 02:19 UTC
It is slow because you have AutoCommit set to 1: it is committing (and syncing to disk) after every single insert. Just change that to 0 and call $dbh->commit after the foreach loop.
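A minimal sketch of that change (database and table names here are placeholders, not the OP's):

use DBI;

# AutoCommit => 0 makes DBI open one transaction and keep it open,
# instead of committing after every single INSERT
my $dbh = DBI->connect( "dbi:SQLite:dbname=tweets.db", "", "",
    { AutoCommit => 0, RaiseError => 1 } );

my $sth = $dbh->prepare("INSERT INTO ids (id) VALUES (?)");
$sth->execute($_) for @ids;    # all inserts join the same transaction

$dbh->commit;                  # one commit after the foreach loop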
Brilliant! With that little fix, my speed is up to 2022 inserts per second; that's almost workable, and I understand what was happening. Now it's time to start looking through the other suggestions.
This is what I get on an Atom EeePC (1.6 GHz), after I removed
use v5.16.0;
and changed
say "Total time: ", (time - $start); # 180 seconds
to
print "Total time: ", (time - $start); # 180 seconds
time perl db.pl
Total time: 5
real 0m5.348s
user 0m0.360s
sys 0m0.820s
marica.fr : Management of contracts, litigation files and insurance claims
Well, 200 million records at a rate of 2,000 per second is still 100,000 seconds, or almost 28 hours. That's still pretty long, isn't it? Having said that, you may be able to live with it; a full day of processing is still manageable in a number of cases. Beware, though, that the rate might slow down as your database grows larger.
If you are really only looking to filter out duplicates, the ideas discussed by BrowserUk are probably much better than using a database.
Re: DBI::SQLite slowness
by BrowserUk (Patriarch) on Sep 20, 2013 at 03:43 UTC
Assumption: this data you are de-duping is downloaded fresh, daily from TwitFace.
The idea of loading 180 million records into a db on disk in order to de-dup it is ridiculous if you are in any way concerned with speed.
The following shows a 10-line perl script de-duping a 200-million-line, 2.8 GB file of 12-digit numbers in a little over 2 1/2 minutes, using less than 30 MB of RAM to do so:
C:\test>dir 1054929.dat
20/09/2013 04:22 2,800,000,000 1054929.dat
C:\test>wc -l 1054929.dat
200000000 1054929.dat
C:\test>head 1054929.dat
100112321443
100135127486
100110839892
100098464584
100098900542
100048844759
100090430059
100018238859
100132791659
100027638968
C:\test>1054929 1054929.dat | wc -l
1379647642.87527
1379647855.6311
113526721
That's processing the 12-digit numbers at a rate of just under 1 million per second.
You cannot even load the data into the DB at 1/100th of that rate; never mind get the de-duped data back out.
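The 10-line script itself doesn't appear above; a minimal sketch of a vec()-based bit-vector de-dup along those lines might look like this (the 100_000_000_000 base offset for the 12-digit ids is an assumption, not BrowserUk's actual code; the Time::HiRes stamps mimic the two numbers in the output):

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes 'time';

print STDERR time(), "\n";                 # start stamp (stderr, so wc -l doesn't count it)
my $seen = '';                             # bit vector: one bit per candidate id
while (<>) {
    chomp;
    my $slot = $_ - 100_000_000_000;       # assumed base of the 12-digit ids
    print "$_\n" unless vec( $seen, $slot, 1 );   # emit first occurrence only
    vec( $seen, $slot, 1 ) = 1;            # mark this id as seen
}
print STDERR time(), "\n";                 # end stamp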
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
perl -e "my $f = ''; vec($f, 2**28, 1) = 1;"
I think that's marking id 268435456 (i.e. 268_435_456) as seen (as a duplicate).
I think BrowserUk is probably assuming a minimum starting id; 268_435_456 is a lot of IDs :) and a 2**28-bit vector is only about 32 MiB of string.
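A quick way to check that arithmetic (the string grows to the touched bit's byte index plus one):

perl -e "my $f = ''; vec($f, 2**28, 1) = 1; print length($f);"

which prints 33554433 (bytes), i.e. just over 32 MiB.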
Re: DBI::SQLite slowness
by Jenda (Abbot) on Sep 20, 2013 at 09:17 UTC
When a hash-based solution starts to choke because of the amount of data, the easiest solution is to use an on-disk hash. See BerkeleyDB or DB_File. The only change necessary will be to load the module and tie the hash; the rest of the script will stay the same.
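A minimal sketch of that change, assuming the script currently uses a plain %seen hash (the file name is a placeholder):

use Fcntl;
use DB_File;

# One tie turns the in-memory hash into an on-disk one;
# every other use of %seen stays exactly as it was.
tie my %seen, 'DB_File', 'seen.db', O_CREAT | O_RDWR, 0666, $DB_HASH
    or die "Cannot tie seen.db: $!";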
Jenda
Enoch was right!
Enjoy the last years of Rome.
I tried that a couple of times; the performance on inserting new records just became really bad. It may be OK when the DB file already exists and is more or less static, but apparently not when you are constantly adding new records. It may also be that I did not do it the right way, but I tried various things and I don't see what I might have done wrong.
Re: DBI::SQLite slowness (DBD::SQLite bulk insert woes)
by Anonymous Monk on Sep 20, 2013 at 02:32 UTC
$dbh->begin_work;               ## begin transaction
$sth->execute($_) for @ids;     ## foreach loop doing the inserts
$dbh->commit;                   ## end transaction
See also Re^5: Index a file with pack for fast access and Sauntering Coder: Fastest bulk import into sqlite.
I can attest to efficiency improvements from at least one of the notes in the SQLite Optimization FAQ. I used it at a customer site once; that was too long ago to remember the numbers, but I recall the difference was dramatic. A sample of the commands needed:
C:\Steve>sqlite test.db
C:\Steve>call sqlite-STEVEM2 test.db
C:\Steve>"C:\App\SQLite\sqlite3.exe" test.db
SQLite version 3.3.5
Enter ".help" for instructions
sqlite> pragma default_cache_size;
2000
sqlite> pragma default_cache_size=200000;
sqlite> pragma default_cache_size;
200000
sqlite> .exit
C:\Steve>sqlite test.db
C:\Steve>call sqlite-STEVEM2 test.db
C:\Steve>"C:\App\SQLite\sqlite3.exe" test.db
SQLite version 3.3.5
Enter ".help" for instructions
sqlite> pragma default_cache_size;
200000
sqlite> .quit
C:\Steve>
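For what it's worth, the per-connection equivalent can presumably be set from Perl at run time as well (a sketch, using the same figure as above):

$dbh->do("PRAGMA cache_size = 200000");   # this connection only; default_cache_size above persists in the db file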
Re: DBI::SQLite slowness
by sam_bakki (Pilgrim) on Sep 22, 2013 at 06:18 UTC
Hi
If you would still like to use SQLite, try this:
$dbHandle->do("PRAGMA journal_mode = WAL");
http://sqlite.org/pragma.html#pragma_journal_mode
Hi
It looks from that page that DELETE is the default ("The DELETE journaling mode is the normal behavior.").
Using WAL (Write-Ahead Log) is still another writing step (slow), so wouldn't it make more sense to switch journaling off:
$dbHandle->do("PRAGMA journal_mode = OFF");
That would make the transaction(s) less reliable in case of a crash, but in this particular case that seems unimportant (initial data load, ephemeral data). I would expect journal_mode = OFF to be faster than either the default DELETE or writing WAL.
(Then again, I don't use SQLite so I may be completely wrong. If anyone does the tests, I'd be interested to see the speed differences.)
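For whoever runs those tests, a sketch of the combination I would compare against WAL for a throwaway bulk load (PRAGMA synchronous is a further, equally crash-unsafe knob):

$dbHandle->do("PRAGMA journal_mode = OFF");   # no rollback journal at all
$dbHandle->do("PRAGMA synchronous = OFF");    # don't wait for fsync either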
Thanks for the suggestions. I've tried it with both WAL and OFF without a change in the results on my test dataset of 83,000 entries. I'll keep it in mind for when I try my real datasets, though.
Re: DBI::SQLite slowness
by Anonymous Monk on Sep 20, 2013 at 12:47 UTC
You must perform SQLite operations in a transaction, because if you do not, it will re-read and re-verify e-v-e-r-y I/O operation, by design.
If I simply have "$dbh->commit" following the foreach loop, is that effectively the same as explicitly using transactions?
If I simply have "$dbh->commit" following the foreach loop, is that effectively the same as explicitly using transactions?
No; BEGIN WORK (or BEGIN TRANSACTION, or just plain BEGIN) starts a transaction in SQL. Then (normally after one or more inserts/deletes/updates) COMMIT commits the transaction (or, when an error occurs, ROLLBACK rolls the transaction back to the state just before the BEGIN).
See DBI's begin_work
(I don't use SQLite; in the above I am assuming SQLite does this the same way that other (RDBMS) databases do)
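A sketch of the explicit pattern from Perl, using DBI's begin_work, which suspends AutoCommit until the matching commit or rollback ($sth and @ids are placeholders):

$dbh->begin_work;                    # BEGIN: AutoCommit off until commit/rollback
eval {
    $sth->execute($_) for @ids;      # the bulk inserts
    $dbh->commit;                    # COMMIT makes them permanent
};
if ($@) {
    warn "Transaction aborted: $@";
    $dbh->rollback;                  # ROLLBACK returns to the pre-BEGIN state
}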