|Perl: the Markov chain saw|
Re: dynamic bulk insert in sqliteby Marshall (Abbot)
|on Sep 23, 2012 at 00:46 UTC||Need Help??|
I actually haven't used the bulk import to SQLite - I actually don't even have the command line installed on my machine! I don't need it! I just installed the Perl DBI::SQLite module and that was it.
There is a firefox add-on SQLite Manger and it easier to use than the command line. I highly recommend this gizmo! I am not a super experienced SQL guy, but I do have a MySQL server and its command line utils on my machine. SQLite is way easier to use (no admin hassle) and I've found this graphical firefox add-on to be very helpful for viewing an existing database.
I "hand code" the table creation and inserts because it is easy (a page+ of code), runs very fast (same as fastest bulk insert utility) and I'm often fiddling the column names and I make decisions about which columns to index - that's important. Some of my DB's have literally millions of rows.
I've only made about a dozen DB's, so I am inexperienced, but I have learned few things...
First tip is that when I create the table, I use lower case and "_" instead of spaces in the names. There is a hash variant in the DBI that lowercases the column names that it returns. If everybody is lower case to begin with this makes it easier to mess with later (gurus corrections to this observation are welcome- but that appears to be the case from what I've seen so far). A space is allowed in a hash key, but again eliminating that simplifies things later in actual use.
I guess you will start with perhaps some CSV file? The first line is typically the column names. I adjust them as explained above. In the CREATE TABLE statement, you have choices about type (VARCHAR, INTEGER, etc) as well as DEFAULT and Null or not allowed. This is application specific and you will have to decide what you want to do. These choices affect future table inserts.
There are 2 main things that affect the speed of the inserts. The first of these is:
If this is a new DB creation, this is perfectly fine. This causes the next write operation to proceed without confirmation that the previous one succeeded. A typical speed increase might be from 45 seconds to 30 seconds based upon this. If you are adding to an existing DB, then this is not a good idea because it is not transaction safe.
By far and away the LARGEST speed increase is achieved my making the entire import a single transaction!
I have some actual benchmarks on Monks, but wasn't able to find them easily with my first search, but basically this can be the difference between a couple of minutes and a couple of days!. This is the single most important thing to do!
SQlite's cache size can be dynamically varied as the DB runs. When doing inserts, this won't make much difference. This will make a large difference when indexing the tables.
I think the default is like 20 MB, increasing this limit will dramatically affect index creation on large tables! In SQLite, you can actually move this param up and down dynamically. But for a bulk import program, just set it to 200 MB or 400 MB and don't worry about it. A time difference might be 15 seconds instead of 3 minutes. Worth doing.
It is actually possible to "over index" the tables. Typically a lot of the columns will not need an index. The query optimizer can become "confused" if you give it too many options and this slows down queries. So "more" is not necessarily "better". What to index or not is application specific and I can't guide you further on that without more information.
Oh, if this is a new table. Create the indicies after you create the table! This will also have a huge impact on insert performance!
Update: I found my post with some benchmarks: Re^3: Efficient way to handle huge number of records?. Look at the links that this refers to for some example code. This DB has more than a million rows.