Did you take a look at PostgreSQL? It's used for datasets like OpenStreetMap.
I was thinking of an SQL DB. My recommendation would be to start with the simplest thing: an SQLite DB. It doesn't have fancy multi-threading to process SQL commands, but it does use standard SQL, and the code you write for it can later be used with more capable DBs like PostgreSQL, etc. If SQL (pronounced "See-quel") is a foreign word to you, all the more reason to start with something simple. The installation, configuration and management of a DB server can get complicated; SQLite doesn't require any of that. In addition, it is possible to dynamically vary the memory footprint available to SQLite. I've used that feature before and it works. SQLite will use whatever memory it is given - it might get a lot slower with less memory, but it won't "blow up".
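To give a feel for how little setup that takes, here is a minimal sketch using DBI with DBD::SQLite (the file name and table layout are just placeholders, not your real schema):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # The whole "database server" is a single file; connecting creates it if needed.
    my $dbh = DBI->connect( "dbi:SQLite:dbname=my_data.sqlite", "", "",
                            { RaiseError => 1, AutoCommit => 1 } );

    # Standard SQL -- the same statement would work against PostgreSQL later.
    $dbh->do("CREATE TABLE IF NOT EXISTS raw_lines (id INTEGER PRIMARY KEY, content TEXT)");

    $dbh->disconnect;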
I have no idea what your raw data looks like or what end report you are trying to generate. Showing failed Perl code is not enough. You will have to back up, explain the initial problem (show some abbreviated example data) and give an example of the end result you want. Then perhaps I or other Monks can suggest an appropriate table structure.
Don't underestimate SQLite. One project that I'm working on now has 3 directories with 3,000+ files in each. Creating the table for each directory takes about 8 seconds for 1M rows. Processing each directory's worth of data takes <2 seconds. I am telling you that a 3M-row DB is nothing. How many million lines do you have, and how much data is in each line? It could very well be that instead of a single complete "de-normalized" table, you will wind up with multiple tables that are related somehow. For one project, I wound up using the URL that the data came from as the link between 2 tables. It didn't have to be that way, but it was sufficient for that project. DBs often use integer values as "keys" that link tables, but it doesn't have to be that way.
I don't know enough to advise further.
UPDATE: re: "UPDATE: I had some luck with DBM::Deep where I can store and retrieve from the database however the write step seems to be very slow although it keeps the memory footprint acceptable."
The "write step" to a DB is fast. The "commit step" is slow and DB journaling does take time. I have never used DBM::Deep. With SQLite as with other SQL db's, you want to: $db->begin_work;, then do millions of inserts, then do $dbh->commit; A single transaction can have millions of rows. I do recommend switching to SQLite.
SQLite is written in C. DBM::Deep is written in pure Perl. SQLite is by far the most widely used DB in the world - it's in your cell phone, it's in your browser, it's everywhere. There are many performance tweaks that can be applied to SQLite - often at the expense of compromising ACID properties. However, sometimes this is completely appropriate. In my example referenced above, I could speed things up by turning off journaling, switching to asynchronous writes, etc. In my case 30 sec is "fast enough" and I don't worry about DB creation; 3 vs 30 seconds is the same to me. The heavy lifting is making a million inserts a single transaction. But I could make my app faster if I really wanted or needed to (which I don't).
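The kind of tweaks I mean look roughly like this - these are standard SQLite PRAGMAs, but only use them on a DB you can rebuild from the raw files, since they weaken the ACID guarantees:

    $dbh->do("PRAGMA journal_mode = OFF");  # no rollback journal at all
    $dbh->do("PRAGMA synchronous  = OFF");  # don't wait for writes to reach the disk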
Recommendation: Switch to SQLite. Faster, more options, gigantic user base, very stable. The DBD::SQLite driver works through the standard DBI module and bundles the SQLite engine itself, so there is no separate database server to install.
Another update, re: memory usage - DBM::Deep is a Perl program. SQLite is a C program and it can and does play by different memory management rules. I am not sure if this limit still exists, but at one time SQLite was limited to 2GB of RAM, and its default is way, way less than that. In one application that I targeted at WinXP, I ran SQLite's memory up to the extravagant level of 500MB for one "expensive" indexing operation and then back down again after that. A Perl program cannot give memory back to the O/S, but SQLite can. None of this matters for DB creation, but if you needed, say, to make a histogram of a column with 50M rows, more memory probably would help. I would have to look back at previous code to find the commands for adjusting memory, but they do exist. My XP application was targeted at a machine which only had 1-2GB of RAM at most. For your app, I would run the memory footprint up to 2GB and not worry about it after that.
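If memory serves, the knob is PRAGMA cache_size, where a negative value is interpreted as kibibytes; something like the sketch below should raise and then lower the footprint (treat the exact numbers as placeholders):

    $dbh->do("PRAGMA cache_size = -2000000");   # roughly 2 GB of page cache for the expensive step
    # ... run the expensive indexing / reporting here ...
    $dbh->do("PRAGMA cache_size = -2000");      # back down to roughly 2 MB afterwards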
I appreciate the comprehensive reply, Marshall. I elaborated on the data and the actual code in one of the responses to this thread.
The current project that is blowing the RAM is the smallest benchmark before scaling to the wider dataset, and I can foresee issues with reading and retrieving despite MCE getting to the rescue (for now). Hence, I see the case for SQL/SQLite (which has remained outside my toolbox for ages because of the bittersweet relationship I had with Oracle DBs).
The next step is to statistically analyse the final data in R. I am kinda apprehensively curious about how R might take reading the file.
In fact, the DBM::Deep::Cookbook documents the slow performance:
Because DBM::Deep is a concurrent datastore, every change is flushed to disk immediately and every read goes to disk. This means that DBM::Deep functions at the speed of disk (generally 10-20ms) vs. the speed of RAM (generally 50-70ns), or at least 150-200x slower than the comparable in-memory datastructure in Perl.
There are several techniques you can use to speed up how DBM::Deep functions.
Put it on a ramdisk
The easiest and quickest mechanism to making DBM::Deep run faster is to create a ramdisk and locate the DBM::Deep file there. Doing this as an option may become a feature of DBM::Deep, assuming there is a good ramdisk wrapper on CPAN.
Work at the tightest level possible
It is much faster to assign the level of your db that you are working with to an intermediate variable than to re-look it up every time.
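A small sketch of what that "tightest level" advice seems to mean in practice (the file and key names here are made up):

    use strict;
    use warnings;
    use DBM::Deep;

    my $db = DBM::Deep->new("results.db");

    # Slow: every iteration re-walks results -> sample_1 -> counts on disk.
    $db->{results}{sample_1}{counts}{$_} = 0 for 1 .. 1000;

    # Faster: fetch the innermost level once and work against that reference.
    $db->{results}{sample_2}{counts} = {};
    my $counts = $db->{results}{sample_2}{counts};
    $counts->{$_} = 0 for 1 .. 1000;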
OK, I looked at the data that you referenced. Below is some sample code to create an SQLite DB from the data that you showed. I have no idea whatsoever what this data means, so I took some guesses. Each column has to have a unique name, and I just put in a guess for each heading.
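Something along these lines - the file name, the whitespace split and the column names are all guesses on my part:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( "dbi:SQLite:dbname=bench.sqlite", "", "",
                            { RaiseError => 1, AutoCommit => 1 } );

    $dbh->do("DROP TABLE IF EXISTS raw_data");
    $dbh->do("CREATE TABLE raw_data (name TEXT, position INTEGER, score REAL)");

    my $insert = $dbh->prepare("INSERT INTO raw_data (name, position, score) VALUES (?,?,?)");

    $dbh->begin_work;                           # one transaction for the whole file
    open my $fh, '<', 'your_data.txt' or die "your_data.txt: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        my @fields = split ' ', $line;          # adjust to the real delimiter
        $insert->execute( @fields[0 .. 2] );
    }
    close $fh;
    $dbh->commit;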
You can hardly blink fast enough before this code finishes. I would estimate that this code will take about 15*10 seconds, or 150 seconds, to create a 15M-line table. OK, 2 1/2 minutes for table creation. Run this with your 15M-line data set and see how long it actually takes on your machine.
15M rows is "not big" as these things go. Reads are going to be much faster than writes. I could perhaps make the table creation run 2x as fast, but to what point? I think the real question is what processing of the data do you want once the table is created? I still don't understand that part.
Forget about the MCE stuff for the time being. I have a 4-core machine. With a very compute-bound job, I can use 4 cores and get the job done maybe 3.8x faster. At this point, focus on the order-of-magnitude improvements and on getting something at small scale to produce the result you want.