Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery

Processing ~1 Trillion records

by aossama (Acolyte)
on Oct 25, 2012 at 09:48 UTC ( #1000794=perlquestion: print w/ replies, xml ) Need Help??
aossama has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am in charge of re-factoring a script which basically iterates over an array from a SELECT query from a database which returns ~1 Trillion records, the array restructures the items in a certain pattern and produces csv files. The script works fine, but the issue I am facing is that it takes ~16 days to finish. I have to decrease the run-time of the script; So these were the thoughts I had:
  • Grid Engine
  • Hadoop
  • Optimize the script
I appreciate any advice regarding any of the three topics, what modules to use, any past experience with such huge number of records, or any other solutions to optimize the performance of the script. Thanks in advance.

Replies are listed 'Best First'.
Re: Processing ~1 Trillion records
by Corion (Pope) on Oct 25, 2012 at 09:57 UTC

    First of all, I would look at tuning the database, and pushing part of the work into the database. Databases are usually good at aggregation, for example, but depending on your key space size, aggregation or sorting may be prohibitive on your database server unless the indices needed for that already exist.

    If working with the database is not a good option, I would write the query results to disk into one or more files, and simply launch the processing in parallel for each file. This approach relies on the fact that all your aggregators are symmetric. For example sum() and count() and max() are symmetric aggregators, because it doesn't matter in which order you visit the rows to find them.

    If you need more complicated aggregators, like TOP 10 or a HAVING filter, things will require much more thought - Google Sawzall and MapReduce have produced some papers on symmetric aggregators for other than the trivial stuff.

    I've used small SQLite databases to produce and store the intermediate results and ATTACHed the databases to create the final results from them. In my experience SQLite is not suitable for holding (and JOINing) data if the file size goes above 2GB, but that experience was with SQLite 2. Changes to the Btree backend may have improved that limit.

    Without knowing about how the data is fetched from the database and how the restructuring is to be done, it's hard to suggest a proper optimization. Just be warned that for example some tied hashes do not like to have more than 4GB keys.

Re: Processing ~1 Trillion records
by erix (Vicar) on Oct 25, 2012 at 09:52 UTC

    I'd say you first have to profile: where is that time spent: database retrieval, downstream processing, writing, (perhaps also writing back to database?), etc.

    16 days; there is probably room for improvement, so try cutting up that profile in as many parts as you can think of.

Re: Processing ~1 Trillion records
by DrHyde (Prior) on Oct 25, 2012 at 11:08 UTC

    A trillion records? A trillion *bytes* is roughly 1TB. Let's assume that your records are, on average, 32 bytes - they're probably bigger, but that doesn't really matter. So you need to read 32TB, process it, and write 32TB. I don't think it's at all unreasonable to take 16 days for 64TB of I/O.

    As you've been told, you need to profile your code. You actually need to do that for any performance problem, not just this one.

    I, of course, have not profiled your code, and so everything from here on is mere speculation, but I bet that you are I/O bound. You have at least three places where I/O may be limiting you. Reading from the database (especially if you've only got one database server); transmitting data across the network from the database to the machine your perl code is running on; and writing the CSV back out. At least the first two of those can be minimised by partitioning the data and the workload and parallelising everything across multiple machines. You *may* be able to partition the data such that you can have seperate workers producing CSV files too.

Re: Processing ~1 Trillion records
by marto (Bishop) on Oct 25, 2012 at 09:59 UTC

    I second the advice of benchmarking/profiling your existing code, find existing bottlenecks. Perhaps the script in question is inefficient, profiling will reveal what's going on. Devel::NYTProf/Debugging and Optimization.

Re: Processing ~1 Trillion records
by dHarry (Abbot) on Oct 25, 2012 at 10:06 UTC

    Can you give some background information of the problem at hand?

    What database do you use? What does the query look like?

    Restructuring a massive array and writing CSV files sounds costly. What pattern? How many CSV files?

    Profiling a (small) subset of the data might give some indication. Did you time how long the various steps take?

    In short: could you show some code please?

Re: Processing ~1 Trillion records
by BrowserUk (Pope) on Oct 25, 2012 at 23:29 UTC

    Why are you bothering (and paying the transport & memory costs) to query 5 fields for every record, when you know that 3 of those fields are always going to be '9', 'STR', & 'STR1' respectively?

    Why build a 2-level hash:

    $data{ $dat[0] }{ $dat[1] } = $tmarker;

    when the first level is always '9'.

    Ditto this:

    $lines{ $dat[0] } = "";

    And why are you doing this:

    if( !defined $dat[3] ) { $dat[3] = 0; } if( !defined $dat[4] ) { $dat[4] = 0; }

    When the query dictates that fields 3 & 4 can never be undefined.

    What is the point of this:

    $datachr{ $dat[1] }{ $dat[3] } = $dat[4];

    when $dat[3] & $dat[4] are constants?

    Ditto this:

    $chr{ $dat[3] } = "";

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

Re: Processing ~1 Trillion records
by sundialsvc4 (Abbot) on Oct 25, 2012 at 12:47 UTC

    Obviously, the algorithm that you present probably is not the real one, but this simply looks to me like something that ought to be able to benefit from sorting and/or grouping on the database level.   Is there truly nothing that you can do in that query in order to produce aggregated results?   Also:   you still must have a breakdown of the timing, even if you simply print the time-of-day to STDERR at the point at which the query-prepare is finished and the point at which the first row of data is produced.   I’ve got several terabytes of data storage on my computer right here, and even though it takes a while to move that much data around, and even though it’s not squirting through a large TCP/IP network, I don’t believe for a second that “16 days” can’t be very significantly improved upon.

    You should also, just to be sure, explain that query (since it doesn’t use the verb inner join), to make sure that it’s not doing something absolutely insane such as a Cartesian product at any point.   (16 days ... what would do that?   Anything along those lines would.   If there are no indexes, you probably just found your problem, and explain would confirm or deny it.)

    Probably the number-one improvement would be any way whatsoever by which you can prevent all that data from being transmitted.   The second would be to avoid a massive hash that must accumulate before its contents can be dumped.   For instance, if the data were or could be indexed by what you call “marker,” then you could select distinct a list of those markers and process them one at a time, perhaps in parallel.   It would no longer have to grind away for 16 days without producing anything and at the ever-present risk of producing nothing at all.   If meaningful, it might be able to say, “I already have that file, and it looks like I don’t need to produce it again.   (If the data were stored back in a table rather than a CSV, the server might be able to do that with the help of a join ... and the whole process might conceivably become the candidate for a stored-procedure or for a process running directly on the database-server, thereby avoiding across-the-network I/O.

      Actually the algorithm presented is the real one I am using. I don't have access to the data on the database level. I am profiling the script right now and checking the bottlnecks. Also I am trying to use an intermediate tmp database as Jenda said, and trying to use Redis as well.

        Okay... after explaining the query to see how the DBMS actually is approaching it, I would check for indexes and then consider doing a select distinct query to retrieve all of the unique keys.   Then, issue a query for each marker in turn, possibly splitting-out that work among processes, threads, or machines.   In this way, each file can be completely finished and the data in-memory disposed of in anticipation of the next request.

        Seemingly innocuous calls such as keys can be surprisingly expensive, as can sort, when there are known to be a prodigious amount of keys involved.   Hence, I would measure before doing serious recoding.

        “6 days” is such an extreme runtime ... that’s an intuition-based comment ... that there will most certainly turn out to be “one bugaboo above all others,” such that this is the first place and quite probably the only place that will require your attention.

        Why don't you have access at the database level?
        my $dbh = DBI->connect("; sid=sid; po +rt=1521", "username", "password")
        This contains all the relevant details to connect via sqlplus or SQL Developer. I don't mean any disrespect, but extracting such a large amount of rows and processing them by hand (or any other tool) can only be several orders of magnitude slower than doing it in SQL or at least PL/SQL. Could you explain the requirement, even in abstract form, here or on Oracle's forum? I'm sure people would come up with suggestions for an improved query.
Re: Processing ~1 Trillion records
by salva (Abbot) on Oct 25, 2012 at 10:07 UTC
    Without seeing the script it is impossible to say how it could be optimized!
      I am very sorry guys, but I can't show the exact code as it's confidential, but it goes like this.
      # use DBI oracle my $dbh = DBI->connect("; sid=sid; po +rt=1521", "username", "password") or die "Can't connect to database $ +DBI::errstr\n"; # The SELECT query # This query return ~9,450,000,000 records $sql_chr = qq{select column_a, g.column_b, column_c, column_d, column_ +e from table_f f, table_g g where f.column_a = g.column_b and column +_c !='-/-' and column_d= 'STR' and column_e='STR1' and column_a = 9}; my $sth_chr = $dbh->prepare( $sql_chr ); if($sth_chr->execute) { my $s=1; while (my (@dat) = $sth_chr->fetchrow) { ##if ($dat[2]) {$tmarker=translate($dat[2]);} ## else { $tmarker="00000"; } $tmarker=translate($dat[2]); #$data{$dat[0]}{$dat[1]} = $dat[2]; $data{$dat[0]}{$dat[1]} = $tmarker; $lines{$dat[0]} = ""; $markers{$dat[1]} = ""; $tmarkers{$tmarker}=""; #if (!defined $dat_chr1[1]) { $dat_chr1[1]=0; } if (!defined $dat[3]) { $dat[3]=0;} if (!defined $dat[4]) { $dat[4]=0;} #if ($dat[3] eq undef) { $dat[3]=0;} #print $dat_chr1[1]." here is the missing data\n"; #$datachr{$dat[0]}{$dat[1]}{$dat[3]} = $dat[4]; $datachr{$dat[1]}{$dat[3]} = $dat[4]; $chr{$dat[3]} = ""; # print "$s--> $cropid-->$dat[0]-->$dat[1]-->$dat[2]-->$dat[3] +-->$dat[4]-->$tmarker\n"; $s++; } } #### printing the files my $count_markers = keys %markers; my $count_lines = keys %lines; # my $count_chr=keys %chr; # print the header my $count_chr=1; my $checkchr = printmcdchr($csvfile, $count_markers, $count_chr); # print the content my $i=1; foreach my $m (sort {$a eq $b} keys %markers) { print OUT "$m $i\n"; $i++; }

        You seem to be accumulating lots of data in the hashes, are you sure it all fits in memory? As soon as you force the computer to swap memory pages to disk, the processing time grows insanely!

        It might help to tie the hashes to a DBM file (DB_File, MLDBM, ...) or use a SQLite or some other database to hold the temporary data. Doing as much work as you can upfront in the Oracle database would most probably be even though. Sometimes a use DB_File;tie %data, 'DB_File', 'filename.db'; is all you need to change something from unacceptably slow to just fine.

        Enoch was right!
        Enjoy the last years of Rome.

        If you only need to provide default values for your output, you can do that in SQL directly. I think Oracle has the NVL() function for that.

        If that is the only thing your program is doing, then the bottleneck is most likely Oracle, which is not really suited for quickly producing reporting outputs. There are more specialized databases for that, like Sybase IQ.

        Update: Thinking more about this, if your database is not idle during the 16 days, the transaction facilities of Oracle will also lessen the database speed. Oracle will try to present a coherent state of the database to each connection, and thus will keep rollback logs for the whole 16 days while your query runs. This creates additional overhead that may be slowing your query down. I would look at setting up a separate instance, or even a dedicated reporting database which has the appropriate indices for your queries instead of using the production database. Consider importing the data from the production database into your reporting database by restoring a backup of the production database. This also serves as a way to find out whether your backups can be restored at all.

        To quote Tim Bunce

        - while(@row = $sth->fetchrow_array) { } " one column: 51,155 fetches per second " 20 columns: 24,032 fetches per second - while($row = $sth->fetchrow_arrayref) { } " one column: 58,653 fetches per second - approximately 12% faster " 20 columns: 49,390 fetches per second - approximately 51% faster - while($row = shift(@$rowcache) || shift(@{$rowcache=$sth->fetchall_a +rrayref(undef, $max_rows)})) { } " one column: 348,140 fetches per second - by far the fastest! " 20 columns: 42,128 fetches per second - now slower than fetchrow_ +arrayref! " Why? Balance time saved making fewer calls with time spent managi +ng more memory " Do your own benchmarks to find what works best for your situation +s


        Timings I would think would be interesting would be the time taken in the $sth_chr->execute statement, the translate call (is it something that could be done more efficiently in the select call?), the time spent in the if(!defined($x)){$x = 0} statements vs having the DB pre-populate that for you, the time spent in the sorting of the keys, and anything else that shows up in a profiling run.

        This (as stated in other posts on this thread) is speculation, but I wonder if the sort could be sped up (if it is a bottleneck) by breaking the keys unsorted out, partitioning the data, sorting each individually, and merging the results. The other possibility, if it is the output that is the bottleneck and not the sort, is that the additional I/O would increase the time. Without profiling data it is impossible to know where to focus.


        You want to read up on bind_columns because it "is the most efficient way to fetch data" (according to the DBI docs).

        my $sth_chr = $dbh->prepare( $sql_chr ); $sth->execute; my %row; my $rv = $sth_chr->bind_columns( \( @row{ @{$sth->{NAME_lc} } } ) ); # + from the DBI docs while ($sth->fetch) { # access data with $row{column_a}, $row{column_b}, etc... }

        If the real query selects columns that are restricted to one value in the where clause then you can speed it up tremendously by not returning the static data. In the example query, you only care about column_c so you could remove everything else from the select (including the join). If this isn't the case, you might want to come up with a better example.

        Timings? Where do those 16 days go?

Re: Processing ~1 Trillion records
by BrowserUk (Pope) on Oct 25, 2012 at 21:16 UTC

    Could you run this version of your script and report back the output?

    # use DBI oracle my $dbh = DBI->connect( "; sid=sid; port=1521", "username +", "password" ) or die "Can't connect to database $DBI::errstr\n"; # The SELECT query # This query return ~945,000,000,000 records $sql_chr = qq{ select column_a, g.column_b, column_c, column_d, column_e from table_f f, table_g g where f.column_a = g.column_b and column_c != '-/-' and column_d = 'STR' and column_e = 'STR1' and column_a = 9 }; print time; my $sth_chr = $dbh->prepare( $sql_chr ); print time; if($sth_chr->execute) { my $now = time; print $now; my $stop = $now + 60; my $s=1; while( my ( @dat ) = $sth_chr->fetchrow ) { $s++; last if time() > $stop; } print $s; } print time;

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

      Off the top of my head:

      1) You're sql query is doing an dynamic hash inner join. You may get better results making sure the joins (as well as your selection criteria) are indexed fields.

      2) You're essentially slurping in db records to markup the fields and dump it to files. If there is any way to can get around reading a trillion records into a hash (i.e., ORDER BY in the database... and the ordered fields are indexed) then you can read/markup/write the records retail without thrashing ram/swap.

      Well, that's my $.02 worth. No, for refunds you'll have to check our customer service department.
Re: Processing ~1 Trillion records
by mpeppler (Vicar) on Oct 26, 2012 at 06:43 UTC
    I've only glanced at the various answers quickly, so maybe I'm off the mark, but:

    My immediate reaction to needing to process that many rows is to try to parallelize the process. It will put a higher load on the DB, but that's what the DB is really good at. Obviously your dataset needs to be partitionable, but I can't imagine a dataset of that size that can't be split in some way.


      Also you need to be sure that it's able to produce results continuously over all those many days. If the program as-writ dies fifteen minutes before starting to write its first file (all data spewing out of RAM only at that point) the entire length of time is waste. No good.
Re: Processing ~1 Trillion records
by aossama (Acolyte) on Oct 25, 2012 at 10:55 UTC
    Sorry, the query returns ~945,000,000,000 not 9,450,000,000

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1000794]
Approved by Corion
Front-paged by marto
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (6)
As of 2016-07-25 03:01 GMT
Find Nodes?
    Voting Booth?
    What is your favorite alternate name for a (specific) keyboard key?

    Results (221 votes). Check out past polls.