http://www.perlmonks.org?node_id=1000845


in reply to Re^2: Processing ~1 Trillion records
in thread Processing ~1 Trillion records

Profile.

The timings I would find interesting are the time taken in the $sth_chr->execute statement, the translate call (is that something that could be done more efficiently in the SELECT itself?), the time spent in the if (!defined($x)) { $x = 0 } statements versus having the DB pre-populate that value for you, the time spent sorting the keys, and anything else that shows up in a profiling run.
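
For the whole script, Devel::NYTProf from CPAN will attribute wall time to every statement (run perl -d:NYTProf script.pl, then nytprofhtml for the report). For quick spot checks, a Time::HiRes stopwatch is enough. A minimal sketch; the workload below is a stand-in, and in the real script you would wrap the OP's $sth_chr->execute and translate calls instead:

    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    # Bracket a suspect section and report how long it took on STDERR.
    sub timed {
        my ($label, $code) = @_;
        my $t0  = [gettimeofday];
        my @ret = $code->();
        printf STDERR "%-10s %.3fs\n", $label, tv_interval($t0);
        return wantarray ? @ret : $ret[0];
    }

    # Demo with a stand-in workload; wrap $sth_chr->execute(...), the
    # translate() call, and the key sort the same way in the real script.
    my @sorted = timed(sort => sub { sort { $a <=> $b } map { rand } 1 .. 100_000 });
    print scalar(@sorted), " keys sorted\n";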

This (as stated in other posts on this thread) is speculation, but I wonder whether the sort could be sped up (if it is a bottleneck) by writing the keys out unsorted, partitioning the data, sorting each partition individually, and merging the results. The flip side, if the output rather than the sort is the bottleneck, is that the additional I/O would increase the time. Without profiling data it is impossible to know where to focus.
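
If profiling does point at the sort, the partitioned approach might look something like this. A toy sketch that range-partitions keys by their first character; the bucket boundary and the in-memory buckets are simplifications, since at this scale each bucket would be a file on disk:

    use strict;
    use warnings;

    # Range-partition the keys, sort each small bucket, then emit the
    # buckets in order.  Because the buckets are disjoint key ranges, the
    # concatenation of sorted buckets is itself fully sorted, so no final
    # merge pass over one giant list is needed.
    my %bucket;
    while (my $key = <STDIN>) {
        chomp $key;
        push @{ $bucket{ substr $key, 0, 1 } }, $key;  # partition by first char
    }
    for my $range (sort keys %bucket) {
        print "$_\n" for sort @{ $bucket{$range} };
    }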

--MidLifeXis

Re^4: Processing ~1 Trillion records
by sundialsvc4 (Abbot) on Oct 25, 2012 at 16:06 UTC

    Indeed, if you can extract a list of what the so-called “markers” and “lines” are (which of course you can), then the problem decomposes quite nicely into partitions. All that you really need are indexes on the columns in question.
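
    A sketch via DBI; the table and column names (records, marker, line) and the SQLite connection string are invented for illustration:

        use strict;
        use warnings;
        use DBI;

        # Connection string is a placeholder; use the real driver/credentials.
        my $dbh = DBI->connect('dbi:SQLite:dbname=records.db', '', '',
                               { RaiseError => 1 });

        # Indexes on the partitioning columns let the planner seek straight
        # to each marker/line partition instead of scanning the whole table.
        $dbh->do('CREATE INDEX IF NOT EXISTS idx_marker      ON records (marker)');
        $dbh->do('CREATE INDEX IF NOT EXISTS idx_marker_line ON records (marker, line)');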

    It might also be advantageous to define a view that corresponds to this query, so that you can examine and then optimize the hades out of that view. In some DBMSes, defining a view serves to introduce the query as one the optimizer now knows to be frequently used, and it certainly puts that “6 days long” requirement squarely into the DBA’s field of view.
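
    A sketch with the same invented names, reusing the $dbh from the index example above; the SELECT body is a stand-in for the real query:

        # Wrap the long-running query in a view so it can be EXPLAINed and
        # tuned in one place rather than wherever it is embedded in code.
        $dbh->do(q{
            CREATE VIEW IF NOT EXISTS marker_counts AS
            SELECT marker, line, COUNT(*) AS n
            FROM   records
            GROUP  BY marker, line
        });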