Re: Processing ~1 Trillion records

by DrHyde (Prior)
on Oct 25, 2012 at 11:08 UTC (#1000828)

in reply to Processing ~1 Trillion records

A trillion records? A trillion *bytes* is roughly 1TB. Let's assume that your records are, on average, 32 bytes - they're probably bigger, but that doesn't really matter. So you need to read 32TB, process it, and write 32TB. I don't think it's at all unreasonable to take 16 days for 64TB of I/O.
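To put that estimate in terms of sustained throughput, here is a minimal sketch of the arithmetic. It uses only the figures assumed above (a trillion records, 32 bytes each, everything read once and written once, 16 days); the function name `required_mb_per_sec` is made up for illustration.

```perl
use strict;
use warnings;

# Sustained throughput (decimal MB/s) needed to read and then write
# $records records of $record_bytes bytes each within $days days.
sub required_mb_per_sec {
    my ($records, $record_bytes, $days) = @_;
    my $total_bytes = $records * $record_bytes * 2;   # read once + write once
    my $seconds     = $days * 24 * 60 * 60;
    return $total_bytes / $seconds / 1e6;
}

# Figures assumed in the post, not measured values.
printf "%.1f MB/s sustained\n", required_mb_per_sec(1e12, 32, 16);
# prints "46.3 MB/s sustained"
```

Larger records or a shorter deadline scale the figure linearly, so it is cheap to re-run with real numbers once the average record size is known.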

As you've been told, you need to profile your code. You actually need to do that for any performance problem, not just this one.
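A proper profiler such as Devel::NYTProf from CPAN is the usual tool for this. As a cruder first pass, wall-clock timing of each phase can show where the time goes. The sketch below assumes the job splits into fetch, transform and write phases; those phase names, the `timed` helper and the stand-in workloads are all hypothetical and would be replaced by the real steps.

```perl
use strict;
use warnings;
use Time::HiRes ();

my %elapsed;   # phase name => accumulated wall-clock seconds

# Run a code block and add its wall-clock time to the named phase.
sub timed {
    my ($phase, $code) = @_;
    my $start  = Time::HiRes::time();
    my @result = $code->();
    $elapsed{$phase} += Time::HiRes::time() - $start;
    return wantarray ? @result : $result[0];
}

# Stand-in work only: swap in the real database fetch, per-record
# processing and CSV output.
for my $batch (1 .. 3) {
    my @rows  = timed('fetch',     sub { map { [ $_, "value$_" ] } 1 .. 1000 });
    my @lines = timed('transform', sub { map { join(',', @$_) } @rows });
    timed('write', sub { my $bytes = 0; $bytes += length($_) for @lines; $bytes });
}

printf "%-10s %.6f s\n", $_, $elapsed{$_} for sort keys %elapsed;
```

If fetch and write dominate the totals, that points towards an I/O-bound job; if transform dominates, the Perl code itself is worth profiling line by line.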

I, of course, have not profiled your code, so everything from here on is mere speculation, but I bet that you are I/O bound. You have at least three places where I/O may be limiting you: reading from the database (especially if you've only got one database server); transmitting data across the network from the database to the machine your Perl code is running on; and writing the CSV back out. At least the first two of those can be minimised by partitioning the data and the workload and parallelising everything across multiple machines. You *may* be able to partition the data such that you can have separate workers producing CSV files too.
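One simple way to partition the workload is by key range. This is a sketch only, assuming the records have a dense numeric key; the table name `records`, the column `id`, the `partition_range` helper and the output file names are all hypothetical. Each worker would run its own query and write its own CSV file.

```perl
use strict;
use warnings;

# Split the inclusive range [$min, $max] into at most $workers contiguous,
# non-overlapping chunks of near-equal size.
sub partition_range {
    my ($min, $max, $workers) = @_;
    my $total = $max - $min + 1;
    my $base  = int($total / $workers);
    my $extra = $total % $workers;    # the first $extra chunks get one more key
    my @chunks;
    my $lo = $min;
    for my $i (0 .. $workers - 1) {
        my $size = $base + ($i < $extra ? 1 : 0);
        next if $size == 0;           # more workers than keys
        push @chunks, [ $lo, $lo + $size - 1 ];
        $lo += $size;
    }
    return @chunks;
}

# Eight workers over a trillion keys (assumes a 64-bit perl).
my @chunks = partition_range(1, 1_000_000_000_000, 8);
for my $i (0 .. $#chunks) {
    my ($lo, $hi) = @{ $chunks[$i] };
    # Hypothetical query and file name; nothing is executed here.
    printf "worker %d: SELECT ... FROM records WHERE id BETWEEN %d AND %d -> part-%03d.csv\n",
        $i, $lo, $hi, $i;
}
```

Range partitioning only balances well if the keys are evenly distributed. With a skewed or sparse key, hashing the key modulo the number of workers is a common alternative, at the cost of less sequential reads on the database side.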
