Re: Catching Cheaters and Saving Memory
by tilly (Archbishop) on Oct 13, 2006 at 05:19 UTC ( [id://578050] )
I think you need to think veeerrrry carefully about your data volumes and plan accordingly. Suppose that you have 5 billion lines with 1 billion threads. If the threads are completely evenly dispersed, that's 5 users per thread, or 5 * 4 = 20 ordered pairs each, for 20 billion observations of x being in the same thread as y. (Note: since x may or may not vote on y, order matters.) Assuming a high incidence of unique meetings (x meets y in one thread only), you'd have on the order of 20 billion records in your output set. At, say, 20 bytes per output line, that's 400 GB of data you want to spit out in your final report. Your temporary working space is going to be bigger than that, so you should plan on needing a terabyte or so.

Is that a worst-case scenario? Unfortunately, no. Suppose that there are 500 million threads, evenly spaced at 10 users per thread. Then there are 500 million * 10 * 9 = 45 billion observations. That would take over twice the space to represent.

OK, so once you estimate the data volume and make appropriate preparations to not run out of disk space, let's discuss performance. The idea of using a dbm (e.g. BerkeleyDB) is a good one. However, you're going to be doing tens of billions of reads and writes into this data structure. Let's assume 20 billion read/write pairs, an average of 1 disk seek per write, and an average disk seek time of 5 ms. You're doing 200 seeks per second, 12,000/minute, 720,000/hour, 17.28 million per day. At that rate you'll take 1157.4 days to finish! Houston, we may have a problem. That estimate may be off by an order of magnitude either way, but your boss won't like even the best case.

Now how can we speed this up? The performance problem here is due to latency, and you can parallelize your job. You can have multiple processes running in parallel on different parts of your file, with some logic to coordinate locking. You can probably speed it up by an order of magnitude or so that way.
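These back-of-the-envelope figures are easy to check; a quick script (all numbers are the same assumptions as above, just computed explicitly) reproduces them:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Case 1: 5 billion lines over 1 billion threads = 5 users per thread;
# each thread yields 5 * 4 ordered (x, y) pairs.
my $case1 = 1e9 * 5 * 4;
printf "Case 1: %g observations\n", $case1;    # 2e10

# Case 2: 500 million threads of 10 users each; 10 * 9 ordered pairs.
my $case2 = 5e8 * 10 * 9;
printf "Case 2: %g observations\n", $case2;    # 4.5e10

# One 5 ms disk seek per write = 200 seeks/sec.
my $seeks_per_day = 200 * 60 * 60 * 24;        # 17.28 million/day
printf "Days for %g seeks: %.1f\n", $case1, $case1 / $seeks_per_day;  # 1157.4
```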
A better trick is to put multiple machines to work, each one handling a different range of users. (Note: you'll want to consider all other users observed and voted on by a specific range of users.) Because the cost of skipping over lines in the file is minimal, you could have 10 machines, each handling the votes from 1/10 of the users, and you'll get done approximately 10 times as fast. In fact you could scale this up quite a bit more than that if you have a large compute farm. If you don't have a compute farm, you could consider purchasing compute time from Amazon.

There are also alternate strategies that might be faster. One that requires a lot more of you is a custom merge process. The idea here is similar to a merge sort, except that as you're merging, you may also group and sum. One way to implement it: repeatedly merge sorted runs of (x, y, count) records, summing the counts of duplicate pairs as you go.
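As a rough sketch of that merge-and-sum step (the file names and the `x y count` record format are assumptions for illustration, not from the original post): merging two sorted runs is an ordinary two-way merge, except that when the same (x, y) pair heads both inputs you emit one record with the summed count.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Merge two sorted run files of "x y count" records, summing counts when
# the same (x, y) pair appears in both. Repeating this pairwise over all
# runs gives the merge-sort-like pass described above.
sub merge_sum {
    my ($in1, $in2, $out) = @_;
    open my $a, '<', $in1 or die "$in1: $!";
    open my $b, '<', $in2 or die "$in2: $!";
    open my $o, '>', $out or die "$out: $!";

    # Read one record; split ' ' also strips the trailing newline.
    my $read = sub {
        my $fh = shift;
        defined(my $line = <$fh>) or return undef;
        my ($x, $y, $n) = split ' ', $line;
        return ["$x $y", $n];
    };

    my ($ra, $rb) = ($read->($a), $read->($b));
    while (defined $ra and defined $rb) {
        my $cmp = $ra->[0] cmp $rb->[0];
        if    ($cmp < 0) { print $o "$ra->[0] $ra->[1]\n"; $ra = $read->($a) }
        elsif ($cmp > 0) { print $o "$rb->[0] $rb->[1]\n"; $rb = $read->($b) }
        else {
            # Same pair in both runs: group and sum.
            print $o "$ra->[0] ", $ra->[1] + $rb->[1], "\n";
            $ra = $read->($a);
            $rb = $read->($b);
        }
    }
    # Drain whichever input still has records.
    while (defined $ra) { print $o "$ra->[0] $ra->[1]\n"; $ra = $read->($a) }
    while (defined $rb) { print $o "$rb->[0] $rb->[1]\n"; $rb = $read->($b) }
    close $_ for $a, $b, $o;
}
```

Because each pass reads and writes sequentially, you pay streaming bandwidth instead of one seek per record, which is where the speedup comes from.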
I think you can see that this approach takes a lot more programming, but the end result is also likely to be a lot faster. Incidentally, if you want the data sorted, say from "most intersections" on down, that sort will be about as slow as the merging process was. I'd strongly recommend that before you do the final sort, you do a filter pass and only sort a fraction of your data.

A final suggestion that is very, very important: whatever approach you take, make it give you partial progress reports so you know how fast it is going! It is one thing to write a process that would take 2 years to finish on one machine but can easily be parallelized and run in a couple of days for a couple thousand dollars on Amazon's compute cluster. It is quite another to write a process that will take 2 years to finish on one machine, leave it running for a month, and only then get worried about when it will be done! The difference between those two is having a program that tells you how quickly it is going, so you can estimate how long it will take fairly early in the process.
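The progress-report idea boils down to a few lines: record a start time, count records, and every so often print the rate and a projected finish. A minimal sketch (the once-per-million interval and the 20-billion total are assumed figures, not from the original post):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Given records done, estimated total, and elapsed seconds, project
# how many days of work remain at the current rate.
sub eta_days {
    my ($done, $total, $elapsed) = @_;
    my $rate = $done / $elapsed;               # records per second
    return ($total - $done) / $rate / 86400;   # seconds left -> days
}

my $total = 20_000_000_000;   # assumed estimate of total records
my $start = time;
my $done  = 0;

# Call tick() once per record processed; it reports once per million.
sub tick {
    return if ++$done % 1_000_000;
    my $elapsed = time - $start;
    printf STDERR "%d done, %.0f/sec, ETA %.1f days\n",
        $done, $done / $elapsed, eta_days($done, $total, $elapsed);
}
```

Even a crude ETA like this tells you within the first hour whether you are looking at days or years.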
In Section: Seekers of Perl Wisdom