Re: Catching Cheaters and Saving Memory

by tilly (Archbishop)
on Oct 13, 2006 at 05:19 UTC


in reply to Catching Cheaters and Saving Memory

I think you need to think veeerrrry carefully about your data volumes and plan accordingly.

Suppose that you have 5 billion lines with 1 billion threads. If the threads are completely evenly dispersed, that's 5 users per thread, and each thread gives 5 * 4 = 20 ordered observations of x being in the same thread as y, or 20 billion observations in all. (Note, since x may or may not vote on y, order matters.) Assuming that there is a high incidence of unique meetings (x meets y in one thread only), you'd have on the order of 20 billion records in your output set. At, say, 20 bytes per output line, that's 400 GB of data you want to spit out in your final report. Your temporary working space is going to be bigger than that, so you should plan on needing a terabyte or so.

Is that a worst case scenario? Unfortunately no. Suppose instead that there are 500 million threads with 10 users each. Then there are 500 million * 10 * 9 = 45 billion observations. That would take over twice the space to represent.
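
For concreteness, here is that back-of-envelope arithmetic as a few lines of Perl. The thread sizes and the 20 bytes per record are the same rough assumptions as above, not measurements:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Rough assumptions from the estimates above, not measured numbers.
    my $bytes_per_record = 20;

    # Evenly dispersed case: 5e9 lines over 1e9 threads = 5 users/thread,
    # so each thread yields 5 * 4 ordered (x, y) observations.
    my $obs_even  = 1e9 * 5 * 4;
    # Denser case: 5e8 threads of 10 users = 10 * 9 observations each.
    my $obs_dense = 5e8 * 10 * 9;

    for ([ 'evenly dispersed', $obs_even ], [ '10 per thread', $obs_dense ]) {
        my ($name, $obs) = @$_;
        printf "%-16s: %.0f observations, roughly %.0f GB of output\n",
            $name, $obs, $obs * $bytes_per_record / 1e9;
    }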

OK, so once you estimate the data volume and make appropriate preparations so you don't run out of disk space, let's discuss performance. The idea of using a dbm (e.g. BerkeleyDB) is a good one. However, you're going to be doing tens of billions of reads and writes into this data structure. Let's assume 20 billion read/write pairs, an average of 1 disk seek per write, and an average disk seek time of 5 ms. You're doing 200 seeks per second, 12,000/minute, 720,000/hour, 17.28 million per day. At that rate you'll take 1157.4 days to finish!
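
To make the access pattern concrete, here is a minimal sketch of the dbm idea using DB_File. The input format (one "thread_id user_id" pair per line, grouped by thread) is a made-up placeholder, not the actual log format from the original post:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Sketch only: count ordered (x, y) co-occurrences in a tied BTree file.
    my %count;
    tie %count, 'DB_File', 'pair_counts.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
        or die "Cannot open pair_counts.db: $!";

    my ($current_thread, @users);

    sub flush_thread {
        for my $x (@users) {
            for my $y (@users) {
                next if $x eq $y;
                $count{"$x,$y"}++;   # each increment is a dbm read plus a write
            }
        }
        @users = ();
    }

    while (<>) {
        my ($thread, $user) = split;
        flush_thread() if defined $current_thread and $thread ne $current_thread;
        $current_thread = $thread;
        push @users, $user;
    }
    flush_thread() if @users;

    untie %count;

Every one of those increments is a potential seek, which is where the latency estimate above comes from.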

Houston, we may have a problem. That estimate may be off by an order of magnitude either way, but your boss won't like even the best case.

Now how can we speed this up? Well the performance problem here is due to latency, and you can parallelize your job. You can have multiple processes running in parallel on different parts of your file. You need some logic to coordinate locking. You can probably speed it up by an order of magnitude or so that way.

A better trick is that you can put multiple machines to work, each one handling a different range of users. (Note, you'll want to consider all other users observed and voted on by a specific range of users.) Because the cost of skipping over lines in the file is minimal, you could have 10 machines, each of which is handling the votes from 1/10 of the users, and you'll get done approximately 10 times as fast. In fact you could scale this up quite a bit more than that if you have a large compute farm.
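
As a sketch of that partitioning (assuming numeric user ids and the same made-up "thread_id user_id" line format as before), each machine might run something like this, counting only the voters in its own slice while still seeing every user in every thread:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Run one copy per machine, e.g.:
    #   perl count_slice.pl 3 10 < votes.log > slice3.counts
    my ($slice, $nslices) = @ARGV;

    my (%count, $current_thread, @users);

    sub flush_thread {
        for my $x (@users) {
            next unless $x % $nslices == $slice;   # skip voters owned by other machines
            for my $y (@users) {
                next if $x == $y;
                $count{"$x,$y"}++;
            }
        }
        @users = ();
    }

    while (<STDIN>) {
        my ($thread, $user) = split;
        flush_thread() if defined $current_thread and $thread ne $current_thread;
        $current_thread = $thread;
        push @users, $user;
    }
    flush_thread() if @users;

    print "$_ $count{$_}\n" for sort keys %count;

The same slicing works just as well for multiple processes on one machine, and in practice each slice's counts would still go into a dbm or into the merge files described below rather than an in-memory hash.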

If you don't have a compute farm, you could consider purchasing compute time from Amazon.

There are also alternate strategies that might be faster. One that requires a lot more of you is doing a custom merge process. The idea here is similar to a merge sort, except that as you're merging, you also group and sum. One way to implement it is this:

  1. Process your input data and write out each session to a file, ordered by user responding, then other user seen in thread.
  2. Merge pairs of files, summing lines where appropriate.
  3. Repeat the merging process until you have one file sorted by user responding, then other user seen in the thread. (With 1 billion sessions this should take about 30 passes.)
(Technical note: I would not create a billion files. Filesystems don't like that. I'd create 2 files, writing sessions alternately to one or the other. Then I'd merge them into 2 files with pairs of sessions. Then merge those into 2 files with groups of 4 sessions. And so on until I was done.)

Now how fast is that? Well, let's take my original ballpark estimate of 400 GB of data. All of that data has to be read and written 30 times. That's 12 terabytes written and 12 terabytes read, 24 terabytes of I/O in all. At, say, 40 MB/s, that's 2.4 GB/min, 144 GB/hour, 3.456 TB/day, or about a week to run. (Note how much faster this is than the dbm approach!) Now you're hopefully maxing out your disk throughput, so there are no gains from multiple processes, but you can still put multiple computers on it.
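
For example, one merge pass might look roughly like this, assuming each input file holds lines of the form "x,y count" already sorted on the "x,y" key (again a made-up intermediate format):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Merge two sorted "key count" files, summing counts when the same key
    # appears in both.  Applied repeatedly, this implements the passes above.
    my ($file_a, $file_b) = @ARGV;
    open my $fh_a, '<', $file_a or die "$file_a: $!";
    open my $fh_b, '<', $file_b or die "$file_b: $!";

    sub next_rec {
        my $fh = shift;
        defined(my $line = <$fh>) or return;
        chomp $line;
        return split ' ', $line, 2;
    }

    my ($ka, $ca) = next_rec($fh_a);
    my ($kb, $cb) = next_rec($fh_b);

    while (defined $ka or defined $kb) {
        if (!defined $kb or (defined $ka and $ka lt $kb)) {
            print "$ka $ca\n";
            ($ka, $ca) = next_rec($fh_a);
        }
        elsif (!defined $ka or $kb lt $ka) {
            print "$kb $cb\n";
            ($kb, $cb) = next_rec($fh_b);
        }
        else {                                  # same key in both: group and sum
            print "$ka ", $ca + $cb, "\n";
            ($ka, $ca) = next_rec($fh_a);
            ($kb, $cb) = next_rec($fh_b);
        }
    }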

I think you can see that this approach should take a lot more programming, but the end result is also likely to be a lot faster.

Incidentally if you want data sorted, say from "most intersections" on down, sorting will be about as slow as this merging process was. I'd strongly recommend that before you do that final sort, you do a filter pass and only sort a fraction of your data.
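
For instance, assuming the merged output is those "x,y count" lines, a filter pass like this (the file name and the threshold of 5 are arbitrary placeholders) throws away the long tail of one-off meetings before the expensive sort:

    # Keep only pairs that met at least 5 times, then sort by count, descending.
    perl -ane 'print if $F[1] >= 5' merged.counts | sort -k2,2nr > suspects.txt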

A final suggestion that is very, very important. Whatever approach you take, make it give you partial progress reports so you know how fast it is going! It is one thing to write a process that will take 2 years to finish on one machine, which can easily be parallelized and run in a couple of days for a couple thousand dollars on Amazon's computer cluster. It is quite another to write a process that will take 2 years to finish on one machine, leave it running for a month, and only then get worried about when it will be done! The difference between those two is having a program that tells you how quickly it is going so you can estimate how long it will take fairly early in the process.
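
A minimal version of that, assuming you know roughly how many input lines there are, is just a counter and a clock in the main loop:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my $total_lines = 5_000_000_000;   # rough estimate from above, not exact
    my $start       = time;

    while (<>) {
        # ... real per-line processing goes here ...
        if ($. % 1_000_000 == 0) {
            my $rate = $. / (time - $start);
            printf STDERR "%d lines (%.0f/sec), roughly %.1f hours left\n",
                $., $rate, ($total_lines - $.) / $rate / 3600;
        }
    }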


Replies are listed 'Best First'.
Re^2: Catching Cheaters and Saving Memory
by BrowserUk (Patriarch) on Oct 13, 2006 at 10:01 UTC

    Update: Link corrected. Thanks to OverlordQ.

    What do you think to a single perl script, ~20 hours and no additional storage requirements on commodity hardware?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Did you mean to link to here instead?
      I'd say that you're absolutely right to reframe the problem. And I'd like to think that I'd have thought of that if my wife hadn't been telling me to wrap up and get to bed. :-)

      In fact I was already on track for doing that in my comment about filtering the data set before sorting it. You just moved the filtering earlier...
