in reply to Netflix (or on handling large amounts of data efficiently in perl)
Shortly after the contest started I ran into the exact same problem you've described. Very early on, while still in the back-of-the-napkin estimation phase, I realized that using any kind of database was completely out of the question, and that without access to a supercomputer I couldn't even store the entire matrix in memory all at once. The data isn't really relational, since it's just one giant, flat matrix; all you're basically doing is counting. But running any of the common statistical algorithms (lifted verbatim from "Numerical Recipes in C") requires an overwhelming number of multiplications and passes over the entire dataset, so any disk swapping (such as using mmap) would cost far too much time. The matrix is extremely sparse: something like 99% empty.
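For a sense of scale, the napkin estimate can be sketched roughly as follows. This is Python used as a stand-in, not the code from this post. The dataset sizes are the publicly reported Netflix Prize figures, which the post itself does not state, and the 8-byte values and 4-byte indices are assumptions.

```python
# Publicly reported Netflix Prize training-set sizes (assumed; not from the post).
N_USERS = 480_189
N_MOVIES = 17_770
N_RATINGS = 100_480_507


def dense_bytes(rows, cols, cell_bytes=8):
    """Bytes needed for a dense rows x cols matrix of fixed-size cells."""
    return rows * cols * cell_bytes


def ccs_bytes(cols, nnz, value_bytes=8, index_bytes=4):
    """Bytes needed for compressed column storage: one value and one row
    index per stored entry, plus cols + 1 column offsets."""
    return nnz * (value_bytes + index_bytes) + (cols + 1) * index_bytes


def empty_fraction(rows, cols, nnz):
    """Fraction of matrix cells that hold no rating."""
    return 1.0 - nnz / (rows * cols)


if __name__ == "__main__":
    gb = 1e9
    print(f"dense:  {dense_bytes(N_USERS, N_MOVIES) / gb:.1f} GB")
    print(f"sparse: {ccs_bytes(N_MOVIES, N_RATINGS) / gb:.1f} GB")
    print(f"empty:  {empty_fraction(N_USERS, N_MOVIES, N_RATINGS):.1%}")
```

Under those assumptions the dense matrix of doubles runs to tens of gigabytes, while storing only the ratings that exist takes a little over one gigabyte.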
So I dropped down to C via XS and serialized the data to disk as raw binary files in Compressed Column Storage format, packing ints and doubles. I then put the binary files on Amazon's S3 and launched a few EC2 instances to handle separate chunks of the data files.
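To illustrate the storage layout, here is a minimal sketch of Compressed Column Storage with ints and doubles packed into raw bytes. It uses Python's struct module (which plays much the same role as Perl's pack) rather than the C/XS code described above. The header layout, the little-endian byte order, and all function names are assumptions made for the example, not the actual file format used.

```python
import struct


def build_ccs(n_cols, triples):
    """Build (col_ptr, row_idx, values) from (row, col, value) triples.

    Entries are grouped by column and sorted by row within each column.
    col_ptr[c]:col_ptr[c + 1] is the slice of row_idx/values for column c.
    """
    ordered = sorted(triples, key=lambda t: (t[1], t[0]))
    col_ptr = [0] * (n_cols + 1)
    for _, col, _ in ordered:
        col_ptr[col + 1] += 1          # count entries per column
    for c in range(n_cols):
        col_ptr[c + 1] += col_ptr[c]   # prefix sum turns counts into offsets
    row_idx = [row for row, _, _ in ordered]
    values = [float(val) for _, _, val in ordered]
    return col_ptr, row_idx, values


def pack_ccs(n_rows, col_ptr, row_idx, values):
    """Serialize to bytes: 3-int header, then offsets, row indices, values.

    Hypothetical layout: little-endian 32-bit ints and 64-bit doubles.
    """
    n_cols, nnz = len(col_ptr) - 1, len(values)
    return b"".join([
        struct.pack("<3i", n_rows, n_cols, nnz),
        struct.pack(f"<{n_cols + 1}i", *col_ptr),
        struct.pack(f"<{nnz}i", *row_idx),
        struct.pack(f"<{nnz}d", *values),
    ])


def unpack_ccs(blob):
    """Inverse of pack_ccs: returns (n_rows, col_ptr, row_idx, values)."""
    n_rows, n_cols, nnz = struct.unpack_from("<3i", blob, 0)
    off = 12
    col_ptr = list(struct.unpack_from(f"<{n_cols + 1}i", blob, off))
    off += 4 * (n_cols + 1)
    row_idx = list(struct.unpack_from(f"<{nnz}i", blob, off))
    off += 4 * nnz
    values = list(struct.unpack_from(f"<{nnz}d", blob, off))
    return n_rows, col_ptr, row_idx, values


def column(col_ptr, row_idx, values, col):
    """Return the (row indices, values) stored for one column."""
    lo, hi = col_ptr[col], col_ptr[col + 1]
    return row_idx[lo:hi], values[lo:hi]
```

Because each column is one contiguous slice, a file laid out this way can be split by column range, so separate machines can each work on their own chunk.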
I thought calculating Pearson's correlation coefficient for every movie against every other movie (and likewise for every user against every other user) might be a good starting point, but completing this calculation would require something like 100 servers running for 70 years (or 1000 for 7 years).
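As a rough sketch of the per-pair work, Pearson's correlation between two sparse columns can be computed over only the rows they share. This is Python rather than the C used here, and the function names are hypothetical. It assumes each column's row indices are sorted ascending, and it returns None when fewer than two rows overlap or one side has zero variance.

```python
import math


def pearson_sparse(rows_a, vals_a, rows_b, vals_b):
    """Pearson correlation over the rows present in both sparse columns.

    rows_a and rows_b must be sorted ascending. Returns None when the
    correlation is undefined (fewer than 2 shared rows, or zero variance).
    """
    i = j = n = 0
    sa = sb = saa = sbb = sab = 0.0
    while i < len(rows_a) and j < len(rows_b):
        if rows_a[i] < rows_b[j]:
            i += 1
        elif rows_a[i] > rows_b[j]:
            j += 1
        else:
            # Row is rated in both columns: accumulate the running sums.
            x, y = vals_a[i], vals_b[j]
            sa += x
            sb += y
            saa += x * x
            sbb += y * y
            sab += x * y
            n += 1
            i += 1
            j += 1
    if n < 2:
        return None
    cov = sab - sa * sb / n
    var_a = saa - sa * sa / n
    var_b = sbb - sb * sb / n
    if var_a <= 0 or var_b <= 0:
        return None
    return cov / math.sqrt(var_a * var_b)


def pair_count(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2
```

The pair count is what drives the cost. Taking the publicly reported contest sizes (17,770 movies and 480,189 users, which are not stated in this post), pair_count gives about 158 million movie pairs but about 115 billion user pairs, each needing its own merge pass.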
Seeing as this was just an interesting side project to goof around with, I wasn't interested in racking up thousands of dollars in hosting expenses, so I gave up on this approach. It seems that most of the folks on the leaderboard have been using SVD, and I have no idea how they are actually computing this using desktops; maybe I'm missing something obvious. But in all it was a fun learning experience. I had no idea it would end up being so complex along the way.
