|Perl: the Markov chain saw|
Netflix (or on handling large amounts of data efficiently in perl)by Garp (Acolyte)
|on Dec 24, 2008 at 04:16 UTC||Need Help??|
Garp has asked for the
wisdom of the Perl Monks concerning the following question:
I'm fairly much a n00b when it come to Perl, but as I find myself with a fair bit of spare time on my hands I'm wanting to learn to use it a bit more productively. I'm trying to go from just 'script hacking' stuff that aids sysadmin work to something a little more effective.
This is probably a classic case of running before I can walk, but I thought I'd look at the netflix challenge, not out of any naive belief that I'd be successful but because it gives me specific goals and stuff to get my teeth into. That and something about it is appealing to my inner geek.
So I hit the first hurdle, and sort of fell flat on my face. Specifically it's the sheer weight of data. I successfully imported that into MySQL and did some basic data mining there, but I started pondering about more efficient ways to handle the data, and whether it would be possible to speed everything up by storing the data in memory somehow for efficient access. MySQL is handy and all, but seems a pointless overhead for what is essentially static data.
For those unfamiliar with the challenge the data set contains over 100 million movie ratings (all integers, between, 1 -> 5), done by 17770 users on 480,189 films. User id's aren't contiguous, just to be helpful, and they can be between 1 and 2649429. I'm half inclined to map the netflix user ids to new sequential ones but I'm not sure if there would be any benefit from that as it would still be a long integer IIRC.
Data comes as a set of text files, one per movie, containing userid, rating and date. From what I've gathered Netflix's own engine doesn't pay attention to the date of the voting, nor do those of the top teams, correlation between the date of the vote and a user liking or disliking a film is negligible (hardly surprising.) So that leaves me with movie_id, user_id and rating. 480,189 x 17,770 = 8,532,958,530 data points at a size of 1 byte each, I assume, 8.5Gb of data! The reality is naturally that not every user has rated every film, hence the 100m figure above. If my math is remotely right thats only 95Mb of actual rating data, but then each vote will need to be paired to some form index.
The original data in plain text form takes up about 2Gb of disk space. Loading that into memory is just asking for trouble, and impossible given the constraints of the machine I've got access to (Windows Vista laptop with 2Gb RAM), let alone the data being inefficiently indexed.
One user in the challenge, called icefox, has produced an open source framework in C++ that manages the feat quite effectively (google "icefox netflix framework"). Data is stored in two files, one containing every vote and user sorted by movie id, and a second file filled with references to the starting location for each movie in the first file. This is then mmap'd into memory for use.
Does anyone have any tips, suggested reading or the like that would be a good way to start learning how to handle such large amounts of data, if the above approach is a poor idea; or indeed any critique or suggestions at all.