|laziness, impatience, and hubris|
Re^2: size on disk of tied hashesby danderson (Beadle)
|on Aug 11, 2004 at 21:21 UTC||Need Help??|
Thanks for your input, everybody. Wow, tilly, that was a heck of a post.
I know how I'm going to do this now, so I might as well explain it (so you don't all think I'm a pointless waste of time!)
I have worked out the data to be under 50 million datapoints. For the initial sorting I'm going to borrow two gigs of ram to minimize hard drive hits (too bad I can't add space to the L1 and L2 caches... yes, it's going to run on x86, yes, I know that's suboptimal). I'm also going to borrow a second, large harddrive, and throw in a small one I have handy.
The sorting algorithm, roughly, will be:
-Save the data to the borrowed HD. Yeah, I know, it might take a while - that's OK, I've got no choice on this one. Save it to flat files of an appropriate size (to be empirically tested first)
-Split the KVPs. That is, read each key/value pair out, append the key to a keylist (with 'pointer' data) and the data to a data file. The keys file will be on the small HD, the data will be split among several files (obviously <2GB) on the purchased HD. This splitting will proceed at the slowest of the HD's read/write accesses, but since they'll all be sequential there will be very good (write or read)/page access and page access/cache ratios, effectively approaching the maximum read/write speed of the slowest HD. This requires trivial CPU and RAM, so they aren't considerations.
-sort. Not the data, obviously, I'm going to leave that well enough alone for now. The keys and their new (much smaller) 'pointers' (64 bits/8 bytes will do it no problem, split into a file part and an offset part) will be sorted in a B-tree. The tree unfortunately will have a bit of a performance problem, taking only slightly better than k*n*log(n) time to complete (assuming berkely does early splitting, which is probable). Since these accesses will be effectively random, that's potentially quite a bit over 50 million HD cache misses. At 8.5ms/miss (the small one actually is a barracuda, so thanks for that link, tilly!) plus an average rotational latency of ~5.5ms, that's 14ms/miss. That's three years. Gah.
Thankfully, there's a workaround - I've done some quick calculations on sample sets of data, and the keys (yes, there will be duplicates) are almost all under 16 characters, and I haven't seen any over 32. Note that I said characters! Since I don't mind getting a few false positiveS, these can (reasonably quickly) be converted into a binary representation (throwing everything that's not a char out, and lowercasing all chars, then subbing from a to get a number that's expressable in 5 bits). With 64 bit keys, in 32 bytes I can express keys of up to 38 'characters' - perfect! B-trees shouldn't need more than 8 bytes per node (prev, next, lt, gt, each with 64bits), so each element will be expressable in 40 bytes. With 2G of RAM, very little of the data will have to be on the disk (I'm going to have to look into semi-memory-mapped files, unfortunately). Huzzah! Now my sort time is large, certainly, but easily under a day, especially considering the good cache hit ratio for the 'top' of the tree. Once done, writing to the HD at half it's max sustained write speed (I figure this is a fair metric) is 2G/20Mbytes/s = under two minutes! Excellent.
So by this point the data will be on the correct hard drive, the keys will be on the smaller one, and the memory and spare HD can be returned. Cool.
Access time? No problem. With a width of four (less than what Berkely acutally uses, I believe) this will take under 8 hard drive hits, given a balanced tree. That's a tenth of a second, which admittedly could be better, but ah well. Once you consider that the top of the tree is probably going to stay in the hard drive's cache, this will probably be around two thirds to half of that on average. Then there's the hit to the data drive, which will be around or under 20ms. Sure, this is slow, but (in the beginning - oh god, don't ask me how I'm going to scale it if/when the time comes) usage probably won't be over 1/s, so we're good.
And... the kicker. Updates! Updates will be painful - and at the rate of around 30,000/day, give or take, that's one every three seconds! Fortunately the insertion will only be one HD hit slower than searches, so it'll be OK - the hard part, which I won't go into here, is finding the correct 30k to update (long, long explanation).
So: it's doable. A hash would be great for it's constant-time access, but building and updating it would be painful (I think - I've dealt with lots of weird tree variations, but most of the hashing algorithms I've seen have been the trivial case, so it's entirely possible there's a good hash solution out there).