http://www.perlmonks.org?node_id=382099


in reply to Re: size on disk of tied hashes
in thread size on disk of tied hashes

Thanks for your input, everybody. Wow, tilly, that was a heck of a post.

I know how I'm going to do this now, so I might as well explain it (so you don't all think I'm a pointless waste of time!).

I have worked out the data to be under 50 million datapoints. For the initial sorting I'm going to borrow two gigs of RAM to minimize hard drive hits (too bad I can't add space to the L1 and L2 caches... yes, it's going to run on x86; yes, I know that's suboptimal). I'm also going to borrow a second, large hard drive, and throw in a small one I have handy.

The sorting algorithm, roughly, will be:

- Save the data to the borrowed HD. Yeah, I know, it might take a while - that's OK, I've got no choice on this one. Save it to flat files of an appropriate size (to be determined empirically first).

- Split the KVPs. That is, read each key/value pair out, append the key to a key list (with 'pointer' data) and the value to a data file. The keys file will live on the small HD; the data will be split among several files (each obviously under 2GB) on the borrowed HD. This split will run at the speed of the slowest HD's reads and writes, but since they'll all be sequential, the ratios of useful reads/writes to page accesses and of page accesses to cache misses will be very good, effectively approaching the maximum sustained read/write speed of the slowest HD. It needs trivial CPU and RAM, so those aren't considerations.
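As a rough sketch of that split (assuming the raw dump is tab-separated "key\tvalue" lines; the file names and the just-under-2GB cap are only illustrative):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $MAX_DATA_FILE = 2**31 - 2**20;    # stay comfortably under 2GB per data file
    my $file_no = 0;
    my $offset  = 0;

    open my $keys_fh, '>', 'keys.lst'          or die "keys.lst: $!";
    open my $data_fh, '>', "data.$file_no.dat" or die "data.$file_no.dat: $!";

    while (my $line = <STDIN>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line, 2;

        # roll over to a new data file before crossing the 2GB line
        if ($offset + length($value) + 1 > $MAX_DATA_FILE) {
            close $data_fh;
            $file_no++;
            $offset = 0;
            open $data_fh, '>', "data.$file_no.dat" or die "data.$file_no.dat: $!";
        }

        print {$data_fh} $value, "\n";

        # key plus its 'pointer' (file number and byte offset) goes to the key list
        print {$keys_fh} join("\t", $key, $file_no, $offset), "\n";
        $offset += length($value) + 1;
    }

Everything here is strictly sequential on both drives, which is the whole point of this pass.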

- Sort. Not the data, obviously; I'm going to leave that well enough alone for now. The keys and their new (much smaller) 'pointers' (64 bits / 8 bytes will do it no problem, split into a file part and an offset part) will be sorted into a B-tree. Unfortunately the tree will have a bit of a performance problem, taking only slightly better than k*n*log(n) time to build (assuming Berkeley does early splitting, which is probable). Since these accesses will be effectively random, that's potentially quite a bit over 50 million HD cache misses. At 8.5ms seek time per miss (the small one actually is a Barracuda, so thanks for that link, tilly!) plus an average rotational latency of ~5.5ms, that's 14ms/miss. That's three years. Gah.
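One way to drive Berkeley's B-tree from Perl is DB_File's BTREE interface. A minimal sketch of loading the key list from the split step (raw keys for now; the 8-byte pointer is packed as a 32-bit file number plus a 32-bit offset, which is one workable layout, not the only one):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    $DB_BTREE->{'flags'}     = R_DUP;              # there will be duplicate keys, so keep them
    $DB_BTREE->{'cachesize'} = 512 * 1024 * 1024;  # give the B-tree a big in-memory cache

    tie my %index, 'DB_File', 'keys.btree', O_RDWR | O_CREAT, 0666, $DB_BTREE
        or die "keys.btree: $!";

    open my $keys_fh, '<', 'keys.lst' or die "keys.lst: $!";
    while (my $line = <$keys_fh>) {
        chomp $line;
        my ($key, $file_no, $offset) = split /\t/, $line;
        # 64-bit 'pointer': 32-bit file part + 32-bit offset part
        $index{$key} = pack 'NN', $file_no, $offset;
    }
    untie %index;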

Thankfully, there's a workaround. I've done some quick calculations on sample sets of data, and the keys (yes, there will be duplicates) are almost all under 16 characters, and I haven't seen any over 32. Note that I said characters! Since I don't mind getting a few false positives, these can (reasonably quickly) be converted into a binary representation: throw out everything that's not a letter, lowercase what's left, then subtract from 'a' to get a number expressible in 5 bits. With a 64-bit pointer, in 32 bytes I can express keys of up to 38 'characters' - perfect! The B-tree shouldn't need more than 8 bytes of overhead per node (prev, next, lt, gt links), so each element will be expressible in 40 bytes. With 2G of RAM, very little of the index will have to be on the disk (I'm going to have to look into semi-memory-mapped files, unfortunately). Huzzah! Now my sort time is large, certainly, but easily under a day, especially considering the good cache hit ratio for the 'top' of the tree. Once done, writing to the HD at half its max sustained write speed (I figure this is a fair metric) is 2G / 20 Mbytes/s = under two minutes! Excellent.
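One way to do that packing - a sketch only, assuming letters map to 1-26 with 0 as padding (any non-alphabetic character is simply thrown away, which is where the false positives come from):

    # squeeze a key into 24 bytes: up to 38 letters at 5 bits each
    sub pack_key {
        my ($key) = @_;
        (my $norm = lc $key) =~ tr/a-z//cd;     # lowercase, drop everything else
        $norm = substr $norm, 0, 38;            # cap at 38 'characters'
        my $bits = join '', map { sprintf '%05b', ord($_) - ord('a') + 1 }
                            split //, $norm;
        $bits .= '0' x (192 - length $bits);    # pad out to 192 bits = 24 bytes
        return pack 'B192', $bits;
    }

Those 24-byte strings plus the 8-byte pointer give the 32 bytes per entry mentioned above, and they compare in the same order as the normalized keys, so they could stand in for the raw keys in the B-tree load sketched earlier.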

So by this point the data will be on the correct hard drive, the keys will be on the smaller one, and the memory and spare HD can be returned. Cool.

Access time? No problem. With a width of four (less than what Berkeley actually uses, I believe) this will take under 8 hard drive hits, given a balanced tree. That's a tenth of a second, which admittedly could be better, but ah well. Once you consider that the top of the tree is probably going to stay in the hard drive's cache, the average will probably be around half to two thirds of that. Then there's the hit to the data drive, which will be around or under 20ms. Sure, this is slow, but (in the beginning - oh god, don't ask me how I'm going to scale it if/when the time comes) usage probably won't be over 1/s, so we're good.
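Put together, a lookup might go something like this (pack_key() is the hypothetical helper sketched above; error handling kept minimal):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_RDONLY :seek);

    $DB_BTREE->{'flags'} = R_DUP;
    tie my %index, 'DB_File', 'keys.btree', O_RDONLY, 0666, $DB_BTREE
        or die "keys.btree: $!";

    sub fetch_value {
        my ($key) = @_;
        # B-tree lookup: the ~8 hits on the keys drive. With R_DUP set this
        # returns the first matching entry; DB_File's get_dup() returns them all.
        my $ptr = $index{ pack_key($key) };
        return unless defined $ptr;
        my ($file_no, $offset) = unpack 'NN', $ptr;

        open my $fh, '<', "data.$file_no.dat" or die "data.$file_no.dat: $!";
        seek $fh, $offset, SEEK_SET or die "seek: $!";
        my $value = <$fh>;                     # the final hit, on the data drive
        chomp $value if defined $value;
        return $value;
    }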

And... the kicker. Updates! Updates will be painful - and at the rate of around 30,000/day, give or take, that's one every three seconds! Fortunately the insertion will only be one HD hit slower than searches, so it'll be OK - the hard part, which I won't go into here, is finding the correct 30k to update (long, long explanation).

So: it's doable. A hash would be great for its constant-time access, but building and updating it would be painful (I think - I've dealt with lots of weird tree variations, but most of the hashing algorithms I've seen have been the trivial case, so it's entirely possible there's a good hash solution out there).

Re^3: size on disk of tied hashes
by BrowserUk (Patriarch) on Aug 12, 2004 at 12:26 UTC

    If you find Berkeley too slow, or too memory expensive in practice, you might reconsider the md5 hashing suggestion. I've implemented this for my own purposes now, and the highlights are:

    Indexing 100_000_000 x 160-byte records in a data file.

    1. The index requires 2.23 GB for 100 Million records. Data record size does not affect the index size. It is a fixed 24-bytes/record.
    2. Building and sorting the index: 3 hours (worst case; under 1 hour best);
    3. Accessing 10_000 randomly chosen records: Under 3 minutes.

      That's locating the index entry and reading the record, combined.

      Worst timing: 1000 trials of binsearch ( 37.753s total), 37.753ms/trial

      Best timing: 10000 trials of binsearch ( 175.755s total), 17.576ms/trial

      Update: A 100,000-trial run just completed:

      100000 trials of binsearch ( 1,643s total), 16.437ms/trial

    4. Insert/delete* a record: Currently 1.2 seconds.

      This can be improved, I believe substantially.

      Insertion appends the new record to the end of the data file, and inserts the appropriate index entry.

      * Deletion consists of removing the index entry and adding it to a "deleted.idx" file.

      The actual record remains in-place until a compaction process is run. The latter is not part of the timing above.

    The above is hitting the disk (or system caches) for every read. I have some ideas for adding my own buffering, but they need testing.

    The test system was XP/NTFS on a 2.2 Ghz processor with a WDC WD800BB-75CAA0 75GB HD.

    The data file is compressed; the index is not.
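    For anyone who wants to try something along these lines, here is a bare-bones sketch of the lookup side. It assumes each 24-byte entry is the 16-byte MD5 of the key followed by the record offset packed as two 32-bit halves, and that the index file is sorted on the raw digest bytes - that fits the sizes quoted above, but it is only a guess at the exact layout:

        use strict;
        use warnings;
        use Digest::MD5 qw(md5);

        use constant ENTRY => 24;    # 16-byte digest + 8-byte offset

        open my $idx_fh, '<', 'data.idx' or die "data.idx: $!";
        binmode $idx_fh;

        # binary search the sorted, fixed-width index for one key
        sub find_offset {
            my ($key) = @_;
            my $digest = md5($key);
            my ($lo, $hi) = (0, (-s $idx_fh) / ENTRY - 1);

            while ($lo <= $hi) {
                my $mid = int(($lo + $hi) / 2);
                seek $idx_fh, $mid * ENTRY, 0 or die "seek: $!";
                read $idx_fh, my $entry, ENTRY;
                my $cmp = substr($entry, 0, 16) cmp $digest;
                if    ($cmp < 0) { $lo = $mid + 1 }
                elsif ($cmp > 0) { $hi = $mid - 1 }
                else {
                    my ($hi32, $lo32) = unpack 'NN', substr $entry, 16, 8;
                    return $hi32 * 2**32 + $lo32;    # byte offset into the data file
                }
            }
            return;    # not in the index
        }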

    For contrast, I gave up waiting for Berkeley to build a BTree index from a pre-sorted file of index entries after 18+ hours, at 57% complete. Hence I don't have figures for accesses, insertions, or deletions.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re^3: size on disk of tied hashes
by tilly (Archbishop) on Aug 11, 2004 at 21:49 UTC
    Big suggestion. Your first step should be to take your data and do a mergesort to get it into order. Mergesorts process data sequentially and in order, and therefore represent a best case for the standard pre-fetch/caching algorithms used in filesystems and on disk. Inserting already-sorted data into a BTree again lets filesystem caching do you a world of good.
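    Something like this chunked external mergesort would do for that first step (chunk size and file names are arbitrary; a heap would beat the linear scan in the merge pass, but this keeps the sketch short):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $CHUNK = 1_000_000;    # lines per in-memory chunk; tune to fit RAM
        my (@runs, @buf);

        # pass 1: sort memory-sized chunks into sorted run files
        open my $in, '<', 'raw.dat' or die "raw.dat: $!";
        while (1) {
            my $line = <$in>;
            push @buf, $line if defined $line;
            if (@buf and (@buf >= $CHUNK or !defined $line)) {
                my $run = 'run' . scalar(@runs) . '.dat';
                open my $out, '>', $run or die "$run: $!";
                print {$out} sort @buf;
                close $out;
                push @runs, $run;
                @buf = ();
            }
            last unless defined $line;
        }

        # pass 2: merge the runs, reading each one strictly sequentially
        my @fh   = map { open my $fh, '<', $_ or die "$_: $!"; $fh } @runs;
        my @head = map { scalar readline $_ } @fh;
        open my $sorted, '>', 'sorted.dat' or die "sorted.dat: $!";
        while (grep { defined } @head) {
            my ($best) = sort { $head[$a] cmp $head[$b] }
                         grep { defined $head[$_] } 0 .. $#head;
            print {$sorted} $head[$best];
            $head[$best] = readline $fh[$best];
        }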

    After that I'd expect your access speed to be better than you're planning on. Smart BTree implementations don't have a constant branch factor - instead they put as many branches into a page as they can. Even with the amount of data that you're facing, I think that a BTree should require no more than 5 page accesses to pull your data. And better yet, the root pages are usually going to be in memory so you'll probably average under 2 hits to disk for data. (Plus if there is some locality of reference in how it is used, then it could be less.)

    As for scaling, you can get a lot out of a bit of parallelism. The fact that one request is waiting on disk is no reason another request can't be served at the same time. One good architecture for that is a dedicated server with several threads or processes - you connect to one of them and it serves you; unbeknownst to you, others are doing the same. (I'd strongly recommend against using BerkeleyDB in a CGI environment because of problems with server start/stop. A mod_perl environment is fine, but in a CGI environment there are potential issues that you don't want to hit.)

    Of course if you're going to talk about that, then you might as well upgrade to the already-invented wheel of a decent RDBMS with row-level locking. To adapt this recipe to an RDBMS: create your table, and create an index on it that will internally be a BTree. (How you do that varies by database - see a DBA and/or the documentation.) Then take your dataset, do the mergesort on your own, and insert it into that table in order.

    If you've lined things up right, the load should proceed at reasonable speed, and when you're done the database will already have figured out how to coordinate multiple processes making requests of it at once.
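    In DBI terms the recipe is roughly this (the DSN, credentials, table and column names are all placeholders, and the exact index syntax varies by database):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:Pg:dbname=kvstore', 'user', 'password',
                               { RaiseError => 1, AutoCommit => 0 });

        $dbh->do('CREATE TABLE kv (k VARCHAR(64), v TEXT)');
        $dbh->do('CREATE INDEX kv_k_idx ON kv (k)');    # internally a BTree

        my $ins = $dbh->prepare('INSERT INTO kv (k, v) VALUES (?, ?)');

        open my $in, '<', 'sorted.dat' or die "sorted.dat: $!";   # the mergesorted dump
        while (my $line = <$in>) {
            chomp $line;
            my ($k, $v) = split /\t/, $line, 2;
            $ins->execute($k, $v);
        }
        $dbh->commit;

        # later, lookups go through the index
        my ($v) = $dbh->selectrow_array(
            'SELECT v FROM kv WHERE k = ?', undef, 'some key');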