http://www.perlmonks.org?node_id=951516

andromedia33 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have a problem that I need an urgent solution for. I am doing some calculations during which I'm building up a huge hash, estimated to be a few GB. The hash consists of roughly hundreds of thousands of keys, each key pointing to an array of values, and I just keep appending values to these arrays as I do my calculation. This of course cannot be handled by the memory of a normal PC.

I then tried to tie the hash to a file in the hope of not using up all the memory, but it becomes unacceptably slow. Based on my limited understanding, this is a classic memory vs. speed trade-off.
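
A minimal sketch of the tie approach, assuming DB_File and flattening each key's list of values into one delimited string (the file name and layout are illustrative only):

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    my %disk_hash;
    tie %disk_hash, 'DB_File', 'bighash.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
        or die "Cannot tie bighash.db: $!";

    # A DBM file stores flat strings, so "push onto an array" becomes
    # appending to a delimited string; every append is then a
    # read-modify-write of the whole record, which is where much of the
    # slowdown comes from.
    sub append_value {
        my ($key, $value) = @_;
        $disk_hash{$key} = defined $disk_hash{$key}
                         ? "$disk_hash{$key}\t$value"
                         : $value;
    }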

I then thought of the following: do a few steps of the calculation at a time, and write out a file containing the part of the hash built from those steps, sorted by key. So I end up with a few 500MB files, each storing part of the huge hash with its keys sorted.

Now my question is: how do I merge all these files without exhausting my memory or taking a month to complete?

I have kept a separate master_hash that contains only the sorted keys of the huge hash, without their values. If I tie to one small hash file at a time and extract the values for the keys in order, it's way too slow. May I have some suggestions, please? I've tried everything I can think of, but it's still taking more than a week just to put together one huge hash. Many thanks!!


Re: how to merge many files of sorted hashes?
by GrandFather (Saint) on Feb 02, 2012 at 20:08 UTC

    It may help to modify your many-file plan by using a database instead.

    If you can show us a demonstration version of the code and a small sample of the input data we may be able to show you how to handle the database side of a solution. An even better trick may be to make a sample data generator so we can scale the test data input according to our degree of patience.
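
    For example, a minimal sample data generator along those lines might look like this (assuming points uniform in a ±100 cube, one x y z triple per line):

        use strict;
        use warnings;

        # Print N random 3D points, one "x y z" triple per line; the
        # coordinate range is arbitrary.
        my $n = shift @ARGV // 100;
        for (1 .. $n) {
            printf "%.3f %.3f %.3f\n", map { rand(200) - 100 } 1 .. 3;
        }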

    True laziness is hard work
      Thanks for your reply! I start with 100 points, each with x, y, z coordinates:
      4.941 32.586 -1.772
      15.354 22.823 10.556
      -0.495 12.345 98.234
      ...

      $block_size = 1000;
      $counter    = 0;
      %small_hash = ();
      foreach @triplet of the 100 points {
          $counter++;
          my @matrix = ortho(@triplet);
          foreach $point in the 100 points {
              my $key = binning(@matrix * $point);
              push @{ $small_hash{$key} }, @matrix;
          }
          # binning causes many points to fall in the same grid,
          # thus the array of values (matrices) for each key
          if ($counter % $block_size == 0) {
              write out %small_hash;
              %small_hash = ();    # frees the memory
          }
      }

      # now merging small hash files
      foreach $key (sort keys %master_hash) {    # keys for master_hash stored
          loop thru all small files {
              while (<FILE>) {
                  if ($_ =~ /^$key/) {
                      write values to the master output file;
                      last;
                  } # if
              } # while
          } # go thru files
      }
      So that's roughly the structure (the actual code is too long to post here). The first half is really fast to compute, but once I start merging files it takes forever. Thanks a lot!

        All I learned from that about the big picture is that you are dealing with coordinates and you have some (undisclosed) way of making a key from the coordinates for each "something" you want to store. I still have no idea about what you actually want to store in a hashy thing, nor what you intend to do with it once stored.

        Without knowing what you want to eventually do with this stuff, I still think using a database (DBI and DBD::SQLite) is likely to be a good solution to the problem. But, again, if you can mock up something close to your real problem in 30 or so lines (fudge the calculations etc.) then we can probably give you a good starting point for the data management code.
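
        A rough sketch of the data management side with DBI and DBD::SQLite might look like this (the table and column names are made up for illustration):

            use strict;
            use warnings;
            use DBI;

            my $dbh = DBI->connect('dbi:SQLite:dbname=matrices.db', '', '',
                                   { RaiseError => 1, AutoCommit => 0 });

            # One row per (key, matrix) pair instead of a hash of arrays.
            $dbh->do(q{
                CREATE TABLE IF NOT EXISTS bins (
                    bin_key TEXT NOT NULL,
                    matrix  TEXT NOT NULL
                )
            });
            $dbh->do('CREATE INDEX IF NOT EXISTS bins_key ON bins (bin_key)');

            my $ins = $dbh->prepare('INSERT INTO bins (bin_key, matrix) VALUES (?, ?)');

            # In the calculation loop the push becomes an insert, e.g.
            #   $ins->execute($key, join ' ', @matrix);
            # Commit every few thousand rows to keep it fast.
            $dbh->commit;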

        True laziness is hard work
        so let's see if I'm reading this right:
        • you have 100 points
        • for every possible triple (1,000,000 combinations) you're computing a matrix
        • each matrix gets pushed 100 times (onto 100 different keys)
        • you start a new file every 1000 matrices (so there'll be 1000 files when you're done, and each file will have 100,000 matrices).
        but you don't say how many distinct keys there are (i.e., how finely binning is grouping things).

        But it doesn't matter. For every single key you are doing a linear search through every file. So that's (number of keys) * 100,000,000 matrices to wade through; no wonder this is slow.

        You need to be writing each %small_hash in sorted order. Sorting 1000 small hashes will be faster than sorting the big hash (1000 * (n/1000) * log(n/1000) = n log(n/1000) < n log(n)), but the really important thing is that this will enable the merge I was talking about before, which reads each file only once rather than once for every key value.
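
        A minimal sketch of writing one chunk in sorted order, assuming one tab-separated "key value value ..." line per key (the file naming is illustrative only):

            use strict;
            use warnings;

            # Dump one %small_hash to disk in key order, one line per key,
            # with the key's values tab-separated after it.
            sub write_sorted_chunk {
                my ($chunk_no, $small_hash) = @_;
                open my $out, '>', sprintf('chunk_%04d.txt', $chunk_no)
                    or die "Cannot write chunk $chunk_no: $!";
                for my $key (sort keys %$small_hash) {
                    print {$out} join("\t", $key, @{ $small_hash->{$key} }), "\n";
                }
                close $out or die "Error closing chunk $chunk_no: $!";
            }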

Re: how to merge many files of sorted hashes?
by wrog (Friar) on Feb 02, 2012 at 22:06 UTC
    If the goal is to have everything in one big file ordered by key, and you have several smaller 500MB files already ordered by key, then you just want to do a straight merge:
    1. open all of the files
    2. read a key from each file
    3. sort the filehandles by key
    4. take the file whose current key is the smallest,
    5. read that key's value,
    6. copy the key and value to the output,
    7. read the next key from the same file,
    8. move that filehandle to the place in the filehandle list corresponding to the new key (or just re-sort the list, if it's really short),
    9. go back to 4 and repeat until all of the files are exhausted.
    This should use extremely small amounts of memory — you're only ever keeping n filehandles and n keys in memory at any given time and every file is being read sequentially, which is the fastest way you can do things, diskwise.
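
    A minimal sketch of that merge, assuming the chunk files are already sorted by key and hold one "key<TAB>values..." line per record (the file names and record format are illustrative only):

        use strict;
        use warnings;

        my @streams;
        for my $file (glob 'chunk_*.txt') {                # 1. open all of the files
            open my $fh, '<', $file or die "Cannot open $file: $!";
            my $line = <$fh>;                              # 2. read a key from each file
            next unless defined $line;
            my ($key) = split /\t/, $line, 2;
            push @streams, { fh => $fh, key => $key, line => $line };
        }

        open my $out, '>', 'merged.txt' or die "Cannot write merged.txt: $!";
        while (@streams) {
            # 3./8. keep the filehandles ordered by their current key
            # (a full re-sort is fine while the number of files is small)
            @streams = sort { $a->{key} cmp $b->{key} } @streams;
            my $s = $streams[0];                           # 4. smallest key
            print {$out} $s->{line};                       # 5./6. copy key and value out
            if (defined(my $next = readline $s->{fh})) {   # 7. next key from the same file
                ($s->{key}) = split /\t/, $next, 2;
                $s->{line}  = $next;
            }
            else {
                close $s->{fh};                            # this file is exhausted
                shift @streams;
            }
        }
        close $out;

    Duplicate keys coming from different files simply end up on adjacent lines of the output, ready to be combined in one more pass if needed.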

    On the other hand, I'm still not clear on why you'd want everything in one file; much depends on how you're going to be using this file thereafter.

    Instead of copying the value out in step 6, you may do just as well to call tell() to get a disk position and record that. That way you can have a master file that associates every key with a disk position and a number from 1..n indicating which file it is in, and then you're not having to copy any files at all.
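
    A sketch of that variant: the same merge as above, except that step 6 records (key, file number, byte offset) obtained with tell() instead of copying the value (names and layout are again illustrative):

        use strict;
        use warnings;

        my @files = glob 'chunk_*.txt';
        my @streams;
        for my $file_no (0 .. $#files) {
            open my $fh, '<', $files[$file_no]
                or die "Cannot open $files[$file_no]: $!";
            my $pos  = tell $fh;              # offset of the line about to be read
            my $line = <$fh>;
            next unless defined $line;
            my ($key) = split /\t/, $line, 2;
            push @streams, { fh => $fh, file_no => $file_no, key => $key, pos => $pos };
        }

        open my $idx, '>', 'master_index.txt'
            or die "Cannot write master_index.txt: $!";
        while (@streams) {
            @streams = sort { $a->{key} cmp $b->{key} } @streams;
            my $s = $streams[0];
            # record where the value lives rather than the value itself
            print {$idx} join("\t", $s->{key}, $s->{file_no}, $s->{pos}), "\n";
            $s->{pos} = tell $s->{fh};        # offset of the next record
            if (defined(my $next = readline $s->{fh})) {
                ($s->{key}) = split /\t/, $next, 2;
            }
            else {
                close $s->{fh};
                shift @streams;
            }
        }
        close $idx;

    A value can then be fetched later by opening the right chunk file, calling seek($fh, $offset, 0), and reading one line.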

      Thank you very much! This sounds exactly like what I'm looking for.
      On second thought, maybe I do not HAVE to put everything into a single file; I could just keep a few files. I need to think about this point, because I'll need to do pairwise comparisons after constructing these hash tables (table against table) by looking for common keys. Originally I thought that having all keys sorted in a single file would make such comparisons really easy, but maybe keeping several files can save a lot more cost at the table construction stage.
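
      A minimal sketch of that pairwise comparison, assuming two key-sorted files with one "key<TAB>values..." line per record and each key appearing at most once per file (names are illustrative):

          use strict;
          use warnings;

          # Read the next key from a sorted record file, or undef at EOF.
          sub next_key {
              my ($fh) = @_;
              my $line = <$fh>;
              return unless defined $line;
              return (split /\t/, $line, 2)[0];
          }

          # Walk two key-sorted files in lockstep, collecting shared keys.
          sub common_keys {
              my ($file_a, $file_b) = @_;
              open my $fh_a, '<', $file_a or die "Cannot open $file_a: $!";
              open my $fh_b, '<', $file_b or die "Cannot open $file_b: $!";
              my @common;
              my $ka = next_key($fh_a);
              my $kb = next_key($fh_b);
              while (defined $ka and defined $kb) {
                  if    ($ka lt $kb) { $ka = next_key($fh_a) }
                  elsif ($ka gt $kb) { $kb = next_key($fh_b) }
                  else {                     # same key in both tables
                      push @common, $ka;
                      $ka = next_key($fh_a);
                      $kb = next_key($fh_b);
                  }
              }
              return @common;
          }
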
        Having seen your code here it's now clear that a given key can show up in multiple files and therefore you probably do want to be merging those values and only having the key appear in one place, which then means that a certain amount of copying is unavoidable.

        There's also an advantage to having output files that each hold every key within a certain range, which means your master file/table doesn't need a column saying which file a key is in (you just infer it from the range the key falls in), which saves massive amounts of space...

Re: how to merge many files of sorted hashes?
by sundialsvc4 (Abbot) on Feb 03, 2012 at 14:52 UTC

    Sometimes, when I have “hairy things referring to other hairy things,” such as a matrix referring to another matrix, I find it useful to introduce the concept of surrogate keys. This sort-of goes back to, “if I had to store all this stuff in an old-fashioned library card catalog, which drawer would I put it in and why?” First of all, I would assign every matrix that I had (no matter how I intended to use it) a random unique identifier such as a UUID. Then, I would produce some kind of arithmetic hash-value that would allow me to sift through all the data that I had in order to find it faster. Something simple, like the sum of every number in the list after truncating that number to an integer. I’d tag the information with that value, and then, look only for that tag. (The original notion of “hashing.”)
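
    A minimal sketch of that arithmetic hash-value (purely illustrative): sum the numbers of a matrix after truncating each one to an integer, and use the sum as a coarse catalog tag.

        use strict;
        use warnings;

        # Coarse catalog tag: the sum of the matrix's numbers, each
        # truncated to an integer first.
        sub bucket_tag {
            my (@numbers) = @_;    # the matrix as a flat list of numbers
            my $tag = 0;
            $tag += int($_) for @numbers;
            return $tag;
        }

        # e.g. bucket_tag(4.941, 32.586, -1.772) returns 4 + 32 + (-1) = 35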

    You don’t necessarily have to put those hundreds of megabytes of data into a database. (SQLite, by the way, is an excellent tool for this.) You just need to find a way to use a database file to catalog it ... to tell you which file it’s in, and where. To reduce the amount of search time that you must spend to find a particular piece of information: not reducing that time to zero, just reducing it.
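
    A sketch of such a catalog in SQLite: for each key it records which file holds the data and at what byte offset (the schema and names are made up for illustration).

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=catalog.db', '', '',
                               { RaiseError => 1 });
        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS catalog (
                bin_key TEXT    NOT NULL,
                file_no INTEGER NOT NULL,
                offset  INTEGER NOT NULL
            )
        });
        $dbh->do('CREATE INDEX IF NOT EXISTS catalog_key ON catalog (bin_key)');

        # Populate it while writing or merging the data files, then look a
        # key up with:
        #   SELECT file_no, offset FROM catalog WHERE bin_key = ?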