http://www.perlmonks.org?node_id=951540


in reply to how to merge many files of sorted hashes?

If the goal is to have everything in one big file ordered by key, and you have several smaller 500MB files already ordered by key, then you just want to do a straight merge:
  1. open all of the files
  2. read a key from each file
  3. sort the filehandles by key
  4. take the file corresponding to the smallest key,
  5. read that key's value,
  6. copy the key and value to the output,
  7. read the next key from the same file,
  8. move that filehandle to the place on the filehandle list corresponding to the new key (or just sort the list again, if it's really short),
  9. go back to 4 and repeat until all of the files are exhausted.
This should use very little memory: you're only ever keeping n filehandles and n keys in memory at any given time, and every file is being read sequentially, which is the fastest way you can do things, disk-wise.
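
Here's a minimal sketch of that merge in Perl. It assumes each input file holds one tab-separated key/value record per line, already sorted by key; the file names (part_*.dat, merged.dat) are placeholders for whatever you actually have:

  use strict;
  use warnings;

  # Open every sorted input file and prime each one with its first record.
  my @files = glob('part_*.dat');              # placeholder input names
  open my $out, '>', 'merged.dat' or die "merged.dat: $!";

  my @streams;
  for my $name (@files) {
      open my $fh, '<', $name or die "$name: $!";
      my $line = <$fh>;
      next unless defined $line;               # skip empty files
      chomp $line;
      my ($key, $value) = split /\t/, $line, 2;
      push @streams, { fh => $fh, key => $key, value => $value };
  }

  while (@streams) {
      # Sort the filehandles by their current key and take the smallest.
      @streams = sort { $a->{key} cmp $b->{key} } @streams;
      my $s = $streams[0];
      print {$out} "$s->{key}\t$s->{value}\n";

      # Read the next record from the same file; drop the file when exhausted.
      my $line = readline $s->{fh};
      if (defined $line) {
          chomp $line;
          ($s->{key}, $s->{value}) = split /\t/, $line, 2;
      }
      else {
          close $s->{fh};
          shift @streams;
      }
  }
  close $out;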

On the other hand, I'm still not clear on why you'd want everything in one file; much depends on how you're going to be using this file thereafter.

Instead of copying the value out in step 6, you may do just as well to call tell() to get a disk position and record that. That way you can have a master file that associates every key with a disk position and a number from 1..n indicating which file it is in, and then you're not having to copy any files at all.
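
A sketch of that tell() variant, under the same record-format assumption (one tab-separated key/value pair per line, files sorted by key): instead of copying values, the merge loop writes the key, a file number, and the byte offset where the record starts. The names (part_*.dat, master.idx) are again placeholders:

  use strict;
  use warnings;

  my @files = glob('part_*.dat');              # placeholder input names
  open my $idx, '>', 'master.idx' or die "master.idx: $!";

  my @streams;
  for my $n (0 .. $#files) {
      open my $fh, '<', $files[$n] or die "$files[$n]: $!";
      my $pos  = tell $fh;                     # offset of the record about to be read
      my $line = <$fh>;
      next unless defined $line;
      chomp $line;
      my ($key) = split /\t/, $line, 2;
      push @streams, { fh => $fh, file => $n, key => $key, pos => $pos };
  }

  while (@streams) {
      @streams = sort { $a->{key} cmp $b->{key} } @streams;
      my $s = $streams[0];
      print {$idx} "$s->{key}\t$s->{file}\t$s->{pos}\n";

      $s->{pos} = tell $s->{fh};               # where the next record starts
      my $line  = readline $s->{fh};
      if (defined $line) {
          chomp $line;
          ($s->{key}) = split /\t/, $line, 2;
      }
      else {
          close $s->{fh};
          shift @streams;
      }
  }
  close $idx;

  # To fetch a value later: open the file named by the file number, then
  #   seek $fh, $offset, 0;  my (undef, $value) = split /\t/, scalar <$fh>, 2;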

Re^2: how to merge many files of sorted hashes?
by andromedia33 (Novice) on Feb 03, 2012 at 01:14 UTC
    thank you very much! this sounds exactly like what i'm looking for.
    on second thought, maybe i do not HAVE to push everything into a single file, but could just keep a few files. i need to think about this point, because i'll need to do pairwise comparisons after constructing these hash tables (table against table) by looking for common keys. originally i thought that having all keys sorted in a single file would make such comparisons really easy, but maybe having several files can save a lot more at the table construction stage.
      Having seen your code here, it's now clear that a given key can show up in multiple files, so you probably do want to merge those values and have the key appear in only one place, which means a certain amount of copying is unavoidable.

      There's also an advantage to having each output file hold every key within a certain range: then your master file/table doesn't need a column saying which file a key is in (you can just infer it from the range the key falls in), which saves massive amounts of space...
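
      A minimal sketch of that range idea, assuming the same tab-separated records: route each merged record to an output file chosen from the key's first character, so which file a key lives in follows from the key itself. The first-character rule and the range_*.dat names are only stand-ins for whatever boundaries suit your real key distribution:

        use strict;
        use warnings;

        my %out;    # one output filehandle per key range

        sub bucket_for {
            my ($key) = @_;
            my $range = lc substr($key, 0, 1);             # crude range rule
            $range = 'misc' unless $range =~ /^[a-z0-9]$/;
            unless ($out{$range}) {
                open $out{$range}, '>', "range_$range.dat"
                    or die "range_$range.dat: $!";
            }
            return $out{$range};
        }

        # Inside the merge loop, replace the single output handle with:
        #   print { bucket_for($s->{key}) } "$s->{key}\t$s->{value}\n";
        # and close all of the range files when the merge finishes:
        #   close $_ for values %out;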

        i think the sort-merge join you explained earlier should solve my problem immediately (i had no idea how to do it). now i'm debating whether i need to incorporate this into a database, because my programming is really at a beginner's level.
        the entire purpose of doing this is explained here. the estimated size of all hash tables is probably a few hundred GB.
        let me try your suggestion first. thank you very much for your help. really appreciate it.