http://www.perlmonks.org?node_id=951540


in reply to how to merge many files of sorted hashes?

If the goal is to have everything in one big file ordered by key, and you have several smaller 500MB files already ordered by key, then you just want to do a straight merge:
  1. open all of the files
  2. read a key from each file
  3. sort the filehandles by key
  4. take the file corresponding to the smallest key,
  5. read that key's value,
  6. copy the key and value to the output,
  7. read the next key from the same file,
  8. move that filehandle to the place on the filehandle list corresponding to the new key (or just sort the list again, if it's really short),
  9. go back to 4 and repeat until all of the files are exhausted.
This should use very little memory: you're only ever keeping n filehandles and n keys in memory at any given time, and every file is being read sequentially, which is the fastest way you can do things, disk-wise.
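
Here's a minimal sketch of that merge in Perl. It assumes each input file holds one tab-separated key/value record per line, already sorted by key; the file names (part_*.dat, merged.dat) are placeholders for whatever you actually have:

  use strict;
  use warnings;

  # Open every sorted input file and prime each one with its first record.
  my @files = glob('part_*.dat');              # placeholder input names
  open my $out, '>', 'merged.dat' or die "merged.dat: $!";

  my @streams;
  for my $name (@files) {
      open my $fh, '<', $name or die "$name: $!";
      my $line = <$fh>;
      next unless defined $line;               # skip empty files
      chomp $line;
      my ($key, $value) = split /\t/, $line, 2;
      push @streams, { fh => $fh, key => $key, value => $value };
  }

  while (@streams) {
      # Sort the filehandles by their current key and take the smallest.
      @streams = sort { $a->{key} cmp $b->{key} } @streams;
      my $s = $streams[0];
      print {$out} "$s->{key}\t$s->{value}\n";

      # Read the next record from the same file; drop the file when exhausted.
      my $line = readline $s->{fh};
      if (defined $line) {
          chomp $line;
          ($s->{key}, $s->{value}) = split /\t/, $line, 2;
      }
      else {
          close $s->{fh};
          shift @streams;
      }
  }
  close $out;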

On the other hand, I'm still not clear on why you'd want everything in one file; much depends on how you're going to be using this file thereafter.

Instead of copying the value out in step 6, you may do just as well to call tell() to get a disk position and record that. That way you can have a master file that associates every key with a disk position and a number from 1..n indicating which file it is in, and then you're not having to copy any files at all.
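
A sketch of that tell() variant, under the same record-format assumption (one tab-separated key/value pair per line, files sorted by key): instead of copying values, the merge loop writes the key, a file number, and the byte offset where the record starts. The names (part_*.dat, master.idx) are again placeholders:

  use strict;
  use warnings;

  my @files = glob('part_*.dat');              # placeholder input names
  open my $idx, '>', 'master.idx' or die "master.idx: $!";

  my @streams;
  for my $n (0 .. $#files) {
      open my $fh, '<', $files[$n] or die "$files[$n]: $!";
      my $pos  = tell $fh;                     # offset of the record about to be read
      my $line = <$fh>;
      next unless defined $line;
      chomp $line;
      my ($key) = split /\t/, $line, 2;
      push @streams, { fh => $fh, file => $n, key => $key, pos => $pos };
  }

  while (@streams) {
      @streams = sort { $a->{key} cmp $b->{key} } @streams;
      my $s = $streams[0];
      print {$idx} "$s->{key}\t$s->{file}\t$s->{pos}\n";

      $s->{pos} = tell $s->{fh};               # where the next record starts
      my $line  = readline $s->{fh};
      if (defined $line) {
          chomp $line;
          ($s->{key}) = split /\t/, $line, 2;
      }
      else {
          close $s->{fh};
          shift @streams;
      }
  }
  close $idx;

  # To fetch a value later: open the file named by the file number, then
  #   seek $fh, $offset, 0;  my (undef, $value) = split /\t/, scalar <$fh>, 2;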

Re^2: how to merge many files of sorted hashes?
by andromedia33 (Novice) on Feb 03, 2012 at 01:14 UTC
    thank you very much! this sounds exactly like what i'm looking for.
    on second thought, maybe i do not HAVE to push everything into a single file, but could just keep a few files. i need to think about this point, because i'll need to do pairwise comparisons after constructing these hash tables (table against table) by looking for common keys. originally i thought that having all keys sorted in a single file would make such comparisons really easy, but maybe having several files can save a lot more at the table construction stage.
      Having seen your code here, it's now clear that a given key can show up in multiple files, so you probably do want to merge those values and have the key appear in only one place, which means a certain amount of copying is unavoidable.

      There's also an advantage to having each output file hold every key within a certain range: then your master file/table doesn't need a column saying which file a key is in (you can just infer it from the range the key falls in), which saves massive amounts of space...
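
      A minimal sketch of that range idea, assuming the same tab-separated records: route each merged record to an output file chosen from the key's first character, so which file a key lives in follows from the key itself. The first-character rule and the range_*.dat names are only stand-ins for whatever boundaries suit your real key distribution:

        use strict;
        use warnings;

        my %out;    # one output filehandle per key range

        sub bucket_for {
            my ($key) = @_;
            my $range = lc substr($key, 0, 1);             # crude range rule
            $range = 'misc' unless $range =~ /^[a-z0-9]$/;
            unless ($out{$range}) {
                open $out{$range}, '>', "range_$range.dat"
                    or die "range_$range.dat: $!";
            }
            return $out{$range};
        }

        # Inside the merge loop, replace the single output handle with:
        #   print { bucket_for($s->{key}) } "$s->{key}\t$s->{value}\n";
        # and close all of the range files when the merge finishes:
        #   close $_ for values %out;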

        i think the sort-merge join you explained earlier should solve my problem immediately (i had no idea how to do it). now i'm debating whether i need to incorporate this into a database, because my programming is really at a beginner's level.
        the entire purpose of doing this is explained here. the estimated size of all hash tables is probably a few hundred GB.
        let me try your suggestion first. thank you very much for your help. really appreciate it.