If the goal is to have everything in one big file ordered by key, and you already have several smaller 500MB files that are each ordered by key, then you just want to do a straight merge:
1. Open all of the files.
2. Read a key from each file.
3. Sort the filehandles by key.
4. Take the file corresponding to the smallest key.
5. Read that key's value from it.
6. Copy the key and value to the output.
7. Read the next key from the same file.
8. Move that filehandle to the place in the filehandle list corresponding to its new key (or just sort the list again, if it's really short). If the file has run out of keys, drop it from the list instead.
9. Go back to step 4 and repeat until all of the files are exhausted.
This should use an extremely small amount of memory: you're only ever keeping n filehandles and n keys in memory at any given time. Every file is also being read sequentially, which is the fastest way you can do things, diskwise.
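For what it's worth, here is a rough Python sketch of the merge described above. It rests on assumptions that are mine, not yours: each record is one line of the form key<TAB>value, and the input files are sorted in plain string order. The function name merge_sorted_files is made up for illustration. It also uses a heap (heapq) in place of re-sorting the filehandle list, which does the same job of always finding the smallest current key.

```python
import heapq


def merge_sorted_files(input_paths, output_path):
    """Merge key-sorted files of 'key<TAB>value' lines into one sorted file.

    Assumed record format (not from the original post): one record per
    line, key and value separated by a tab, files sorted by string order.
    """
    handles = [open(p, "r", encoding="utf-8") for p in input_paths]

    def next_record(idx):
        # Read one record from file idx; return None when it is exhausted.
        line = handles[idx].readline()
        if not line:
            return None
        line = line.rstrip("\n")
        key = line.split("\t", 1)[0]
        return (key, idx, line)

    try:
        # Only one pending record per file is ever held in memory.
        heap = []
        for idx in range(len(handles)):
            rec = next_record(idx)
            if rec is not None:
                heap.append(rec)
        heapq.heapify(heap)

        with open(output_path, "w", encoding="utf-8") as out:
            while heap:
                # Smallest key across all files comes off the heap first.
                key, idx, line = heapq.heappop(heap)
                out.write(line + "\n")
                # Refill from the same file that just supplied a record.
                rec = next_record(idx)
                if rec is not None:
                    heapq.heappush(heap, rec)
    finally:
        for fh in handles:
            fh.close()
```

You'd call it as merge_sorted_files(["part1.txt", "part2.txt"], "merged.txt"), with those file names being placeholders. If your keys are numeric or sorted under some other ordering, the key extracted in next_record would need converting to match.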
On the other hand, I'm still not clear on why you'd want everything in one file; much depends on how you're going to be using this file thereafter.
Instead of copying the value out in step 6, you may do just as well to call tell() to get the disk position and record that. That way you can build a master file that associates every key with a disk position and a number from 1..n indicating which file it is in, and then you don't have to copy any of the files at all.
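A rough Python sketch of that index idea follows, under the same assumed record format as before (one key<TAB>value record per line, which is my guess and not something you've stated). The names build_index and fetch_record, and the index layout of key<TAB>file number<TAB>offset, are made up for illustration. Files are opened in binary mode so that tell() returns plain byte offsets, and the offset recorded is the start of the whole record, not of the value alone.

```python
import heapq


def build_index(input_paths, index_path):
    """Write a key-ordered index of (key, file number 1..n, byte offset)."""
    handles = [open(p, "rb") for p in input_paths]

    def next_entry(file_no):
        # Note the position first, then read the record that starts there.
        fh = handles[file_no - 1]
        offset = fh.tell()
        line = fh.readline()
        if not line:
            return None
        key = line.rstrip(b"\r\n").split(b"\t", 1)[0]
        return (key, file_no, offset)

    try:
        heap = []
        for file_no in range(1, len(handles) + 1):
            entry = next_entry(file_no)
            if entry is not None:
                heap.append(entry)
        heapq.heapify(heap)

        with open(index_path, "wb") as index:
            while heap:
                key, file_no, offset = heapq.heappop(heap)
                # Record where the data lives instead of copying it.
                index.write(b"%s\t%d\t%d\n" % (key, file_no, offset))
                entry = next_entry(file_no)
                if entry is not None:
                    heapq.heappush(heap, entry)
    finally:
        for fh in handles:
            fh.close()


def fetch_record(input_paths, file_no, offset):
    """Look up one record through the index: seek, then read one line."""
    with open(input_paths[file_no - 1], "rb") as fh:
        fh.seek(offset)
        return fh.readline().rstrip(b"\r\n")
```

Whether this beats the full merge depends on the access pattern: lookups through the index cost a seek each, so it suits random access by key better than reading everything back in order.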