in reply to
Using less memory with BIG files
Extending Moritz’s idea a little further: another trick is to scan the file from stem to stern once, noting where the “important pieces” begin and end and which “key values” you will use when searching for those records. Insert the keys into a hash, with a file position (or a list of file positions) as the value. After this one sequential pass through the entire file, you can seek() directly to any of those positions at any time thereafter. (If along the way you have noted both the starting position and the size of each entry, you can “slurp” any particular record into, say, a string variable fairly effortlessly.) Only the index has to live in memory, not the records themselves. This is a useful technique for files that are “loosely” structured, as this one seems to be.
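To make the index-then-seek idea concrete, here is a minimal sketch in Python (the same pattern works with seek() and tell() in most languages). The record layout is purely an assumption for illustration: each record is taken to start with a line beginning `ID:` followed by its key, and to run until the next such line. The names `build_index` and `fetch` are made up for this example.

```python
from collections import defaultdict

def build_index(path, marker=b"ID:"):
    """One sequential pass: map each key to a list of (offset, length) pairs."""
    index = defaultdict(list)
    key, start = None, 0
    # Binary mode, so that tell() gives exact byte offsets we can seek() to later.
    with open(path, "rb") as fh:
        while True:
            pos = fh.tell()
            line = fh.readline()
            # A new marker line (or end of file) closes the record in progress.
            if not line or line.startswith(marker):
                if key is not None:
                    index[key].append((start, pos - start))
                if not line:
                    break
                key, start = line[len(marker):].strip().decode(), pos
    return index

def fetch(path, offset, length):
    """Slurp one record into a string, given its start position and size."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.read(length).decode()
```

After `index = build_index("big.txt")`, something like `fetch("big.txt", *index["some_key"][0])` pulls back the first record stored under that key without re-reading the rest of the file. The file name and key here are placeholders.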
Now, if you happen to know that the two files are sorted, and specifically that they are sorted the same way (that is, if you can positively assert from some outside knowledge that this is true, and always will be true, of these files), then your logic becomes a good bit simpler: you can read the two files sequentially and do everything in just one forward pass, just as people used to do when the only mass-storage device of any reasonable size at their disposal was a tape drive. It may be too messy to sort the files yourself, and maybe you do not want to risk that they might be, ahem, “out of sorts,” but it’s a handy trick to use (and bloody fast ...) when you know that you can.
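The one-forward-pass approach might look roughly like the following Python sketch. It assumes, purely for illustration, that each line is `key<TAB>value`, that keys are unique within each file, and that both files are sorted ascending in the same order that Python's string comparison uses. `read_keyed` and `merge_join` are hypothetical names.

```python
def read_keyed(fh):
    """Yield (key, rest_of_line) for each tab-separated line."""
    for line in fh:
        key, _, rest = line.rstrip("\n").partition("\t")
        yield key, rest

def merge_join(path_a, path_b):
    """Yield (key, value_a, value_b) for every key present in both sorted files."""
    with open(path_a) as fa, open(path_b) as fb:
        a, b = read_keyed(fa), read_keyed(fb)
        rec_a, rec_b = next(a, None), next(b, None)
        while rec_a is not None and rec_b is not None:
            if rec_a[0] < rec_b[0]:
                rec_a = next(a, None)      # file A is behind: advance it
            elif rec_a[0] > rec_b[0]:
                rec_b = next(b, None)      # file B is behind: advance it
            else:
                yield rec_a[0], rec_a[1], rec_b[1]
                rec_a, rec_b = next(a, None), next(b, None)
```

Only one line from each file is held in memory at a time, which is the whole appeal. If the sort-order assumption is wrong, this quietly misses matches rather than failing, so it is only safe when you really do know the files are sorted alike.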