Re: Solve the large file size issue

It superficially appears to me that what you are setting out to do here is simply, “a merge.” If you know that you have two files which are sorted by an identical key, you can write very efficient logic to process the two files. Or, quite likely, you can find an already-existing CPAN module that does this. (Sort::Merge and File::MergeSort both look interesting.)

If you want to “code your own” solution, here’s how I presented one to my COBOL classes, all those years ago. (Utterly ignoring the textbook’s complicated examples.) Use a state-machine approach: first, figure out what state you’re in, then do the right thing. In a “two files” scenario, there are the following states:

Initial state: nothing has been read from either file yet.
Final state: STOP RUN.
You have reached the end of both files. (Therefore, switch to final-state.)
You have reached the end of file #1 but not file #2.
You have reached the end of file #2 but not file #1.
You have records from both files, and the keys are identical.
... and the key from file #1 is smaller.
... and the key from file #2 is smaller.
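
For what it's worth, the states above might be sketched in Perl roughly as follows. This is only a minimal sketch under some assumptions of mine, not anything taken from the original question: each record is one line, the key is the first tab-separated field, both files are sorted ascending by string comparison on that key, and keys are unique within each file. The names key_of, next_rec, state_of and merge_two are made up for illustration. Note that "end of file #2" and "key from file #1 is smaller" call for the same action (handle the file #1 record alone), so the sketch folds them into one label, and likewise for file #2.

```perl
use strict;
use warnings;

# Assumed record format: "key<TAB>rest of line". Adjust to suit your data.
sub key_of {
    my ($line) = @_;
    my ($key) = split /\t/, $line, 2;
    return $key // '';
}

# Read one record from a filehandle; returns undef at end of file.
sub next_rec {
    my ($fh) = @_;
    my $line = <$fh>;
    chomp $line if defined $line;
    return $line;
}

# First, figure out what state you're in ...
sub state_of {
    my ($r1, $r2) = @_;
    return 'done'  if !defined($r1) && !defined($r2);  # end of both files
    return 'only2' if !defined($r1);                   # end of file #1 only
    return 'only1' if !defined($r2);                   # end of file #2 only
    my $cmp = key_of($r1) cmp key_of($r2);
    return $cmp == 0 ? 'both'                          # keys identical
         : $cmp <  0 ? 'only1'                         # file #1 key smaller
         :             'only2';                        # file #2 key smaller
}

# ... then do the right thing. $on_event is called as
# $on_event->($state, $rec1, $rec2) once per step.
sub merge_two {
    my ($fh1, $fh2, $on_event) = @_;

    # Initial state: nothing read yet, so prime one record from each file.
    my $r1 = next_rec($fh1);
    my $r2 = next_rec($fh2);

    while ((my $state = state_of($r1, $r2)) ne 'done') {
        if ($state eq 'both') {
            $on_event->($state, $r1, $r2);
            $r1 = next_rec($fh1);
            $r2 = next_rec($fh2);
        }
        elsif ($state eq 'only1') {
            $on_event->($state, $r1, undef);
            $r1 = next_rec($fh1);
        }
        else {    # 'only2'
            $on_event->($state, undef, $r2);
            $r2 = next_rec($fh2);
        }
    }
    return;
}

# Usage with in-memory filehandles standing in for the two sorted files.
my $left  = "apple\t1\nbanana\t2\ncherry\t3\n";
my $right = "banana\tx\ndate\ty\n";
open my $fh1, '<', \$left  or die "cannot open left: $!";
open my $fh2, '<', \$right or die "cannot open right: $!";

merge_two($fh1, $fh2, sub {
    my ($state, $rec1, $rec2) = @_;
    print "$state: ", key_of($rec1 // $rec2), "\n";
});
```

Each file is read exactly once, front to back, and only one record per file is ever held in memory, which is why file size stops being an issue.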

(koff, koff ...) Interesting stuff for a late-night community college class, yes, and nice because it can easily be extended to deal with any number of input files. But otherwise, this is “a thing already done.” This sort of data-processing has (literally ...) been done, and done very well, since the days of Herman Hollerith. Grab an existing, off-the-shelf CPAN module and use it.
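
If you're curious how the “any number of input files” extension might look before reaching for a module, here is a rough sketch. It makes the same assumptions as any hand-rolled merge (one record per line, key is the first tab-separated field, every file sorted ascending by string comparison, keys unique within a file), and merge_many, key_of and next_rec are names I made up for illustration. The idea: hold one current record per file, find the lowest key among the files that still have data, hand over every record carrying that key, then advance just those files.

```perl
use strict;
use warnings;
use List::Util qw(minstr);

# Assumed record format: "key<TAB>rest of line".
sub key_of {
    my ($line) = @_;
    my ($key) = split /\t/, $line, 2;
    return $key // '';
}

# Read one record from a filehandle; returns undef at end of file.
sub next_rec {
    my ($fh) = @_;
    my $line = <$fh>;
    chomp $line if defined $line;
    return $line;
}

# $fhs is an array ref of key-sorted filehandles. $on_group is called as
# $on_group->($key, \@recs), where $recs[$i] is the record from file $i
# carrying that key, or undef if file $i has no such record.
sub merge_many {
    my ($fhs, $on_group) = @_;

    # Initial state: prime one record from every file.
    my @cur = map { next_rec($_) } @$fhs;

    # Keep going while at least one file still has a record.
    while (my @live = grep { defined($cur[$_]) } 0 .. $#cur) {
        my $low = minstr(map { key_of($cur[$_]) } @live);

        my @group = map {
            (defined($cur[$_]) && key_of($cur[$_]) eq $low) ? $cur[$_] : undef
        } 0 .. $#cur;

        $on_group->($low, \@group);

        # Advance only the files whose record was just consumed.
        for my $i (grep { defined($group[$_]) } 0 .. $#group) {
            $cur[$i] = next_rec($fhs->[$i]);
        }
    }
    return;
}
```

The linear scan for the lowest key is fine for a handful of files; with a great many inputs you would want a heap instead, which is one more reason to prefer something off the shelf.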

The computer should positively scream through a “mere” 8 million lines, since it only has to make one sequential pass through the file(s) to produce the right answer. You should have your solution in, say, “worst case, a second or so...”

