This process would also benefit from being run, say, in a cluster. If your program accepted as parameters the starting and ending line-numbers that it is to process, then multiple independent instances of the same process, running on different machines, could each summarize a chunk of the file, doing it in the manner just suggested:
- Tally all of the unique values that occur in the chunk, with the number of times that they occur.
- Process the ranges-file once, totaling the value-counts found within that range.
- Output a file that can be merged with all the other cluster outputs once all the cluster instances have finished doing their job.
The last step will be a “classic merge,” because you know that each process read the ranges-file sequentially, and that it output either zero or one record for each range, in that order. Thus, the merge process (which generates an output that can also if need be be used as its input ...) might go something like this:
- Open the ranges-file and n not-yet-processed input files (including output of the last iteration of this program, if need be).
- “Prime the pump” by reading the first record from each input other than the ranges-file.
- Now, iterate through the ranges-file. For each range, examine all of the input-files which are not yet empty to see if they have a count for that range. If so, add it to the total and read the next record (if any) from that file. If no files contain the range, the answer for that range is zero. Continue until the ranges-file is exhausted.
Once again, merging is a purely-sequential process, and if there are many files to process they can be processed in stages by using the output of one stage as an input to the next iteration.
Run as many instances as you can, provided that each instance never exhausts the amount of real, not virtual, memory that is available in the machine for it. Each instance reads each file sequentially. All of the tallying and range-comparison should occur entirely in memory and should make maximum use of memory as an “instantaneous file.” Each process should exhibit a large working-set size but no page-outs.