This process would also benefit from being run, say, in a cluster. If your program accepted as parameters the starting and ending line-numbers that it is to process, then multiple independent instances of the same process, running on different machines, could each summarize a chunk of the file, doing it in the manner just suggested:
- Tally all of the unique values that occur in the chunk, with the number of times that they occur.
- Process the ranges-file once, totaling the value-counts found within that range.
- Output a file that can be merged with all the other cluster outputs once all the cluster instances have finished doing their job.
The last step will be a “classic merge,” because you know that each process read the ranges-file sequentially, and that it output either zero or one record for each range, in that order. Thus, the merge process (which generates an output that can also if need be be used as its input ...) might go something like this:
- Open the ranges-file and n not-yet-processed input files (including output of the last iteration of this program, if need be).
- “Prime the pump” by reading the first record from each input other than the ranges-file.
- Now, iterate through the ranges-file. For each range, examine all of the input-files which are not yet empty to see if they have a count for that range. If so, add it to the total and read the next record (if any) from that file. If no files contain the range, the answer for that range is zero. Continue until the ranges-file is exhausted.
Once again, merging is a purely-sequential process, and if there are many files to process they can be processed in stages by using the output of one stage as an input to the next iteration.
Run as many instances as you can, provided that each instance never exhausts the amount of real, not virtual, memory that is available in the machine for it. Each instance reads each file sequentially. All of the tallying and range-comparison should occur entirely in memory and should make maximum use of memory as an “instantaneous file.” Each process should exhibit a large working-set size but no page-outs.
Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
Read Where should I post X? if you're not absolutely sure you're posting in the right place.
Please read these before you post! —
Posts may use any of the Perl Monks Approved HTML tags:
You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
- a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
Link using PerlMonks shortcuts! What shortcuts can I use for linking?
See Writeup Formatting Tips and other pages linked from there for more info.
| & || & |
| < || < |
| > || > |
| [ || [ |
| ] || ] ||