http://www.perlmonks.org?node_id=1032061


in reply to Counter - number of tags per interval

Hi,

I thought again about your problem while I was doing something else, and it occurred to me that it might be much faster if you did it just completely the other way around (depending, of course, on your actual data).

Your program takes a lot of time because it has to loop over a large data structure for every single line of your very big file. What you could do instead is to start by reading the big file and summarizing its content. My understanding is that you have a lot of repetitive information that you want to count. You could parse the big file once and record in a hash (or an array) the number of times each value comes up. This means you no longer have to run lengthy loops for each line of the big file.
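For instance, something along these lines (a minimal sketch; the file name big_file.txt and the assumption that the value sits in the first whitespace-separated column are mine, since we haven't seen your data):

    use strict;
    use warnings;

    my %count;    # value => number of occurrences
    open my $fh, '<', 'big_file.txt' or die "Cannot open big_file.txt: $!";
    while (my $line = <$fh>) {
        my ($value) = split ' ', $line;   # take the first column as the value
        $count{$value}++;                 # one hash increment per line, no inner loop
    }
    close $fh;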

Once you've read the big file (or possibly just one ID section of it, say the 'a' section), you can use the array or hash summary against the intervals of the other file. This way, the nested loops of your program run against a presumably much smaller summary of the data and it might be much faster.
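Continuing the sketch above, the interval pass might then look like this (again assuming things we don't know: that the values are numeric positions, and that a hypothetical intervals.txt holds one "start end" pair per line):

    open my $iv, '<', 'intervals.txt' or die "Cannot open intervals.txt: $!";
    while (my $line = <$iv>) {
        my ($start, $end) = split ' ', $line;
        my $total = 0;
        # loop over the distinct values only, not over every line of the big file
        for my $value (keys %count) {
            $total += $count{$value} if $value >= $start and $value <= $end;
        }
        print "$start-$end: $total\n";
    }
    close $iv;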

Of course, I have made some assumptions about the data which may turn out to be wrong. We have no idea what your data looks like, which makes it very difficult for us to help you or even offer useful advice. In particular, I have assumed that the data summary can fit in an array or a hash; this may not be possible, as it really depends on how repetitive your data is.


Replies are listed 'Best First'.
Re^2: Counter - number of tags per interval
by sundialsvc4 (Abbot) on May 04, 2013 at 20:31 UTC

    This process would also benefit from being run, say, in a cluster. If your program accepted as parameters the starting and ending line-numbers that it is to process, then multiple independent instances of the same process, running on different machines, could each summarize a chunk of the file, doing it in the manner just suggested (a sketch of such a worker follows this list):

    1. Tally all of the unique values that occur in the chunk, with the number of times that they occur.
    2. Process the ranges-file once, totaling the value-counts found within that range.
    3. Output a file that can be merged with all the other cluster outputs once all the cluster instances have finished doing their job.
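
    A rough sketch of such a worker (the file names, invocation, and output format are all hypothetical): each instance is started as, say, perl worker.pl 1 1000000 < big_file.txt > part1.txt and writes one "start end count" line per range that has a non-zero total.

        use strict;
        use warnings;

        my ($first, $last) = @ARGV;   # line-number range of the chunk this instance owns

        # Step 1: tally the unique values occurring in our chunk of the big file.
        my %count;
        while (my $line = <STDIN>) {
            next if $. < $first;               # $. is the current input line number
            last if $. > $last;
            my ($value) = split ' ', $line;
            $count{$value}++;
        }

        # Step 2: process the ranges-file once, totalling the counts per range.
        open my $rng, '<', 'ranges.txt' or die "Cannot open ranges.txt: $!";
        while (my $line = <$rng>) {
            my ($start, $end) = split ' ', $line;
            my $total = 0;
            for my $value (keys %count) {
                $total += $count{$value} if $value >= $start and $value <= $end;
            }
            # Step 3: zero or one record per range, in ranges-file order.
            print "$start $end $total\n" if $total;
        }
        close $rng;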

    The last step will be a “classic merge,” because you know that each process read the ranges-file sequentially and output either zero or one record for each range, in that order. Thus, the merge process (which generates an output that can, if need be, also be used as its input) might go something like this:

    1. Open the ranges-file and n not-yet-processed input files (including output of the last iteration of this program, if need be).
    2. “Prime the pump” by reading the first record from each input other than the ranges-file.
    3. Now, iterate through the ranges-file. For each range, examine all of the input-files which are not yet empty to see if they have a count for that range. If so, add it to the total and read the next record (if any) from that file. If no files contain the range, the answer for that range is zero. Continue until the ranges-file is exhausted.
    Once again, merging is a purely-sequential process, and if there are many files to process they can be processed in stages, using the output of one stage as an input to the next; a sketch of such a merge follows.
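
    A sketch of that merge, assuming each partial file holds "start end count" lines in ranges-file order (all names hypothetical; invoked as, say, perl merge.pl ranges.txt part1.txt part2.txt > merged.txt):

        use strict;
        use warnings;

        my ($ranges_file, @part_files) = @ARGV;

        # Open every partial file and "prime the pump" with its first record.
        my (@fh, @current);
        for my $i (0 .. $#part_files) {
            open $fh[$i], '<', $part_files[$i] or die "Cannot open $part_files[$i]: $!";
            $current[$i] = readline $fh[$i];
        }

        # Iterate through the ranges-file; every input is read strictly sequentially.
        open my $rng, '<', $ranges_file or die "Cannot open $ranges_file: $!";
        while (my $line = <$rng>) {
            my ($start, $end) = split ' ', $line;
            my $total = 0;
            for my $i (0 .. $#part_files) {
                next unless defined $current[$i];       # this partial file is exhausted
                my ($s, $e, $n) = split ' ', $current[$i];
                if ($s == $start and $e == $end) {      # it has a count for this range
                    $total += $n;
                    $current[$i] = readline $fh[$i];    # advance to its next record
                }
            }
            print "$start $end $total\n";   # zero if no partial file had this range
        }
        close $rng;

    The output is in the same "start end count" format as the inputs, so it can indeed be fed back in as an input to a later stage, as described above.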

    Run as many instances as you can, provided that each instance never exhausts the amount of real (not virtual) memory available to it on its machine. Each instance reads each file sequentially. All of the tallying and range-comparison should occur entirely in memory, making maximum use of memory as an “instantaneous file.” Each process should exhibit a large working-set size but no page-outs.