Once it is clear that each of your two data files contains half of each of 50 independent data sets, it follows immediately that the main source of time consumption in your processing is the repeated lookups into your HoHoH.
Now, given that each section of your file2 (-i) consists of up to 7 billion lines of up to 12 characters, where each line represents a single bit of data, it becomes pretty obvious that you should be inverting the logic of your lookups.
That is, instead of building a huge, in-memory data structure (I estimate 4GB+ HoHoH) and then comparing each bit position against each of the ~60,000 ranges for that id, you should be building a single bitvector from the 5-7 billion bits in each section (<1GB) and then comparing each range against the small subset of bits it covers using vec.
In this way, you vastly reduce the memory requirement -- by only holding the 2% of lookup data you need in memory at any given time -- and also the combinatorial multipliers involved by only doing direct lookups for the (small) number of bits in each range against just those bits in the bitvector.
To build the bitvector from (1 section of) the positions file, you can do:
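Something along these lines: a minimal sketch only, assuming each line of the section has the hypothetical form "<position> <0|1>". The sub name build_bitvector and the sample data are made up for illustration and are not necessarily what was originally posted.

```perl
use strict;
use warnings;

# Build one bitvector from one section of the positions file.
# ASSUMPTION: each line looks like "<position> <0|1>"; other lines are skipped.
sub build_bitvector {
    my ($fh) = @_;
    my $bits = '';
    while ( my $line = <$fh> ) {
        # One regex match per line extracts the bit position and its value.
        my ( $pos, $bit ) = $line =~ /^(\d+)\s+([01])\s*$/ or next;
        vec( $bits, $pos, 1 ) = $bit;    # the string grows as needed
    }
    return $bits;
}

# Tiny in-memory stand-in for one section of the real file.
my $data = "0 1\n1 0\n2 1\n3 1\n9 1\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $bits = build_bitvector($fh);
close $fh;

# unpack with a '%32b*' template sums the bits, i.e. counts those set to 1.
printf "%d bits set\n", unpack '%32b*', $bits;
```

At one bit per position, 7e9 positions need roughly 875MB, which is where the <1GB figure comes from.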
And then to compare the ranges against that bitvector, accumulate the counts and output the results is just:
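As a rough illustration (a sketch, not the original listing), assuming each range is an inclusive "start end" pair of bit positions. The name count_in_range and the sample ranges are hypothetical.

```perl
use strict;
use warnings;

# Count the set bits covered by one range.
# ASSUMPTION: $start and $end are inclusive bit positions.
sub count_in_range {
    my ( $bits_ref, $start, $end ) = @_;
    my $count = 0;
    # vec returns 0 for positions beyond the end of the string.
    $count += vec( $$bits_ref, $_, 1 ) for $start .. $end;
    return $count;
}

# Stand-in bitvector with bits 2, 3, 5 and 8 set.
my $bits = '';
vec( $bits, $_, 1 ) = 1 for 2, 3, 5, 8;

# Stand-in for the ~60,000 ranges belonging to one id.
my @ranges = ( [ 0, 3 ], [ 4, 9 ], [ 10, 20 ] );

for my $range (@ranges) {
    printf "%d-%d: %d\n", @$range, count_in_range( \$bits, @$range );
}
```

Each range touches only the bits it covers, so the cost is proportional to the total length of the ranges rather than to ranges times positions.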
Putting that together with an outer loop to process the 50 different datasets you get:
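A sketch of how the pieces might fit together. The file layout here is entirely an assumption: both files are taken to hold their sections in the same order, each introduced by a hypothetical header line "ID <name>". The helper make_reader and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Returns a closure that feeds the lines of the next section to a callback
# and returns that section's id (or nothing at end of file).
# ASSUMPTION: every section starts with a header line "ID <name>".
sub make_reader {
    my ($fh) = @_;
    my $pending = <$fh>;    # header of the first section
    return sub {
        my ($on_line) = @_;
        return unless defined $pending;
        my ($id) = $pending =~ /^ID\s+(\S+)/ or die "Expected an ID header";
        $pending = undef;
        while ( defined( my $line = <$fh> ) ) {
            if ( $line =~ /^ID\s/ ) { $pending = $line; last }
            $on_line->($line);
        }
        return $id;
    };
}

# In-memory stand-ins for the two real files.
my $bit_data   = "ID a\n0 1\n1 1\n2 0\n3 1\nID b\n0 0\n1 1\n";
my $range_data = "ID a\n0 1\n2 3\nID b\n0 1\n";
open my $bits_fh,   '<', \$bit_data   or die $!;
open my $ranges_fh, '<', \$range_data or die $!;

my $next_bits   = make_reader($bits_fh);
my $next_ranges = make_reader($ranges_fh);
my %result;

while (1) {
    my $bits = '';    # fresh bitvector: only one dataset is held at a time
    my $id = $next_bits->( sub {
        my ( $pos, $bit ) = $_[0] =~ /^(\d+)\s+([01])/ or return;
        vec( $bits, $pos, 1 ) = $bit;
    } );
    last unless defined $id;

    my $range_id = $next_ranges->( sub {
        my ( $start, $end ) = $_[0] =~ /^(\d+)\s+(\d+)/ or return;
        my $count = 0;
        $count += vec( $bits, $_, 1 ) for $start .. $end;
        push @{ $result{$id} }, "$start-$end:$count";
    } );
    die "Section mismatch for '$id'"
        unless defined $range_id && $range_id eq $id;
}

print "$_: @{ $result{$_} }\n" for sort keys %result;
```

Because the bitvector is rebuilt for every id, memory use stays at the size of the largest single section.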
When run on a pair of files containing 1 dataset (600e6 0/1s and a full 60,000 ranges), building the bitstring took 20 minutes and checking the ranges took just 2 minutes.
Since my (randomly generated, thus possibly non-representative) dataset requires approximately 1/10 the work of one of yours, I estimate that this would cut your overall run time to about 1/4.
Of course, with the simple expedient of pre-processing the datasets into 50 separate file pairs, you could run 4 (8/16/64/256) datasets concurrently and divide your runtime by the equivalent factor.
Indeed, if you were to have the preprocessing step construct the bitstrings and save them to file(s), running the ranges against them would only take 2 or 3 minutes times the number of IDs divided by the number of processors you have. No time at all really.
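Saving and reloading a bitstring is just raw binary I/O. A minimal sketch, where the names save_bits and load_bits are hypothetical and a temporary file stands in for a per-id file:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the raw bytes of a bitvector to disk.
sub save_bits {
    my ( $path, $bits_ref ) = @_;
    open my $out, '>:raw', $path or die "Cannot write $path: $!";
    print {$out} $$bits_ref;
    close $out or die "Cannot close $path: $!";
}

# Slurp a previously saved bitvector back into memory.
sub load_bits {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    local $/;    # slurp mode: read the whole file in one go
    my $bits = <$in>;
    close $in;
    return $bits;
}

# Stand-in bitvector.
my $bits = '';
vec( $bits, $_, 1 ) = 1 for 1, 4, 1000;

# A temporary file stands in for a per-id file chosen by the preprocessing step.
my ( undef, $path ) = tempfile( UNLINK => 1 );
save_bits( $path, \$bits );
my $copy = load_bits($path);

print $copy eq $bits ? "round trip ok\n" : "mismatch\n";
```

The :raw layer matters: without it, newline translation on some platforms would corrupt the bytes.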
(And potentially there are ways of reducing the time taken to construct the bitstrings. My current method invokes the regex engine for every bit, which with 7e9 bits is going to take some time. In theory at least, the 0 or 1 you are interested in is always the (second-)last character on the line, which means it could be extracted using substr. And if every bit position is (sequentially) represented in the file, you do not need to parse and interpret the bit-position numbers.)
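An untimed sketch of that idea, under two assumptions taken from the paragraph above: every position appears exactly once, in order, starting at 0, and the bit is the last character before the newline. The sub name build_bitvector_sequential is made up. I chomp and take the last character, which is equivalent to substr( $line, -2, 1 ) when every line ends in a newline.

```perl
use strict;
use warnings;

# Build the bitvector without a regex and without parsing position numbers.
# ASSUMPTIONS: positions are sequential from 0, one per line, and the
# 0 or 1 is the last character on the line before the newline.
sub build_bitvector_sequential {
    my ($fh) = @_;
    my $bits = '';
    my $pos  = 0;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;    # ignore stray blank lines
        vec( $bits, $pos++, 1 ) = substr( $line, -1 );
    }
    return $bits;
}

# In-memory stand-in for one section.
my $data = "0 1\n1 0\n2 1\n3 1\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $bits = build_bitvector_sequential($fh);
close $fh;

printf "%d bits set\n", unpack '%32b*', $bits;
```

Whether this is actually faster than the regex version would need benchmarking on real data.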
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
In reply to Re: Counter - number of tags per interval (75%+ reduction in runtime)