Re: Comparing and getting information from two large files and appending it in a new file
by graff (Chancellor) on Mar 31, 2012 at 03:05 UTC
Well, your first problem is that you are doing a lot of the same work over and over again on the @genome array every time you read a line from INPUT (the "mc_11268_10.txt" file). You should build a data structure for the "genome" information as you read the data from the first file, so that you can use the data more efficiently while reading the second file.
Since you seem to be using whole lines of the first file as hash keys, you could just load that hash while reading the first file. Since most of the effort when reading the second file is to locate a matching range for each line of data, the crux of the problem is to figure out how to do this as quickly as possible.
Also, why do you use Tie::Autotie with "Tie::IxHash"? Since you sort the keys when producing your output, the insertion order that Tie::IxHash preserves is never used, so I think you could use just a plain old hash.
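To illustrate the point about sorted output, here is a tiny sketch with made-up keys and counts (nothing here comes from the OP's files). Because the keys are sorted at output time, the order in which they went into the hash makes no difference:

```perl
use strict;
use warnings;

# Hypothetical keys and counts, inserted in no particular order.
my %count_for = (
    "chr2\t500" => 3,
    "chr1\t100" => 1,
    "chr1\t050" => 7,
);

# Sorting the keys at output time fixes the order, whatever the insertion order was.
my @ordered = sort keys %count_for;
print "$_\t$count_for{$_}\n" for @ordered;
```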
Given the sample data you've shown (which I think has a mistake: did you mean to put '#' at the start of both lines in the first file?), it's hard to tell what proper output should look like, because the posted data produces no output with the posted code. So, I can't really be sure whether the alternative below would do the right thing -- I can only tell that it produces the same result as the OP code when using the OP data samples.
I've taken the trouble to make it work with "use strict", just to show you that doing this really doesn't involve a lot of effort, and is well worth it. The logic is very different: build a couple of hashes while reading the first file, and also create a sorted array of the "Start" values, to reduce the amount of work that needs to be done while reading the second file. Overall, this uses "split" a lot less often, and when looking to see which range a given INPUT line falls into, it stops looping over ranges as soon as the right one is found.
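For readers who want to see the shape of that logic, here is a minimal sketch. It is an illustration of the approach described above, not the code originally posted, and the data layout is an assumption: "genome" lines are taken to be tab-separated name, start, end, with '#' marking comment lines, and each INPUT line is taken to carry a position in its first field. The names %range_for, %count_for and @starts are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical sample data; the real file layouts are assumed, not known.
my $genome_text = join "\n",
    "# name\tstart\tend",
    "geneA\t100\t200",
    "geneB\t300\t400",
    "geneC\t150\t250",
    "";
my $input_text = "120\n175\n350\n999\n";

my %range_for;    # start value => [ end value, whole genome line ]
my %count_for;    # whole genome line => number of matching input lines

# Pass 1: split each genome line once and remember its range.
# Note: records sharing a start value would overwrite each other here.
open my $genome_fh, '<', \$genome_text or die "Cannot read genome data: $!";
while ( my $line = <$genome_fh> ) {
    chomp $line;
    next if $line =~ /^#/ or $line !~ /\S/;
    my ( undef, $start, $end ) = split /\t/, $line;
    $range_for{$start} = [ $end, $line ];
    $count_for{$line}  = 0;
}
close $genome_fh;

# Sorted start values let the lookup loop quit early.
my @starts = sort { $a <=> $b } keys %range_for;

# Pass 2: find the first range that contains each input position.
open my $input_fh, '<', \$input_text or die "Cannot read input data: $!";
while ( my $line = <$input_fh> ) {
    chomp $line;
    my ($pos) = split /\t/, $line;
    for my $start (@starts) {
        last if $start > $pos;    # every later range starts beyond $pos
        my ( $end, $genome_line ) = @{ $range_for{$start} };
        if ( $pos <= $end ) {
            $count_for{$genome_line}++;
            last;                 # stop at the first matching range
        }
    }
}
close $input_fh;

print "$_\t$count_for{$_}\n" for sort keys %count_for;
```

With this sample data, position 175 falls inside both geneA and geneC, but only geneA (the range with the lower start) is incremented.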
(UPDATE: Shortly after posting, I realized that my range-matching logic might produce lower counts in its output than yours -- in particular, if different "genome" records in the first file have overlapping ranges, and if a given input from the second file could match multiple ranges that happen to have different starting values, then my code will only increment the first matching range. It wouldn't be hard to fix, but if that condition never comes up, there's no need to fix it.)
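If overlapping ranges do turn up, one possible fix is to keep scanning after a hit instead of stopping at the first one. The sketch below is a hedged illustration with made-up ranges and positions; @ranges and matching_labels are hypothetical names, and storing the ranges as an array (rather than a hash keyed by start) is a choice made here so that records with the same start value are not lost.

```perl
use strict;
use warnings;

# Hypothetical ranges as [ start, end, label ], sorted by start value.
my @ranges = sort { $a->[0] <=> $b->[0] } (
    [ 100, 200, 'geneA' ],
    [ 300, 400, 'geneB' ],
    [ 150, 250, 'geneC' ],
);

# Return the labels of every range that contains $pos.
sub matching_labels {
    my ($pos) = @_;
    my @hits;
    for my $range (@ranges) {
        my ( $start, $end, $label ) = @$range;
        last if $start > $pos;    # sorted by start: nothing later can match
        push @hits, $label if $pos <= $end;
    }
    return @hits;
}

# Count each position against all ranges it falls into.
my %count_for;
for my $pos ( 120, 175, 350, 999 ) {
    $count_for{$_}++ for matching_labels($pos);
}

print "$_\t$count_for{$_}\n" for sort keys %count_for;
```

The early exit on the start value is still valid with overlaps; only the exit after the first hit has to go.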
If that produces output that isn't right, and if you can't grok how it works (or how it fails), you'll need to post some relevant data samples that produce output (both good and bad).