Comparing and getting information from two large files and appending it in a new file
by perlkhan77 (Acolyte) on Mar 30, 2012 at 22:37 UTC
perlkhan77 has asked for the
wisdom of the Perl Monks concerning the following question:
Hi All. I have two files and they look something like this.
The first file is called Methylation.gtf
The second file (mc_11268_10.txt) looks something like this
To put things in biological terms, the first file contains the locations of genes and their features (such as CDS or UTRs) on a particular chromosome, along with the CG, CHH, CHG and C counts for each feature of the gene. The second file contains the methylation sites and positions for a genotypic variant.
I need to get all the methylation events that occur in each feature of each gene. That is, I need to check whether the position (2nd column) in the second file lies between the Start (4th column) and End (5th column) of a feature in the first file. If it does, I look up the class (4th column of the second file) that the site belongs to, and for each class encountered I increment the count for that particular gene/feature combination by 1, and so on.
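The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code from this post, and it rests on several assumptions that the description does not confirm: both files are tab-separated, column 1 of each file is the chromosome, column 3 of the GTF is the feature type, and column 9 identifies the gene. The sub name `count_methylation` and the composite key layout are hypothetical.

```perl
use strict;
use warnings;

# Count methylation classes per GTF feature row.
# ASSUMPTIONS (not confirmed by the post): tab-separated input,
# chromosome in column 1 of both files, feature type in GTF column 3,
# gene identifier in GTF column 9.
sub count_methylation {
    my ($gtf_lines, $site_lines) = @_;

    # Group features by chromosome so each site is only compared
    # against features on its own chromosome.
    my %features;
    for my $line (@$gtf_lines) {
        chomp(my $l = $line);
        next if $l =~ /^\s*(?:#|$)/;          # skip comments and blank lines
        my @f = split /\t/, $l;
        push @{ $features{ $f[0] } }, {
            key   => "$f[8]\t$f[2]\t$f[3]\t$f[4]",   # gene, feature, start, end
            start => $f[3],
            end   => $f[4],
        };
    }

    my %count;    # $count{"gene\tfeature\tstart\tend"}{class} = hits
    for my $line (@$site_lines) {
        chomp(my $l = $line);
        my ($chr, $pos, undef, $class) = split /\t/, $l;
        # Skip header or malformed lines without a numeric position.
        next unless defined $class && $pos =~ /^\d+$/;
        for my $feat (@{ $features{$chr} || [] }) {
            $count{ $feat->{key} }{$class}++
                if $pos >= $feat->{start} && $pos <= $feat->{end};
        }
    }
    return \%count;
}
```

The sub takes two array references of lines, so for real files the lines could be read from filehandles first. It still scans every feature on a chromosome for each site, so it shows the counting logic rather than a fast solution for the full 2104595-line file.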
To accomplish this task I tried the following code, which I admit does not follow good coding practices, and I apologize for any inconvenience this causes. Following is my code
The code works well and gives accurate results when I test it with small portions of the second file, but when I run it on the whole second file, which has around 2104595 lines, it takes forever. It would be great if anybody could suggest a more efficient way of solving this problem. Thanks
PS: Note that I have tried a HoHoH (hash of hashes of hashes) to solve this problem, and it is less efficient than this solution because it requires more looping.