http://www.perlmonks.org?node_id=962693

perlkhan77 has asked for the wisdom of the Perl Monks concerning the following question:

Hi All. I have two files and they look something like this.

file1 called Methylation.gtf

# Chromosome Gene Feature Start END Strand CG_cou +nt CHG_Count CHH_count Total_meth_count C_count Sequen +ce #Gm10 Glyma13g27990.1 CDS_1 31097691 31097752 + 1 + 1 6 8 9 TTAAGGAAATTATAAATCAAAAAATCAAATTTGACTCGTCACTGGAGT +ACTTGTCCATTTGG

And the second file (mc_11268_10.txt) which looks something like this

assembly position strand class mc h 10 8691 + CG 12 12 10 8692 - CG 3 3 10 8705 + CHG 7 18 10 8707 - CHG 2 4 10 8717 + CG 25 25 10 8718 - CG 6 6 10 8984 + CG 7 7 10 8991 + CG 6 9

To put things in Biological jargon the first file contains information about the gene locations and its corresponding features like CDS or UTRs in a particular chromosome along with the CG,CHH and CHG and C count in that particular feature of the gene. The Second file contains information of methylation sites and positions of a genotypic variant

I need to get all the methylation events that occur for each feature of each gene ie I need to compare if the second column (position) in the second file lies between the Start(4th column) and End(5th column) of the first file if it does then what is the class (4th column second file) to which it belongs then for each class encountered I increment the count for that particular gene feature combination by 1 and so on.

To accomplish this task I tried the following code. Which I admit is not following good coding practices and I apologize for any inconvenience caused due to it. Following is my code

%methhash = (); use Benchmark; $t0 = Benchmark->new; use lib qw(.); use Tie::Autotie 'Tie::IxHash'; tie %methhash, "Tie::IxHash"; open(IN1,"Methylation.gtf"); @genome = grep{/^Gm10/} <IN1>; open(INPUT, "mc_11268_10.txt"); while($line = <INPUT>){ chomp($line); $count++; if($count == 1){ next; } else{ @meth = split("\t",$line); $meth[0] = "Gm".$meth[0]; foreach $genes(@genome){ chomp($genes); @gene_line = split("\t",$genes); if($gene_line[0] eq $meth[0]){ $start = $gene_line[3] ; $end = $gene_line[4]; if(($meth[1]>=$start)&&($meth[1]<=$end)){ if($meth[3] eq "CHH"){ ${$methhash{$genes}}[0]+=1; } elsif($meth[3] eq "CHG"){ ${$methhash{$genes}}[1]+=1; } elsif($meth[3] eq "CG"){ ${$methhash{$genes}}[2]+=1; } } else{next;} } else{next;} @gene_line = ();} @meth=(); } } open(FILEIT,">res_11268_10.txt"); print FILEIT join("\t",qw(Chromosome Gene Feature Start END CG_count_b +y_C CHG_count_by_C CHH_count_by_C)),"\n"; foreach $ge (sort keys %methhash) { @final = split("\t",$ge); if($final[10]==0){ $chh = $chg = $cg = "INF"; } else{ $chh = sprintf("%.3f",(${$methhash{$ge}}[0])/$final[10]); $chg = sprintf("%.3f",(${$methhash{$ge}}[1])/$final[10]); $cg = sprintf("%.3f",(${$methhash{$ge}}[2])/$final[10]); $total_count = sprintf("%.3f",((${$methhash{$ge}}[2])+(${$methhash +{$ge}}[1])+(${$methhash{$ge}}[0]))/$final[10]); } print FILEIT join("\t",(@final[0..4],$cg,$chg,$chh)),"\n"; $chh = $chg=$cg=$total_count = ''; } $t1 = Benchmark->new; $td = timediff($t1, $t0); print "the code took:",timestr($td),"\n";

Although the code does well if I test it with small portions of the second file and gives accurate result for that small file but when I run it with the whole second file which has around 2104595 lines it takes forever. Thus it would be great if anybody can suggest a more efficient way of solving this problem. Thanks

PS: Note that I have tried HOHOH to solve this problem and it is less efficient than this solution as it requires more looping