comment on

Hi All. I have two files and they look something like this.

file1 called Methylation.gtf

#    Chromosome    Gene    Feature    Start    END    Strand    CG_cou
+nt    CHG_Count    CHH_count    Total_meth_count    C_count    Sequen
+ce
#Gm10    Glyma13g27990.1    CDS_1    31097691    31097752    +    1   
+ 1    6    8    9    TTAAGGAAATTATAAATCAAAAAATCAAATTTGACTCGTCACTGGAGT
+ACTTGTCCATTTGG
[download]

And the second file (mc_11268_10.txt) which looks something like this

assembly    position    strand    class    mc    h
10    8691    +    CG    12    12
10    8692    -    CG    3    3
10    8705    +    CHG    7    18
10    8707    -    CHG    2    4
10    8717    +    CG    25    25
10    8718    -    CG    6    6
10    8984    +    CG    7    7
10    8991    +    CG    6    9
[download]

To put things in Biological jargon the first file contains information about the gene locations and its corresponding features like CDS or UTRs in a particular chromosome along with the CG,CHH and CHG and C count in that particular feature of the gene. The Second file contains information of methylation sites and positions of a genotypic variant

I need to get all the methylation events that occur for each feature of each gene ie I need to compare if the second column (position) in the second file lies between the Start(4th column) and End(5th column) of the first file if it does then what is the class (4th column second file) to which it belongs then for each class encountered I increment the count for that particular gene feature combination by 1 and so on.

To accomplish this task I tried the following code. Which I admit is not following good coding practices and I apologize for any inconvenience caused due to it. Following is my code

 
%methhash = ();
use Benchmark;
    $t0 = Benchmark->new;
use lib qw(.);
use Tie::Autotie 'Tie::IxHash';
tie %methhash, "Tie::IxHash";
open(IN1,"Methylation.gtf");
@genome = grep{/^Gm10/} <IN1>;
open(INPUT, "mc_11268_10.txt");
while($line = <INPUT>){
chomp($line);
$count++;
if($count == 1){
next;
}
else{
@meth = split("\t",$line);
 $meth[0] = "Gm".$meth[0];
foreach $genes(@genome){
    chomp($genes);
    @gene_line = split("\t",$genes);
    if($gene_line[0] eq $meth[0]){
    $start = $gene_line[3] ;
    $end = $gene_line[4];
        if(($meth[1]>=$start)&&($meth[1]<=$end)){
                if($meth[3] eq "CHH"){
            ${$methhash{$genes}}[0]+=1;
                        }
            elsif($meth[3] eq "CHG"){
                ${$methhash{$genes}}[1]+=1;
            }
            elsif($meth[3] eq "CG"){
                ${$methhash{$genes}}[2]+=1;
                    }
        }
        
        else{next;}
        }
        else{next;}
        @gene_line = ();}
        @meth=();
        }
        }
        open(FILEIT,">res_11268_10.txt");
print FILEIT join("\t",qw(Chromosome Gene Feature Start END CG_count_b
+y_C CHG_count_by_C CHH_count_by_C)),"\n";
foreach $ge (sort keys %methhash) {
    @final = split("\t",$ge);    
    if($final[10]==0){
    $chh = $chg = $cg = "INF";
    }
    else{
    $chh = sprintf("%.3f",(${$methhash{$ge}}[0])/$final[10]);
    $chg = sprintf("%.3f",(${$methhash{$ge}}[1])/$final[10]);
    $cg = sprintf("%.3f",(${$methhash{$ge}}[2])/$final[10]);
    $total_count = sprintf("%.3f",((${$methhash{$ge}}[2])+(${$methhash
+{$ge}}[1])+(${$methhash{$ge}}[0]))/$final[10]);
    }
    print FILEIT join("\t",(@final[0..4],$cg,$chg,$chh)),"\n";
    $chh = $chg=$cg=$total_count = '';
}
    
    $t1 = Benchmark->new;
$td = timediff($t1, $t0);
print "the code took:",timestr($td),"\n";
[download]

Although the code does well if I test it with small portions of the second file and gives accurate result for that small file but when I run it with the whole second file which has around 2104595 lines it takes forever. Thus it would be great if anybody can suggest a more efficient way of solving this problem. Thanks

PS: Note that I have tried HOHOH to solve this problem and it is less efficient than this solution as it requires more looping

In reply to Comparing and getting information from two large files and appending it in a new file by perlkhan77

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Do you know where your variables are?
	PerlMonks