Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

Hi All. I have two files and they look something like this.

file1 called Methylation.gtf

# Chromosome Gene Feature Start END Strand CG_cou +nt CHG_Count CHH_count Total_meth_count C_count Sequen +ce #Gm10 Glyma13g27990.1 CDS_1 31097691 31097752 + 1 + 1 6 8 9 TTAAGGAAATTATAAATCAAAAAATCAAATTTGACTCGTCACTGGAGT +ACTTGTCCATTTGG

And the second file (mc_11268_10.txt) which looks something like this

assembly position strand class mc h 10 8691 + CG 12 12 10 8692 - CG 3 3 10 8705 + CHG 7 18 10 8707 - CHG 2 4 10 8717 + CG 25 25 10 8718 - CG 6 6 10 8984 + CG 7 7 10 8991 + CG 6 9

To put things in Biological jargon the first file contains information about the gene locations and its corresponding features like CDS or UTRs in a particular chromosome along with the CG,CHH and CHG and C count in that particular feature of the gene. The Second file contains information of methylation sites and positions of a genotypic variant

I need to get all the methylation events that occur for each feature of each gene ie I need to compare if the second column (position) in the second file lies between the Start(4th column) and End(5th column) of the first file if it does then what is the class (4th column second file) to which it belongs then for each class encountered I increment the count for that particular gene feature combination by 1 and so on.

To accomplish this task I tried the following code. Which I admit is not following good coding practices and I apologize for any inconvenience caused due to it. Following is my code

%methhash = (); use Benchmark; $t0 = Benchmark->new; use lib qw(.); use Tie::Autotie 'Tie::IxHash'; tie %methhash, "Tie::IxHash"; open(IN1,"Methylation.gtf"); @genome = grep{/^Gm10/} <IN1>; open(INPUT, "mc_11268_10.txt"); while($line = <INPUT>){ chomp($line); $count++; if($count == 1){ next; } else{ @meth = split("\t",$line); $meth[0] = "Gm".$meth[0]; foreach $genes(@genome){ chomp($genes); @gene_line = split("\t",$genes); if($gene_line[0] eq $meth[0]){ $start = $gene_line[3] ; $end = $gene_line[4]; if(($meth[1]>=$start)&&($meth[1]<=$end)){ if($meth[3] eq "CHH"){ ${$methhash{$genes}}[0]+=1; } elsif($meth[3] eq "CHG"){ ${$methhash{$genes}}[1]+=1; } elsif($meth[3] eq "CG"){ ${$methhash{$genes}}[2]+=1; } } else{next;} } else{next;} @gene_line = ();} @meth=(); } } open(FILEIT,">res_11268_10.txt"); print FILEIT join("\t",qw(Chromosome Gene Feature Start END CG_count_b +y_C CHG_count_by_C CHH_count_by_C)),"\n"; foreach $ge (sort keys %methhash) { @final = split("\t",$ge); if($final[10]==0){ $chh = $chg = $cg = "INF"; } else{ $chh = sprintf("%.3f",(${$methhash{$ge}}[0])/$final[10]); $chg = sprintf("%.3f",(${$methhash{$ge}}[1])/$final[10]); $cg = sprintf("%.3f",(${$methhash{$ge}}[2])/$final[10]); $total_count = sprintf("%.3f",((${$methhash{$ge}}[2])+(${$methhash +{$ge}}[1])+(${$methhash{$ge}}[0]))/$final[10]); } print FILEIT join("\t",(@final[0..4],$cg,$chg,$chh)),"\n"; $chh = $chg=$cg=$total_count = ''; } $t1 = Benchmark->new; $td = timediff($t1, $t0); print "the code took:",timestr($td),"\n";

Although the code does well if I test it with small portions of the second file and gives accurate result for that small file but when I run it with the whole second file which has around 2104595 lines it takes forever. Thus it would be great if anybody can suggest a more efficient way of solving this problem. Thanks

PS: Note that I have tried HOHOH to solve this problem and it is less efficient than this solution as it requires more looping


In reply to Comparing and getting information from two large files and appending it in a new file by perlkhan77

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others chilling in the Monastery: (7)
    As of 2014-09-20 18:52 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      How do you remember the number of days in each month?











      Results (160 votes), past polls