Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Hi All. I have two files and they look something like this.

file1 called Methylation.gtf

# Chromosome Gene Feature Start END Strand CG_cou +nt CHG_Count CHH_count Total_meth_count C_count Sequen +ce #Gm10 Glyma13g27990.1 CDS_1 31097691 31097752 + 1 + 1 6 8 9 TTAAGGAAATTATAAATCAAAAAATCAAATTTGACTCGTCACTGGAGT +ACTTGTCCATTTGG

And the second file (mc_11268_10.txt) which looks something like this

assembly position strand class mc h 10 8691 + CG 12 12 10 8692 - CG 3 3 10 8705 + CHG 7 18 10 8707 - CHG 2 4 10 8717 + CG 25 25 10 8718 - CG 6 6 10 8984 + CG 7 7 10 8991 + CG 6 9

To put things in Biological jargon the first file contains information about the gene locations and its corresponding features like CDS or UTRs in a particular chromosome along with the CG,CHH and CHG and C count in that particular feature of the gene. The Second file contains information of methylation sites and positions of a genotypic variant

I need to get all the methylation events that occur for each feature of each gene ie I need to compare if the second column (position) in the second file lies between the Start(4th column) and End(5th column) of the first file if it does then what is the class (4th column second file) to which it belongs then for each class encountered I increment the count for that particular gene feature combination by 1 and so on.

To accomplish this task I tried the following code. Which I admit is not following good coding practices and I apologize for any inconvenience caused due to it. Following is my code

%methhash = (); use Benchmark; $t0 = Benchmark->new; use lib qw(.); use Tie::Autotie 'Tie::IxHash'; tie %methhash, "Tie::IxHash"; open(IN1,"Methylation.gtf"); @genome = grep{/^Gm10/} <IN1>; open(INPUT, "mc_11268_10.txt"); while($line = <INPUT>){ chomp($line); $count++; if($count == 1){ next; } else{ @meth = split("\t",$line); $meth[0] = "Gm".$meth[0]; foreach $genes(@genome){ chomp($genes); @gene_line = split("\t",$genes); if($gene_line[0] eq $meth[0]){ $start = $gene_line[3] ; $end = $gene_line[4]; if(($meth[1]>=$start)&&($meth[1]<=$end)){ if($meth[3] eq "CHH"){ ${$methhash{$genes}}[0]+=1; } elsif($meth[3] eq "CHG"){ ${$methhash{$genes}}[1]+=1; } elsif($meth[3] eq "CG"){ ${$methhash{$genes}}[2]+=1; } } else{next;} } else{next;} @gene_line = ();} @meth=(); } } open(FILEIT,">res_11268_10.txt"); print FILEIT join("\t",qw(Chromosome Gene Feature Start END CG_count_b +y_C CHG_count_by_C CHH_count_by_C)),"\n"; foreach $ge (sort keys %methhash) { @final = split("\t",$ge); if($final[10]==0){ $chh = $chg = $cg = "INF"; } else{ $chh = sprintf("%.3f",(${$methhash{$ge}}[0])/$final[10]); $chg = sprintf("%.3f",(${$methhash{$ge}}[1])/$final[10]); $cg = sprintf("%.3f",(${$methhash{$ge}}[2])/$final[10]); $total_count = sprintf("%.3f",((${$methhash{$ge}}[2])+(${$methhash +{$ge}}[1])+(${$methhash{$ge}}[0]))/$final[10]); } print FILEIT join("\t",(@final[0..4],$cg,$chg,$chh)),"\n"; $chh = $chg=$cg=$total_count = ''; } $t1 = Benchmark->new; $td = timediff($t1, $t0); print "the code took:",timestr($td),"\n";

Although the code does well if I test it with small portions of the second file and gives accurate result for that small file but when I run it with the whole second file which has around 2104595 lines it takes forever. Thus it would be great if anybody can suggest a more efficient way of solving this problem. Thanks

PS: Note that I have tried HOHOH to solve this problem and it is less efficient than this solution as it requires more looping


In reply to Comparing and getting information from two large files and appending it in a new file by perlkhan77

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others musing on the Monastery: (2)
As of 2024-04-20 03:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found