Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Comparing and getting information from two large files and appending it in a new file

by perlkhan77 (Acolyte)
on Mar 30, 2012 at 22:37 UTC ( #962693=perlquestion: print w/ replies, xml ) Need Help??
perlkhan77 has asked for the wisdom of the Perl Monks concerning the following question:

Hi All. I have two files and they look something like this.

file1 called Methylation.gtf

# Chromosome Gene Feature Start END Strand CG_cou +nt CHG_Count CHH_count Total_meth_count C_count Sequen +ce #Gm10 Glyma13g27990.1 CDS_1 31097691 31097752 + 1 + 1 6 8 9 TTAAGGAAATTATAAATCAAAAAATCAAATTTGACTCGTCACTGGAGT +ACTTGTCCATTTGG

And the second file (mc_11268_10.txt) which looks something like this

assembly position strand class mc h 10 8691 + CG 12 12 10 8692 - CG 3 3 10 8705 + CHG 7 18 10 8707 - CHG 2 4 10 8717 + CG 25 25 10 8718 - CG 6 6 10 8984 + CG 7 7 10 8991 + CG 6 9

To put things in Biological jargon the first file contains information about the gene locations and its corresponding features like CDS or UTRs in a particular chromosome along with the CG,CHH and CHG and C count in that particular feature of the gene. The Second file contains information of methylation sites and positions of a genotypic variant

I need to get all the methylation events that occur for each feature of each gene ie I need to compare if the second column (position) in the second file lies between the Start(4th column) and End(5th column) of the first file if it does then what is the class (4th column second file) to which it belongs then for each class encountered I increment the count for that particular gene feature combination by 1 and so on.

To accomplish this task I tried the following code. Which I admit is not following good coding practices and I apologize for any inconvenience caused due to it. Following is my code

%methhash = (); use Benchmark; $t0 = Benchmark->new; use lib qw(.); use Tie::Autotie 'Tie::IxHash'; tie %methhash, "Tie::IxHash"; open(IN1,"Methylation.gtf"); @genome = grep{/^Gm10/} <IN1>; open(INPUT, "mc_11268_10.txt"); while($line = <INPUT>){ chomp($line); $count++; if($count == 1){ next; } else{ @meth = split("\t",$line); $meth[0] = "Gm".$meth[0]; foreach $genes(@genome){ chomp($genes); @gene_line = split("\t",$genes); if($gene_line[0] eq $meth[0]){ $start = $gene_line[3] ; $end = $gene_line[4]; if(($meth[1]>=$start)&&($meth[1]<=$end)){ if($meth[3] eq "CHH"){ ${$methhash{$genes}}[0]+=1; } elsif($meth[3] eq "CHG"){ ${$methhash{$genes}}[1]+=1; } elsif($meth[3] eq "CG"){ ${$methhash{$genes}}[2]+=1; } } else{next;} } else{next;} @gene_line = ();} @meth=(); } } open(FILEIT,">res_11268_10.txt"); print FILEIT join("\t",qw(Chromosome Gene Feature Start END CG_count_b +y_C CHG_count_by_C CHH_count_by_C)),"\n"; foreach $ge (sort keys %methhash) { @final = split("\t",$ge); if($final[10]==0){ $chh = $chg = $cg = "INF"; } else{ $chh = sprintf("%.3f",(${$methhash{$ge}}[0])/$final[10]); $chg = sprintf("%.3f",(${$methhash{$ge}}[1])/$final[10]); $cg = sprintf("%.3f",(${$methhash{$ge}}[2])/$final[10]); $total_count = sprintf("%.3f",((${$methhash{$ge}}[2])+(${$methhash +{$ge}}[1])+(${$methhash{$ge}}[0]))/$final[10]); } print FILEIT join("\t",(@final[0..4],$cg,$chg,$chh)),"\n"; $chh = $chg=$cg=$total_count = ''; } $t1 = Benchmark->new; $td = timediff($t1, $t0); print "the code took:",timestr($td),"\n";

Although the code does well if I test it with small portions of the second file and gives accurate result for that small file but when I run it with the whole second file which has around 2104595 lines it takes forever. Thus it would be great if anybody can suggest a more efficient way of solving this problem. Thanks

PS: Note that I have tried HOHOH to solve this problem and it is less efficient than this solution as it requires more looping

Comment on Comparing and getting information from two large files and appending it in a new file
Select or Download Code
Re: Comparing and getting information from two large files and appending it in a new file
by remiah (Hermit) on Mar 30, 2012 at 23:53 UTC
    Hello

    Maybe, you can save @genome array if you change
    foreach $genes(@genome){ chomp($genes);
    to
    while ( $genes =<IN1> ){ chomp($genes); next if ( $genes !~ /^Gm10/ );
    regards.
Re: Comparing and getting information from two large files and appending it in a new file
by aaron_baugher (Deacon) on Mar 31, 2012 at 02:27 UTC

    First of all, if you use consistent indentation, it'll be easier to tell where you're going wrong. Perl::Tidy can fix that up for you.

    Tie::IxHash gives you an ordered hash, which you don't appear to be using, since you sort by the hash keys later anyway. I don't know anything more about that module, but I assume it introduces some overhead, so you might as well stick with a normal hash.

    Your $count variable check seems to have no purpose except to skip the first line in your second file. That being the case, simply discard that line with <INPUT>; before you start your while loop, and you can remove that counter completely.

    But the big problem is @genome. You load your entire first file into this array, which may or may not be necessary. But then you process it, splitting each line, every time you process a line from your second file. This adds an order of magnitude to the processing time. The better way to do it would be to process the first file into a hash from which you can lookup that information, and do that preparation once -- before you start looping through your second file.

    I know you said you already tried a hash-of-hashes-of-hashes (I assume that's what HOHOH means) and it wasn't as efficient as this, but with all due respect, that's just not possible. This looks like a clear-cut case for "load file1 into a hash and check file2 against it." You might show us the code you tried in that case, so we can help you in that direction.

    Aaron B.
    My Woefully Neglected Blog, where I occasionally mention Perl.

Re: Comparing and getting information from two large files and appending it in a new file
by graff (Chancellor) on Mar 31, 2012 at 03:05 UTC
    Well, your first problem is that you are doing a lot of the same work over and over again on the @genome array every time you read a line from INPUT (the "mc_11268_10.txt" file). You should build a data structure for the "genome" information as you read the data from the first file, so that you can use the data more efficiently while reading the second file.

    Since you seem to be using whole lines of the first file as hash keys, you could just load that hash while reading the first file. Since most of the effort when reading the second file is to locate a matching range for each line of data, the crux of the problem is to figure out how to do this as quickly as possible.

    And, why do you use Tie::Autotie "Tie::IxHash"? Since you sort the keys to do your output, I think you could use just a plain old hash.

    Given the sample data you've shown (which I think has a mistake: did you mean to put '#' at the start of both lines in the first file?), it's hard to tell what proper output should look like, because the posted data produces no output with the posted code. So, I can't really be sure whether the alternative below would do the right thing -- I can only tell that it produces the same result as the OP code when using the OP data samples.

    I've taken the trouble to make it work with "use strict", just to show you that doing this really doesn't involve a lot of effort, and is well worth it. The logic is very different: build a couple hashes while reading the first file, and also create a sorted array of the "Start" values, to reduce the amount of work that needs to be done while reading the second file. Overall, this uses "split" a lot less often, and when looking to see which range a given INPUT line falls into, it stops looping over ranges as soon as the right one is found.

    (UPDATE: Shortly after posting, I realized that my range-matching logic might produce lower counts in its output than yours -- in particular, if different "genome" records in the first file have overlapping ranges, and if a given input from the second file could match multiple ranges that happen to have different starting values, then my code will only increment the first matching range. It wouldn't be hard to fix, but if that condition never comes up, there's no need to fix it.)

    #!/usr/bin/perl use strict; use Benchmark; my %methhash; my %methrange; my $t0 = Benchmark->new; open(IN1,"Methylation.gtf"); while (<IN1>) { chomp; if ( /^Gm10/ ) { my ( $bgn, $end, $methcount ) = (split /\t/ )[3,4,10]; $methhash{$_}{methcount} = $methcount; $methrange{$bgn}{$end} = $_; } } my @sorted_meth_starts = sort {$a<=>$b} keys %methrange; open(INPUT, "mc_11268_10.txt"); while(<INPUT>) { next unless /^(\d+)\t(\d+)\t\S\t(\w+)/; # skip header line and an +y blank lines chomp; my ( $meth, $position, $strand ) = ( "Gm$1", $2, $3 ); my $indx = $#sorted_meth_starts; while ( $indx >= 0 and $position < $sorted_meth_starts[$indx] ) { $indx--; } next if ( $indx < 0 ); # skip if $position is lower than start of +lowest range my $bgn = $sorted_meth_starts[$indx]; for my $end ( keys %{$methrange{$bgn}} ) { next if ( $position > $end ); # skip if position is outside th +is range my $match = $methrange{$bgn}{$end}; next if ( $match !~ /^$meth/ ); # skip if assembly doesn't mat +ch if ( $strand =~ /^(CH?G|CHH)$/ ) { $methhash{$match}{$strand}++; } } } open(FILEIT,">res_11268_10.txt"); print FILEIT join("\t",qw(Chromosome Gene Feature Start END CG_count_b +y_C CHG_count_by_C CHH_count_by_C)),"\n"; for my $ge (sort keys %methhash) { my ( $chh, $chg, $cg, $total_count ); if ( $methhash{$ge}{methcount} == 0 ) { $chh = $chg = $cg = "INF"; } else { my $denom = $methhash{$ge}{methcount}; $chh = sprintf( "%.3f", $methhash{$ge}{CHH} / $denom ); $chg = sprintf( "%.3f", $methhash{$ge}{CHG} / $denom ); $cg = sprintf( "%.3f", $methhash{$ge}{CG} / $denom ); $total_count = sprintf( "%.3f", ( $methhash{$ge}{CHH}+$methhas +h{$ge}{CHG}+$methhash{$ge}{CG}) / $denom ); } print FILEIT join( "\t", ( split /\t/, $ge )[0..4], $cg, $chg, $ch +h ),"\n"; } my $t1 = Benchmark->new; my $td = timediff($t1, $t0); print "the code took:",timestr($td),"\n";
    If that produces output that isn't right, and if you can't grok how it works (or how it fails), you'll need to post some relevant data samples that produce output (both good and bad).

      The lines of the First file are long so they are being printed in multiple lines. Previously I only gave one line now I am including many lines in the following file sample for file 1 I hope file2 format is understandable.

      Gm10 Glyma10g00200.1 CDS_11 8569 8705 - 2 2 6 + 10 14 GAAGAACCAAAATCGCTGATTTTGAAAAATAGGGGGACTGAATGTCTTGATTT +GGAAATAGGGGGACCAAAATTGCGGATTGAGCATTATAGGGGACCAATAGTGTATTTTAGCCTTTATTT +TATTTTGAATGTGTG Gm10 Glyma10g00200.1 CDS_10 8822 8822 - 0 0 0 + 0 0 G Gm10 Glyma10g00200.1 CDS_9 8885 9079 - 2 13 2 +3 38 50 GTATTTGTGTATTGCTCCTGGATCTAGACCTTCCCCTGTTGCTGCTGCTGCT +GCAAGTCACAAATTAATCACATGTATTTCATGCTTCGATGAGCGTTCACTTGCATATCATGCTGTTGGA +TATGGAAGAGGATCTCACATTCCAGCAGTAGCCATTACATCATCAGGCACTGCAGTTTCCAACCTTCTT +CCTGC Gm10 Glyma10g00200.1 CDS_8 9151 9255 - 2 2 11 + 15 19 GTTGGATCATGCCAATGAATTATCAAACTCCCTGAAAGAAAGTGCCAATATTA +ACACTGTTCGGGCATCGCTTATTGTTGAAGAATGCACAAGACTTGGTTTGAT Gm10 Glyma10g00200.1 CDS_7 11534 11698 - 1 7 +16 24 33 GGAATATACTGATCCTCACAAAGGTTCTCTTAACCTTGCTGTGAAGTTGCC +TAATGGTTCTCTAAAGCCAGACCTGGGGCCAAAAACATATATTGCTTATGGATTTCTTCAGGAGCTTGG +ACGTGGTGATTCAGTGACTAAGCTCCATTGTGATATGTCTGATGC Gm10 Glyma10g00200.1 CDS_6 12777 12823 - 0 0 +8 8 11 GCTTTTTACACAAATTTCATTCTTCACCTTCATTCTCATGAAGATTA Gm10 Glyma10g00200.1 CDS_5 12958 13022 - 1 4 +4 9 10 GCTGTAGATCTTCAGTACAAGTATTTAAGGCATTTTCAGTGGCATTGGGAAAA +GGGGGAACCGCT Gm10 Glyma10g00200.1 CDS_4 13348 13502 - 0 4 +20 24 33 GAGGATCCTTATCTTCTTCTTGGGTTCTGAAAAAATAAAAACATCAACTCC +CTCCTGTATTCATGCTTGGTGTTGGTTATGTGTCATGTCTGGTCTACTATTGACACATGTAAATCCAAA +TGCACTAGAGCTGCCCTCAAAAAACTCAATTTGGT Gm10 Glyma10g00200.1 CDS_3 13713 13829 - 0 2 +13 15 25 GATTGAGTTAGATGAGCTGGAAAGTGTTTCCATTCTGTCAATGACCCTTGC +ATGGGATGAATTTTCCTTCTCTACTTTTCAAGAAGCCCATTATTCACTTCAAGATTCTCTAGACCA Gm10 Glyma10g00200.1 CDS_2 13915 14209 - 8 11 + 40 59 88 GCTGCTCTTCTCCCTCAGCTGCTCTGTCTCCGGCGTTGACATTGGAGGAG +GGACTTGAGAAACTGAAGGAGGCTCTCCAAATCTTGAATTCTCCTTCCCCTTCTTCCCCTACTGGATTC +CTTAGGTTTCAGGTGGCGCTCCCTCCCAGTCCTAAGACCTTCACTTTGTTTTGCTCCCAACCCCACTCC +TCCTCGGTCTTTCCTCTCATTTATGTTTCCAAGAACGACGCCGACTCTAAATCACTCTATGTCAATTTG +GATGATAATGTGAGTCACCGGAAAGAAAGGTTCTTTTT Gm10 Glyma10g00200.1 CDS_1 14242 14286 - 2 1 +9 12 21 GATCTCGTTCCCCCTCACCACATAACACCCTGTCGTTCCCTTCAC Gm10 Glyma10g00210.1 CDS_7 15480 15722 - 4 7 +26 37 47 GGTACGATCAAACACATTGGGAGCTGTTGGATTCCTTGGGGATAGCAGAAG +AATCAATGTTGCCATCACAAGAGCACGCAAACATTTAGCTTTGGTCTGTGACAGCTCGACTATATGCCA +CAATACCTTCTTAGCAAGGCTTCTGCGTCATATTAGACACTTTGGTAGGGTGAAGCATGCAGAACCAGG +TAGTTTTGGAGGATATGGACTTGGGATGAATCCAATATTACCTTCCATTAATTA Gm10 Glyma10g00210.1 CDS_6 16625 16782 - 1 9 +15 25 32 GGTGTTAGCCCAACAGCTATTGCAGTGCAATCCCCTTATGTTGCTCAAGTA +CAACTTTTGAGGGACAAGCTTGATGAATTTCCAGAAGCAGCAGGTACTGAGGTTGCAACCATTGACAGT +TTTCAAGGTCGGGAAGCTGATGCAGTAATTTTATCCAT Gm10 Glyma10g00210.1 CDS_5 17595 17763 - 2 9 +16 27 34 GCCTACTTGGATAACACAATGCCCGCTGCTATTGCTAGATACTAGAATGCC +ATATGGAAGTCTGTCAGTTGGTTGTGAAGAGCATCTAGACCCGGCTGGAACAGGCTCACTTTATAATGA +AGGAGAAGCTGAGATAGTTTTGCAGCATGTATTTTCCTTAATCTATGCC Gm10 Glyma10g00210.1 CDS_4 18046 19077 - 12 42 + 91 145 173 GCGCAACTGTGAAGCTTTAATGCTGCTTCAGAAGAATGGTTTACGAA +AGAAGAATCCTTCAATTTCTGTTGTTGCTACACTGTTTGGAGATGGGGAAGATGTTGCATGGCTTGAGA +AAAATCATTTGGCTGACTGGGCAGAAGAAAAATTGGATGGAAGATTAGGAAATGAAACCTTTGATGATT +CTCAGTGGAGAGCAATTGCAATGGGTTTGAATAAAAAGAGGCCTGTATTGGTTATCCAAGGCCCTCCTG +GTACAGGCAAGACTGGTTTGCTCAAGCAACTTATAGCATGTGCTGTTCAGCAAGGTGAAAGGGTTCTTG +TTACAGCACCTACTAATGCAGCTGTTGATAACATGGTAGAAAAGCTTTCAAATGTTGGATTAAATATAG +TGCGGGTTGGAAATCCAGCTCGTATATCAAAAACAGTGGGATCAAAGTCTTTGGAAGAAATTGTAAATG +CTAAGCTTGCAAGTTTTCGAGAAGAGTATGAGAGGAAGAAGTCAGATCTAAGAAAAGATCTAAGACATT +GTTTAAGGGATGATTCACTAGCTTCAGGCATACGCCAACTTCTGAAGCAACTGGGAAGGTCACTGAAGA +AAAAGGAAAAGCAGACCGTAATTGAAGTTCTGTCTAGTGCTCAAGTTGTGGTTGCCACTAATACTGGAG +CAGCTGACCCTTTGGTTCGAAGGCTAGATACCTTTGATTTGGTTGTCATAGATGAAGCGGGACAGGCAA +TTGAACCCTCTTGCTGGATTCCTATATTGCAGGGAAAGCGCTGCATTCTTGCTGGTGATCAATGCCAAC +TTGCTCCTGTCATATTATCTAGAAAGGCCTTAGAAGTTGGTCTAGGAATATCTCTACTGGAGAGAGCTG +CAACTTTGCATGAAGGGATTCTCACCACTAGGTTAACAACACAATACCGTATGAATGATGCAATTGCTA +GTTGGGCTTCAAAGGAGATGTACGGAGGATTATTGAAGTCCTCTGAGACTGTCTTTTCTCATCTTCTAG +TAGACTCTCCTTTTGTTAA Gm10 Glyma10g00210.1 CDS_3 21109 21392 - 4 8 +30 42 52 GGACTTGGGGGAATGCATTTGGTGTTATTCAAGGTTGAAGGGAACCACCGG +TTACCACCAACCACTCTTTCACCTGGAGACATGGTTTGTGTGAGAACATATGACAGCATGGGTGCAATT +ACAACTTCTTGCATACAAGGATTTGTGAACAGTTTTGGGGATGATGGCTATAGTATTACAGTTGCTTTA +GAGTCACGCCATGGTGATCCTACATTCTCTAAACTATTTGGGAAGAGTGTGCGCATTGACCGTATTCAA +GGACTGGCTGATACACTTACTTATGA Gm10 Glyma10g00210.1 CDS_2 24354 24662 - 9 13 + 32 54 66 GCTCAGCATAGGGCGGTTGTGAGAAAGATAACGCAACCCAAGAGTGTCCA +AGGCGTTTTAGGAATGGACTTCGAAAAGGTCAAAGCATTACAGCACAGGATTGACGAATTCACCACCCA +TATGTCAGAACTACTCCGTATTGAAAGGGATGCTGAGTTGGAGTTTACTCAGGAGGAATTGGATGCTGT +TCCTAAACCAGATGATACTTCTGATTCTTCAAAAACGATTGATTTCTTGGTTAGCCATAGCCAGCCTCA +ACAAGAACTCTGCGACACCATTTGTAATTTAAACGCTATCAGTACCTCTACA

      There are a total of 11 tab separated values in each line 1.) Chromosome (in this case Gm10) 2.) gene id(or Gene) 3.) Feature( which could be of many types CDS_\d or 5'UTR_\d or 3'UTR_\d e.t.c 4.) Start position of the gene 5.) End position 6.) Strand type + or - 7 to 11 ) different counts of different types 12) the Sequence.

      Thanks graff for doing all the work. I'll try to see how your code works. But my other issue is I have 80 such files to compare with Methylation.gtf. I hope it can be done more efficiently. Thanks btw for all help appreciate it

      Hi Graff I was going over your code and I think there are few issues in the begining that you might suggest a way out of.

      $methrange{$bgn}{$end} = $_;

      Here you are using begin and end as keys of the HOH %methrange but there are some CDS in the file which can have same start and end but belong to different gene ids how can we account for such anomalies using your code. Also I tried running your code for about half an hour it did not produce any result thus efficiency is still an issue here. What'd you guys say about trying DBDsqlite or mysql for this task would the overhead be more or less??

        I tried running your code for about half an hour it did not produce any result thus efficiency is still an issue here.

        Given the larger sample of data that you gave for the first file, it's clear that the code in my initial reply wasn't working as intended. Sorry about that. I'll give it one more try and let you know how that comes out, but in the meantime...

        What'd you guys say about trying DBDsqlite or mysql for this task would the overhead be more or less??

        Actually, when you've got a clear layout of the logical structure of the inputs, the decision process and the desired outputs, it's reasonably likely that a relational-table/SQL-based solution will help a lot. It has the same prerequisites as writing a perl script in order to get the right answers: having the right way to describe the task. Once you have that, it'll probably be a lot easier to write a suitable SQL statement to express the conditions and accomplish the operations that need to be done. If nothing else, just being able to deal with relevant subsets of the data at a time, rather than having to slog through one huge, monolithic stream, would be a win.

        I don't know enough about SQLite to say how good it would be with optimizing queries, but Mysql is easy to set up, easy to use, and quite effective for very large data sets (and it's very well documented). So long as you make sure to apply indexing on the table fields that get used a lot in queries, you should see pretty zippy results. You get a lot of built-in efficiency for free.

        ... you are using begin and end as keys of the HOH %methrange but there are some CDS in the file which can have same start and end but belong to different gene ids how can we account for such anomalies using your code.

        Here's a version that will keep track of all possible matches, even when two entries from the first file happen to have identical "Start - End" ranges. I tested that by adding a "dummy" entry to your larger data sample for the "Methylation.gtf" file, copying one of the lines that would match, and changing the "Gene" field to make it distinguishable.

        Below are the updated code, the two input files, and the resulting output (file name "res_11268_10.txt" as per the OP code). If I remove my "test" line from the gtf file, I get just two lines of output instead of three. (I also ran the OP code on the same data, and got the same result, once I handled the initial spaces in the gtf data.)

        I'd be curious how the timing looks on a larger data set (but I expect a proper mysql solution would run faster).

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://962693]
Approved by sundialsvc4
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (13)
As of 2014-08-20 20:00 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (123 votes), past polls