http://www.perlmonks.org?node_id=962807


in reply to Re^2: Comparing and getting information from two large files and appending it in a new file
in thread Comparing and getting information from two large files and appending it in a new file

... you are using begin and end as keys of the HOH %methrange but there are some CDS in the file which can have same start and end but belong to different gene ids how can we account for such anomalies using your code.

Here's a version that will keep track of all possible matches, even when two entries from the first file happen to have identical "Start - End" ranges. I tested that by adding a "dummy" entry to your larger data sample for the "Methylation.gtf" file, copying one of the lines that would match, and changing the "Gene" field to make it distinguishable.

Below are the updated code, the two input files, and the resulting output (file name "res_11268_10.txt" as per the OP code). If I remove my "test" line from the gtf file, I get just two lines of output instead of three. (I also ran the OP code on the same data, and got the same result, once I handled the initial spaces in the gtf data.)

I'd be curious how the timing looks on a larger data set (but I expect a proper mysql solution would run faster).

#!/usr/bin/perl use strict; use Benchmark; use Data::Dumper 'Dumper'; my %methrange; my $t0 = Benchmark->new; open(IN1,"Methylation.gtf"); while (<IN1>) { chomp; if ( /^\s*Gm10/ ) { my ( $bgn, $end, $methcount ) = (split)[3,4,10]; $methrange{$bgn}{$end}{$_} = $methcount; } } my @sorted_meth_starts = sort {$a<=>$b} keys %methrange; my %methhash; open(INPUT, "mc_11268_10.txt"); while(<INPUT>) { next unless /^(\d+)\t(\d+)\t\S\t(\w+)/; # skip header line and an +y blank lines chomp; my ( $meth, $position, $class ) = ( "Gm$1", $2, $3 ); my $indx = $#sorted_meth_starts; while ( $indx >= 0 ) { if ( $position >= $sorted_meth_starts[$indx] ) { my $bgn = $sorted_meth_starts[$indx]; for my $end ( keys %{$methrange{$bgn}} ) { if ( $position <= $end ) { for my $match ( keys %{$methrange{$bgn}{$end}} ) { $methhash{$match}{methcount} = $methrange{$bgn +}{$end}{$match}; $methhash{$match}{$class}++; } } } last; } $indx--; } } open(FILEIT,">res_11268_10.txt"); print FILEIT join("\t",qw(Chromosome Gene Feature Start END CG_count_b +y_C CHG_count_by_C CHH_count_by_C)),"\n"; for my $ge (sort keys %methhash) { my ( $chh, $chg, $cg, $total_count ); if ( $methhash{$ge}{methcount} == 0 ) { $chh = $chg = $cg = "INF"; } else { my $denom = $methhash{$ge}{methcount}; $chh = sprintf( "%.3f", $methhash{$ge}{CHH} / $denom ); $chg = sprintf( "%.3f", $methhash{$ge}{CHG} / $denom ); $cg = sprintf( "%.3f", $methhash{$ge}{CG} / $denom ); $total_count = sprintf( "%.3f", ( $methhash{$ge}{CHH}+$methhas +h{$ge}{CHG}+$methhash{$ge}{CG}) / $denom ); } print FILEIT join( "\t", ( split /\t/, $ge )[0..4], $cg, $chg, $ch +h ),"\n"; } my $t1 = Benchmark->new; my $td = timediff($t1, $t0); print "the code took:",timestr($td),"\n";
Here's my modified version of the gtf input (don't worry about the wrapped lines; if you click the "download" link at the end of the listing, you'll get the "original" version without line wrapping). Beware that the tabs may have been converted to spaces; also, the lines all start with a space -- the code posted above "does the right thing" with this file as-is, and will only fail on your own data if you have spaces embedded in some of your tab-delimited fields.
Gm10 Glyma10g00200.1 CDS_11 8569 8705 - 2 2 + 6 10 14 GAAGAACCAAAATCGCTGATTTTGAAAAATAGGGGGACTGAAT +GTCTTGATTTGGAAATAGGGGGACCAAAATTGCGGATTGAGCATTATAGGGGACCAATAGTGTATTTTA +GCCTTTATTTTATTTTGAATGTGTG Gm10 Glyma10g0test.1 CDS_11 8569 8705 - 2 2 + 6 10 14 GAAGAACCAAAATCGCTGATTTTGAAAAATAGGGGGACTGAAT +GTCTTGATTTGGAAATAGGGGGACCAAAATTGCGGATTGAGCATTATAGGGGACCAATAGTGTATTTTA +GCCTTTATTTTATTTTGAATGTGTG Gm10 Glyma10g00200.1 CDS_10 8822 8822 - 0 0 + 0 0 0 G Gm10 Glyma10g00200.1 CDS_9 8885 9079 - 2 13 + 23 38 50 GTATTTGTGTATTGCTCCTGGATCTAGACCTTCCCCTGTTGCT +GCTGCTGCTGCAAGTCACAAATTAATCACATGTATTTCATGCTTCGATGAGCGTTCACTTGCATATCAT +GCTGTTGGATATGGAAGAGGATCTCACATTCCAGCAGTAGCCATTACATCATCAGGCACTGCAGTTTCC +AACCTTCTTCCTGC Gm10 Glyma10g00200.1 CDS_8 9151 9255 - 2 2 + 11 15 19 GTTGGATCATGCCAATGAATTATCAAACTCCCTGAAAGAAAGT +GCCAATATTAACACTGTTCGGGCATCGCTTATTGTTGAAGAATGCACAAGACTTGGTTTGAT Gm10 Glyma10g00200.1 CDS_7 11534 11698 - 1 7 + 16 24 33 GGAATATACTGATCCTCACAAAGGTTCTCTTAACCTTGCTGTG +AAGTTGCCTAATGGTTCTCTAAAGCCAGACCTGGGGCCAAAAACATATATTGCTTATGGATTTCTTCAG +GAGCTTGGACGTGGTGATTCAGTGACTAAGCTCCATTGTGATATGTCTGATGC Gm10 Glyma10g00200.1 CDS_6 12777 12823 - 0 0 + 8 8 11 GCTTTTTACACAAATTTCATTCTTCACCTTCATTCTCATGAAG +ATTA Gm10 Glyma10g00200.1 CDS_5 12958 13022 - 1 4 + 4 9 10 GCTGTAGATCTTCAGTACAAGTATTTAAGGCATTTTCAGTGGC +ATTGGGAAAAGGGGGAACCGCT Gm10 Glyma10g00200.1 CDS_4 13348 13502 - 0 4 + 20 24 33 GAGGATCCTTATCTTCTTCTTGGGTTCTGAAAAAATAAAAACA +TCAACTCCCTCCTGTATTCATGCTTGGTGTTGGTTATGTGTCATGTCTGGTCTACTATTGACACATGTA +AATCCAAATGCACTAGAGCTGCCCTCAAAAAACTCAATTTGGT Gm10 Glyma10g00200.1 CDS_3 13713 13829 - 0 2 + 13 15 25 GATTGAGTTAGATGAGCTGGAAAGTGTTTCCATTCTGTCAATG +ACCCTTGCATGGGATGAATTTTCCTTCTCTACTTTTCAAGAAGCCCATTATTCACTTCAAGATTCTCTA +GACCA Gm10 Glyma10g00200.1 CDS_2 13915 14209 - 8 11 + 40 59 88 GCTGCTCTTCTCCCTCAGCTGCTCTGTCTCCGGCGTTGACATT +GGAGGAGGGACTTGAGAAACTGAAGGAGGCTCTCCAAATCTTGAATTCTCCTTCCCCTTCTTCCCCTAC +TGGATTCCTTAGGTTTCAGGTGGCGCTCCCTCCCAGTCCTAAGACCTTCACTTTGTTTTGCTCCCAACC +CCACTCCTCCTCGGTCTTTCCTCTCATTTATGTTTCCAAGAACGACGCCGACTCTAAATCACTCTATGT +CAATTTGGATGATAATGTGAGTCACCGGAAAGAAAGGTTCTTTTT Gm10 Glyma10g00200.1 CDS_1 14242 14286 - 2 1 + 9 12 21 GATCTCGTTCCCCCTCACCACATAACACCCTGTCGTTCCCTTC +AC Gm10 Glyma10g00210.1 CDS_7 15480 15722 - 4 7 + 26 37 47 GGTACGATCAAACACATTGGGAGCTGTTGGATTCCTTGGGGAT +AGCAGAAGAATCAATGTTGCCATCACAAGAGCACGCAAACATTTAGCTTTGGTCTGTGACAGCTCGACT +ATATGCCACAATACCTTCTTAGCAAGGCTTCTGCGTCATATTAGACACTTTGGTAGGGTGAAGCATGCA +GAACCAGGTAGTTTTGGAGGATATGGACTTGGGATGAATCCAATATTACCTTCCATTAATTA Gm10 Glyma10g00210.1 CDS_6 16625 16782 - 1 9 + 15 25 32 GGTGTTAGCCCAACAGCTATTGCAGTGCAATCCCCTTATGTTG +CTCAAGTACAACTTTTGAGGGACAAGCTTGATGAATTTCCAGAAGCAGCAGGTACTGAGGTTGCAACCA +TTGACAGTTTTCAAGGTCGGGAAGCTGATGCAGTAATTTTATCCAT Gm10 Glyma10g00210.1 CDS_5 17595 17763 - 2 9 + 16 27 34 GCCTACTTGGATAACACAATGCCCGCTGCTATTGCTAGATACT +AGAATGCCATATGGAAGTCTGTCAGTTGGTTGTGAAGAGCATCTAGACCCGGCTGGAACAGGCTCACTT +TATAATGAAGGAGAAGCTGAGATAGTTTTGCAGCATGTATTTTCCTTAATCTATGCC Gm10 Glyma10g00210.1 CDS_4 18046 19077 - 12 42 + 91 145 173 GCGCAACTGTGAAGCTTTAATGCTGCTTCAGAAGAATGGTTTA +CGAAAGAAGAATCCTTCAATTTCTGTTGTTGCTACACTGTTTGGAGATGGGGAAGATGTTGCATGGCTT +GAGAAAAATCATTTGGCTGACTGGGCAGAAGAAAAATTGGATGGAAGATTAGGAAATGAAACCTTTGAT +GATTCTCAGTGGAGAGCAATTGCAATGGGTTTGAATAAAAAGAGGCCTGTATTGGTTATCCAAGGCCCT +CCTGGTACAGGCAAGACTGGTTTGCTCAAGCAACTTATAGCATGTGCTGTTCAGCAAGGTGAAAGGGTT +CTTGTTACAGCACCTACTAATGCAGCTGTTGATAACATGGTAGAAAAGCTTTCAAATGTTGGATTAAAT +ATAGTGCGGGTTGGAAATCCAGCTCGTATATCAAAAACAGTGGGATCAAAGTCTTTGGAAGAAATTGTA +AATGCTAAGCTTGCAAGTTTTCGAGAAGAGTATGAGAGGAAGAAGTCAGATCTAAGAAAAGATCTAAGA +CATTGTTTAAGGGATGATTCACTAGCTTCAGGCATACGCCAACTTCTGAAGCAACTGGGAAGGTCACTG +AAGAAAAAGGAAAAGCAGACCGTAATTGAAGTTCTGTCTAGTGCTCAAGTTGTGGTTGCCACTAATACT +GGAGCAGCTGACCCTTTGGTTCGAAGGCTAGATACCTTTGATTTGGTTGTCATAGATGAAGCGGGACAG +GCAATTGAACCCTCTTGCTGGATTCCTATATTGCAGGGAAAGCGCTGCATTCTTGCTGGTGATCAATGC +CAACTTGCTCCTGTCATATTATCTAGAAAGGCCTTAGAAGTTGGTCTAGGAATATCTCTACTGGAGAGA +GCTGCAACTTTGCATGAAGGGATTCTCACCACTAGGTTAACAACACAATACCGTATGAATGATGCAATT +GCTAGTTGGGCTTCAAAGGAGATGTACGGAGGATTATTGAAGTCCTCTGAGACTGTCTTTTCTCATCTT +CTAGTAGACTCTCCTTTTGTTAA Gm10 Glyma10g00210.1 CDS_3 21109 21392 - 4 8 + 30 42 52 GGACTTGGGGGAATGCATTTGGTGTTATTCAAGGTTGAAGGGA +ACCACCGGTTACCACCAACCACTCTTTCACCTGGAGACATGGTTTGTGTGAGAACATATGACAGCATGG +GTGCAATTACAACTTCTTGCATACAAGGATTTGTGAACAGTTTTGGGGATGATGGCTATAGTATTACAG +TTGCTTTAGAGTCACGCCATGGTGATCCTACATTCTCTAAACTATTTGGGAAGAGTGTGCGCATTGACC +GTATTCAAGGACTGGCTGATACACTTACTTATGA Gm10 Glyma10g00210.1 CDS_2 24354 24662 - 9 13 + 32 54 66 GCTCAGCATAGGGCGGTTGTGAGAAAGATAACGCAACCCAAGA +GTGTCCAAGGCGTTTTAGGAATGGACTTCGAAAAGGTCAAAGCATTACAGCACAGGATTGACGAATTCA +CCACCCATATGTCAGAACTACTCCGTATTGAAAGGGATGCTGAGTTGGAGTTTACTCAGGAGGAATTGG +ATGCTGTTCCTAAACCAGATGATACTTCTGATTCTTCAAAAACGATTGATTTCTTGGTTAGCCATAGCC +AGCCTCAACAAGAACTCTGCGACACCATTTGTAATTTAAACGCTATCAGTACCTCTACA
Here's the second input file, as given in the OP:
assembly position strand class mc h 10 8691 + CG 12 12 10 8692 - CG 3 3 10 8705 + CHG 7 18 10 8707 - CHG 2 4 10 8717 + CG 25 25 10 8718 - CG 6 6 10 8984 + CG 7 7 10 8991 + CG 6 9
And finally, here's the output based on all the above:
Chromosome Gene Feature Start END CG_count_by_C CHG_co +unt_by_C CHH_count_by_C Gm10 Glyma10g00200.1 CDS_11 8569 8705 0.143 0.071 0.000 Gm10 Glyma10g00200.1 CDS_9 8885 9079 0.040 0.000 0.000 Gm10 Glyma10g0test.1 CDS_11 8569 8705 0.143 0.071 0.000