Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
... you are using begin and end as keys of the HOH %methrange but there are some CDS in the file which can have same start and end but belong to different gene ids how can we account for such anomalies using your code.

Here's a version that will keep track of all possible matches, even when two entries from the first file happen to have identical "Start - End" ranges. I tested that by adding a "dummy" entry to your larger data sample for the "Methylation.gtf" file, copying one of the lines that would match, and changing the "Gene" field to make it distinguishable.

Below are the updated code, the two input files, and the resulting output (file name "res_11268_10.txt" as per the OP code). If I remove my "test" line from the gtf file, I get just two lines of output instead of three. (I also ran the OP code on the same data, and got the same result, once I handled the initial spaces in the gtf data.)

I'd be curious how the timing looks on a larger data set (but I expect a proper mysql solution would run faster).

#!/usr/bin/perl use strict; use Benchmark; use Data::Dumper 'Dumper'; my %methrange; my $t0 = Benchmark->new; open(IN1,"Methylation.gtf"); while (<IN1>) { chomp; if ( /^\s*Gm10/ ) { my ( $bgn, $end, $methcount ) = (split)[3,4,10]; $methrange{$bgn}{$end}{$_} = $methcount; } } my @sorted_meth_starts = sort {$a<=>$b} keys %methrange; my %methhash; open(INPUT, "mc_11268_10.txt"); while(<INPUT>) { next unless /^(\d+)\t(\d+)\t\S\t(\w+)/; # skip header line and an +y blank lines chomp; my ( $meth, $position, $class ) = ( "Gm$1", $2, $3 ); my $indx = $#sorted_meth_starts; while ( $indx >= 0 ) { if ( $position >= $sorted_meth_starts[$indx] ) { my $bgn = $sorted_meth_starts[$indx]; for my $end ( keys %{$methrange{$bgn}} ) { if ( $position <= $end ) { for my $match ( keys %{$methrange{$bgn}{$end}} ) { $methhash{$match}{methcount} = $methrange{$bgn +}{$end}{$match}; $methhash{$match}{$class}++; } } } last; } $indx--; } } open(FILEIT,">res_11268_10.txt"); print FILEIT join("\t",qw(Chromosome Gene Feature Start END CG_count_b +y_C CHG_count_by_C CHH_count_by_C)),"\n"; for my $ge (sort keys %methhash) { my ( $chh, $chg, $cg, $total_count ); if ( $methhash{$ge}{methcount} == 0 ) { $chh = $chg = $cg = "INF"; } else { my $denom = $methhash{$ge}{methcount}; $chh = sprintf( "%.3f", $methhash{$ge}{CHH} / $denom ); $chg = sprintf( "%.3f", $methhash{$ge}{CHG} / $denom ); $cg = sprintf( "%.3f", $methhash{$ge}{CG} / $denom ); $total_count = sprintf( "%.3f", ( $methhash{$ge}{CHH}+$methhas +h{$ge}{CHG}+$methhash{$ge}{CG}) / $denom ); } print FILEIT join( "\t", ( split /\t/, $ge )[0..4], $cg, $chg, $ch +h ),"\n"; } my $t1 = Benchmark->new; my $td = timediff($t1, $t0); print "the code took:",timestr($td),"\n";
Here's my modified version of the gtf input (don't worry about the wrapped lines; if you click the "download" link at the end of the listing, you'll get the "original" version without line wrapping). Beware that the tabs may have been converted to spaces; also, the lines all start with a space -- the code posted above "does the right thing" with this file as-is, and will only fail on your own data if you have spaces embedded in some of your tab-delimited fields.
Gm10 Glyma10g00200.1 CDS_11 8569 8705 - 2 2 + 6 10 14 GAAGAACCAAAATCGCTGATTTTGAAAAATAGGGGGACTGAAT +GTCTTGATTTGGAAATAGGGGGACCAAAATTGCGGATTGAGCATTATAGGGGACCAATAGTGTATTTTA +GCCTTTATTTTATTTTGAATGTGTG Gm10 Glyma10g0test.1 CDS_11 8569 8705 - 2 2 + 6 10 14 GAAGAACCAAAATCGCTGATTTTGAAAAATAGGGGGACTGAAT +GTCTTGATTTGGAAATAGGGGGACCAAAATTGCGGATTGAGCATTATAGGGGACCAATAGTGTATTTTA +GCCTTTATTTTATTTTGAATGTGTG Gm10 Glyma10g00200.1 CDS_10 8822 8822 - 0 0 + 0 0 0 G Gm10 Glyma10g00200.1 CDS_9 8885 9079 - 2 13 + 23 38 50 GTATTTGTGTATTGCTCCTGGATCTAGACCTTCCCCTGTTGCT +GCTGCTGCTGCAAGTCACAAATTAATCACATGTATTTCATGCTTCGATGAGCGTTCACTTGCATATCAT +GCTGTTGGATATGGAAGAGGATCTCACATTCCAGCAGTAGCCATTACATCATCAGGCACTGCAGTTTCC +AACCTTCTTCCTGC Gm10 Glyma10g00200.1 CDS_8 9151 9255 - 2 2 + 11 15 19 GTTGGATCATGCCAATGAATTATCAAACTCCCTGAAAGAAAGT +GCCAATATTAACACTGTTCGGGCATCGCTTATTGTTGAAGAATGCACAAGACTTGGTTTGAT Gm10 Glyma10g00200.1 CDS_7 11534 11698 - 1 7 + 16 24 33 GGAATATACTGATCCTCACAAAGGTTCTCTTAACCTTGCTGTG +AAGTTGCCTAATGGTTCTCTAAAGCCAGACCTGGGGCCAAAAACATATATTGCTTATGGATTTCTTCAG +GAGCTTGGACGTGGTGATTCAGTGACTAAGCTCCATTGTGATATGTCTGATGC Gm10 Glyma10g00200.1 CDS_6 12777 12823 - 0 0 + 8 8 11 GCTTTTTACACAAATTTCATTCTTCACCTTCATTCTCATGAAG +ATTA Gm10 Glyma10g00200.1 CDS_5 12958 13022 - 1 4 + 4 9 10 GCTGTAGATCTTCAGTACAAGTATTTAAGGCATTTTCAGTGGC +ATTGGGAAAAGGGGGAACCGCT Gm10 Glyma10g00200.1 CDS_4 13348 13502 - 0 4 + 20 24 33 GAGGATCCTTATCTTCTTCTTGGGTTCTGAAAAAATAAAAACA +TCAACTCCCTCCTGTATTCATGCTTGGTGTTGGTTATGTGTCATGTCTGGTCTACTATTGACACATGTA +AATCCAAATGCACTAGAGCTGCCCTCAAAAAACTCAATTTGGT Gm10 Glyma10g00200.1 CDS_3 13713 13829 - 0 2 + 13 15 25 GATTGAGTTAGATGAGCTGGAAAGTGTTTCCATTCTGTCAATG +ACCCTTGCATGGGATGAATTTTCCTTCTCTACTTTTCAAGAAGCCCATTATTCACTTCAAGATTCTCTA +GACCA Gm10 Glyma10g00200.1 CDS_2 13915 14209 - 8 11 + 40 59 88 GCTGCTCTTCTCCCTCAGCTGCTCTGTCTCCGGCGTTGACATT +GGAGGAGGGACTTGAGAAACTGAAGGAGGCTCTCCAAATCTTGAATTCTCCTTCCCCTTCTTCCCCTAC +TGGATTCCTTAGGTTTCAGGTGGCGCTCCCTCCCAGTCCTAAGACCTTCACTTTGTTTTGCTCCCAACC +CCACTCCTCCTCGGTCTTTCCTCTCATTTATGTTTCCAAGAACGACGCCGACTCTAAATCACTCTATGT +CAATTTGGATGATAATGTGAGTCACCGGAAAGAAAGGTTCTTTTT Gm10 Glyma10g00200.1 CDS_1 14242 14286 - 2 1 + 9 12 21 GATCTCGTTCCCCCTCACCACATAACACCCTGTCGTTCCCTTC +AC Gm10 Glyma10g00210.1 CDS_7 15480 15722 - 4 7 + 26 37 47 GGTACGATCAAACACATTGGGAGCTGTTGGATTCCTTGGGGAT +AGCAGAAGAATCAATGTTGCCATCACAAGAGCACGCAAACATTTAGCTTTGGTCTGTGACAGCTCGACT +ATATGCCACAATACCTTCTTAGCAAGGCTTCTGCGTCATATTAGACACTTTGGTAGGGTGAAGCATGCA +GAACCAGGTAGTTTTGGAGGATATGGACTTGGGATGAATCCAATATTACCTTCCATTAATTA Gm10 Glyma10g00210.1 CDS_6 16625 16782 - 1 9 + 15 25 32 GGTGTTAGCCCAACAGCTATTGCAGTGCAATCCCCTTATGTTG +CTCAAGTACAACTTTTGAGGGACAAGCTTGATGAATTTCCAGAAGCAGCAGGTACTGAGGTTGCAACCA +TTGACAGTTTTCAAGGTCGGGAAGCTGATGCAGTAATTTTATCCAT Gm10 Glyma10g00210.1 CDS_5 17595 17763 - 2 9 + 16 27 34 GCCTACTTGGATAACACAATGCCCGCTGCTATTGCTAGATACT +AGAATGCCATATGGAAGTCTGTCAGTTGGTTGTGAAGAGCATCTAGACCCGGCTGGAACAGGCTCACTT +TATAATGAAGGAGAAGCTGAGATAGTTTTGCAGCATGTATTTTCCTTAATCTATGCC Gm10 Glyma10g00210.1 CDS_4 18046 19077 - 12 42 + 91 145 173 GCGCAACTGTGAAGCTTTAATGCTGCTTCAGAAGAATGGTTTA +CGAAAGAAGAATCCTTCAATTTCTGTTGTTGCTACACTGTTTGGAGATGGGGAAGATGTTGCATGGCTT +GAGAAAAATCATTTGGCTGACTGGGCAGAAGAAAAATTGGATGGAAGATTAGGAAATGAAACCTTTGAT +GATTCTCAGTGGAGAGCAATTGCAATGGGTTTGAATAAAAAGAGGCCTGTATTGGTTATCCAAGGCCCT +CCTGGTACAGGCAAGACTGGTTTGCTCAAGCAACTTATAGCATGTGCTGTTCAGCAAGGTGAAAGGGTT +CTTGTTACAGCACCTACTAATGCAGCTGTTGATAACATGGTAGAAAAGCTTTCAAATGTTGGATTAAAT +ATAGTGCGGGTTGGAAATCCAGCTCGTATATCAAAAACAGTGGGATCAAAGTCTTTGGAAGAAATTGTA +AATGCTAAGCTTGCAAGTTTTCGAGAAGAGTATGAGAGGAAGAAGTCAGATCTAAGAAAAGATCTAAGA +CATTGTTTAAGGGATGATTCACTAGCTTCAGGCATACGCCAACTTCTGAAGCAACTGGGAAGGTCACTG +AAGAAAAAGGAAAAGCAGACCGTAATTGAAGTTCTGTCTAGTGCTCAAGTTGTGGTTGCCACTAATACT +GGAGCAGCTGACCCTTTGGTTCGAAGGCTAGATACCTTTGATTTGGTTGTCATAGATGAAGCGGGACAG +GCAATTGAACCCTCTTGCTGGATTCCTATATTGCAGGGAAAGCGCTGCATTCTTGCTGGTGATCAATGC +CAACTTGCTCCTGTCATATTATCTAGAAAGGCCTTAGAAGTTGGTCTAGGAATATCTCTACTGGAGAGA +GCTGCAACTTTGCATGAAGGGATTCTCACCACTAGGTTAACAACACAATACCGTATGAATGATGCAATT +GCTAGTTGGGCTTCAAAGGAGATGTACGGAGGATTATTGAAGTCCTCTGAGACTGTCTTTTCTCATCTT +CTAGTAGACTCTCCTTTTGTTAA Gm10 Glyma10g00210.1 CDS_3 21109 21392 - 4 8 + 30 42 52 GGACTTGGGGGAATGCATTTGGTGTTATTCAAGGTTGAAGGGA +ACCACCGGTTACCACCAACCACTCTTTCACCTGGAGACATGGTTTGTGTGAGAACATATGACAGCATGG +GTGCAATTACAACTTCTTGCATACAAGGATTTGTGAACAGTTTTGGGGATGATGGCTATAGTATTACAG +TTGCTTTAGAGTCACGCCATGGTGATCCTACATTCTCTAAACTATTTGGGAAGAGTGTGCGCATTGACC +GTATTCAAGGACTGGCTGATACACTTACTTATGA Gm10 Glyma10g00210.1 CDS_2 24354 24662 - 9 13 + 32 54 66 GCTCAGCATAGGGCGGTTGTGAGAAAGATAACGCAACCCAAGA +GTGTCCAAGGCGTTTTAGGAATGGACTTCGAAAAGGTCAAAGCATTACAGCACAGGATTGACGAATTCA +CCACCCATATGTCAGAACTACTCCGTATTGAAAGGGATGCTGAGTTGGAGTTTACTCAGGAGGAATTGG +ATGCTGTTCCTAAACCAGATGATACTTCTGATTCTTCAAAAACGATTGATTTCTTGGTTAGCCATAGCC +AGCCTCAACAAGAACTCTGCGACACCATTTGTAATTTAAACGCTATCAGTACCTCTACA
Here's the second input file, as given in the OP:
assembly position strand class mc h 10 8691 + CG 12 12 10 8692 - CG 3 3 10 8705 + CHG 7 18 10 8707 - CHG 2 4 10 8717 + CG 25 25 10 8718 - CG 6 6 10 8984 + CG 7 7 10 8991 + CG 6 9
And finally, here's the output based on all the above:
Chromosome Gene Feature Start END CG_count_by_C CHG_co +unt_by_C CHH_count_by_C Gm10 Glyma10g00200.1 CDS_11 8569 8705 0.143 0.071 0.000 Gm10 Glyma10g00200.1 CDS_9 8885 9079 0.040 0.000 0.000 Gm10 Glyma10g0test.1 CDS_11 8569 8705 0.143 0.071 0.000

In reply to Re^3: Comparing and getting information from two large files and appending it in a new file by graff
in thread Comparing and getting information from two large files and appending it in a new file by perlkhan77

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (4)
As of 2024-04-24 22:03 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found