Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Re^3: Comparing and getting information from two large files and appending it in a new file

by graff (Chancellor)
on Mar 31, 2012 at 23:21 UTC ( #962807=note: print w/ replies, xml ) Need Help??


in reply to Re^2: Comparing and getting information from two large files and appending it in a new file
in thread Comparing and getting information from two large files and appending it in a new file

... you are using begin and end as keys of the HOH %methrange but there are some CDS in the file which can have same start and end but belong to different gene ids how can we account for such anomalies using your code.

Here's a version that will keep track of all possible matches, even when two entries from the first file happen to have identical "Start - End" ranges. I tested that by adding a "dummy" entry to your larger data sample for the "Methylation.gtf" file, copying one of the lines that would match, and changing the "Gene" field to make it distinguishable.

Below are the updated code, the two input files, and the resulting output (file name "res_11268_10.txt" as per the OP code). If I remove my "test" line from the gtf file, I get just two lines of output instead of three. (I also ran the OP code on the same data, and got the same result, once I handled the initial spaces in the gtf data.)

I'd be curious how the timing looks on a larger data set (but I expect a proper mysql solution would run faster).

#!/usr/bin/perl use strict; use Benchmark; use Data::Dumper 'Dumper'; my %methrange; my $t0 = Benchmark->new; open(IN1,"Methylation.gtf"); while (<IN1>) { chomp; if ( /^\s*Gm10/ ) { my ( $bgn, $end, $methcount ) = (split)[3,4,10]; $methrange{$bgn}{$end}{$_} = $methcount; } } my @sorted_meth_starts = sort {$a<=>$b} keys %methrange; my %methhash; open(INPUT, "mc_11268_10.txt"); while(<INPUT>) { next unless /^(\d+)\t(\d+)\t\S\t(\w+)/; # skip header line and an +y blank lines chomp; my ( $meth, $position, $class ) = ( "Gm$1", $2, $3 ); my $indx = $#sorted_meth_starts; while ( $indx >= 0 ) { if ( $position >= $sorted_meth_starts[$indx] ) { my $bgn = $sorted_meth_starts[$indx]; for my $end ( keys %{$methrange{$bgn}} ) { if ( $position <= $end ) { for my $match ( keys %{$methrange{$bgn}{$end}} ) { $methhash{$match}{methcount} = $methrange{$bgn +}{$end}{$match}; $methhash{$match}{$class}++; } } } last; } $indx--; } } open(FILEIT,">res_11268_10.txt"); print FILEIT join("\t",qw(Chromosome Gene Feature Start END CG_count_b +y_C CHG_count_by_C CHH_count_by_C)),"\n"; for my $ge (sort keys %methhash) { my ( $chh, $chg, $cg, $total_count ); if ( $methhash{$ge}{methcount} == 0 ) { $chh = $chg = $cg = "INF"; } else { my $denom = $methhash{$ge}{methcount}; $chh = sprintf( "%.3f", $methhash{$ge}{CHH} / $denom ); $chg = sprintf( "%.3f", $methhash{$ge}{CHG} / $denom ); $cg = sprintf( "%.3f", $methhash{$ge}{CG} / $denom ); $total_count = sprintf( "%.3f", ( $methhash{$ge}{CHH}+$methhas +h{$ge}{CHG}+$methhash{$ge}{CG}) / $denom ); } print FILEIT join( "\t", ( split /\t/, $ge )[0..4], $cg, $chg, $ch +h ),"\n"; } my $t1 = Benchmark->new; my $td = timediff($t1, $t0); print "the code took:",timestr($td),"\n";
Here's my modified version of the gtf input (don't worry about the wrapped lines; if you click the "download" link at the end of the listing, you'll get the "original" version without line wrapping). Beware that the tabs may have been converted to spaces; also, the lines all start with a space -- the code posted above "does the right thing" with this file as-is, and will only fail on your own data if you have spaces embedded in some of your tab-delimited fields.
Gm10 Glyma10g00200.1 CDS_11 8569 8705 - 2 2 + 6 10 14 GAAGAACCAAAATCGCTGATTTTGAAAAATAGGGGGACTGAAT +GTCTTGATTTGGAAATAGGGGGACCAAAATTGCGGATTGAGCATTATAGGGGACCAATAGTGTATTTTA +GCCTTTATTTTATTTTGAATGTGTG Gm10 Glyma10g0test.1 CDS_11 8569 8705 - 2 2 + 6 10 14 GAAGAACCAAAATCGCTGATTTTGAAAAATAGGGGGACTGAAT +GTCTTGATTTGGAAATAGGGGGACCAAAATTGCGGATTGAGCATTATAGGGGACCAATAGTGTATTTTA +GCCTTTATTTTATTTTGAATGTGTG Gm10 Glyma10g00200.1 CDS_10 8822 8822 - 0 0 + 0 0 0 G Gm10 Glyma10g00200.1 CDS_9 8885 9079 - 2 13 + 23 38 50 GTATTTGTGTATTGCTCCTGGATCTAGACCTTCCCCTGTTGCT +GCTGCTGCTGCAAGTCACAAATTAATCACATGTATTTCATGCTTCGATGAGCGTTCACTTGCATATCAT +GCTGTTGGATATGGAAGAGGATCTCACATTCCAGCAGTAGCCATTACATCATCAGGCACTGCAGTTTCC +AACCTTCTTCCTGC Gm10 Glyma10g00200.1 CDS_8 9151 9255 - 2 2 + 11 15 19 GTTGGATCATGCCAATGAATTATCAAACTCCCTGAAAGAAAGT +GCCAATATTAACACTGTTCGGGCATCGCTTATTGTTGAAGAATGCACAAGACTTGGTTTGAT Gm10 Glyma10g00200.1 CDS_7 11534 11698 - 1 7 + 16 24 33 GGAATATACTGATCCTCACAAAGGTTCTCTTAACCTTGCTGTG +AAGTTGCCTAATGGTTCTCTAAAGCCAGACCTGGGGCCAAAAACATATATTGCTTATGGATTTCTTCAG +GAGCTTGGACGTGGTGATTCAGTGACTAAGCTCCATTGTGATATGTCTGATGC Gm10 Glyma10g00200.1 CDS_6 12777 12823 - 0 0 + 8 8 11 GCTTTTTACACAAATTTCATTCTTCACCTTCATTCTCATGAAG +ATTA Gm10 Glyma10g00200.1 CDS_5 12958 13022 - 1 4 + 4 9 10 GCTGTAGATCTTCAGTACAAGTATTTAAGGCATTTTCAGTGGC +ATTGGGAAAAGGGGGAACCGCT Gm10 Glyma10g00200.1 CDS_4 13348 13502 - 0 4 + 20 24 33 GAGGATCCTTATCTTCTTCTTGGGTTCTGAAAAAATAAAAACA +TCAACTCCCTCCTGTATTCATGCTTGGTGTTGGTTATGTGTCATGTCTGGTCTACTATTGACACATGTA +AATCCAAATGCACTAGAGCTGCCCTCAAAAAACTCAATTTGGT Gm10 Glyma10g00200.1 CDS_3 13713 13829 - 0 2 + 13 15 25 GATTGAGTTAGATGAGCTGGAAAGTGTTTCCATTCTGTCAATG +ACCCTTGCATGGGATGAATTTTCCTTCTCTACTTTTCAAGAAGCCCATTATTCACTTCAAGATTCTCTA +GACCA Gm10 Glyma10g00200.1 CDS_2 13915 14209 - 8 11 + 40 59 88 GCTGCTCTTCTCCCTCAGCTGCTCTGTCTCCGGCGTTGACATT +GGAGGAGGGACTTGAGAAACTGAAGGAGGCTCTCCAAATCTTGAATTCTCCTTCCCCTTCTTCCCCTAC +TGGATTCCTTAGGTTTCAGGTGGCGCTCCCTCCCAGTCCTAAGACCTTCACTTTGTTTTGCTCCCAACC +CCACTCCTCCTCGGTCTTTCCTCTCATTTATGTTTCCAAGAACGACGCCGACTCTAAATCACTCTATGT +CAATTTGGATGATAATGTGAGTCACCGGAAAGAAAGGTTCTTTTT Gm10 Glyma10g00200.1 CDS_1 14242 14286 - 2 1 + 9 12 21 GATCTCGTTCCCCCTCACCACATAACACCCTGTCGTTCCCTTC +AC Gm10 Glyma10g00210.1 CDS_7 15480 15722 - 4 7 + 26 37 47 GGTACGATCAAACACATTGGGAGCTGTTGGATTCCTTGGGGAT +AGCAGAAGAATCAATGTTGCCATCACAAGAGCACGCAAACATTTAGCTTTGGTCTGTGACAGCTCGACT +ATATGCCACAATACCTTCTTAGCAAGGCTTCTGCGTCATATTAGACACTTTGGTAGGGTGAAGCATGCA +GAACCAGGTAGTTTTGGAGGATATGGACTTGGGATGAATCCAATATTACCTTCCATTAATTA Gm10 Glyma10g00210.1 CDS_6 16625 16782 - 1 9 + 15 25 32 GGTGTTAGCCCAACAGCTATTGCAGTGCAATCCCCTTATGTTG +CTCAAGTACAACTTTTGAGGGACAAGCTTGATGAATTTCCAGAAGCAGCAGGTACTGAGGTTGCAACCA +TTGACAGTTTTCAAGGTCGGGAAGCTGATGCAGTAATTTTATCCAT Gm10 Glyma10g00210.1 CDS_5 17595 17763 - 2 9 + 16 27 34 GCCTACTTGGATAACACAATGCCCGCTGCTATTGCTAGATACT +AGAATGCCATATGGAAGTCTGTCAGTTGGTTGTGAAGAGCATCTAGACCCGGCTGGAACAGGCTCACTT +TATAATGAAGGAGAAGCTGAGATAGTTTTGCAGCATGTATTTTCCTTAATCTATGCC Gm10 Glyma10g00210.1 CDS_4 18046 19077 - 12 42 + 91 145 173 GCGCAACTGTGAAGCTTTAATGCTGCTTCAGAAGAATGGTTTA +CGAAAGAAGAATCCTTCAATTTCTGTTGTTGCTACACTGTTTGGAGATGGGGAAGATGTTGCATGGCTT +GAGAAAAATCATTTGGCTGACTGGGCAGAAGAAAAATTGGATGGAAGATTAGGAAATGAAACCTTTGAT +GATTCTCAGTGGAGAGCAATTGCAATGGGTTTGAATAAAAAGAGGCCTGTATTGGTTATCCAAGGCCCT +CCTGGTACAGGCAAGACTGGTTTGCTCAAGCAACTTATAGCATGTGCTGTTCAGCAAGGTGAAAGGGTT +CTTGTTACAGCACCTACTAATGCAGCTGTTGATAACATGGTAGAAAAGCTTTCAAATGTTGGATTAAAT +ATAGTGCGGGTTGGAAATCCAGCTCGTATATCAAAAACAGTGGGATCAAAGTCTTTGGAAGAAATTGTA +AATGCTAAGCTTGCAAGTTTTCGAGAAGAGTATGAGAGGAAGAAGTCAGATCTAAGAAAAGATCTAAGA +CATTGTTTAAGGGATGATTCACTAGCTTCAGGCATACGCCAACTTCTGAAGCAACTGGGAAGGTCACTG +AAGAAAAAGGAAAAGCAGACCGTAATTGAAGTTCTGTCTAGTGCTCAAGTTGTGGTTGCCACTAATACT +GGAGCAGCTGACCCTTTGGTTCGAAGGCTAGATACCTTTGATTTGGTTGTCATAGATGAAGCGGGACAG +GCAATTGAACCCTCTTGCTGGATTCCTATATTGCAGGGAAAGCGCTGCATTCTTGCTGGTGATCAATGC +CAACTTGCTCCTGTCATATTATCTAGAAAGGCCTTAGAAGTTGGTCTAGGAATATCTCTACTGGAGAGA +GCTGCAACTTTGCATGAAGGGATTCTCACCACTAGGTTAACAACACAATACCGTATGAATGATGCAATT +GCTAGTTGGGCTTCAAAGGAGATGTACGGAGGATTATTGAAGTCCTCTGAGACTGTCTTTTCTCATCTT +CTAGTAGACTCTCCTTTTGTTAA Gm10 Glyma10g00210.1 CDS_3 21109 21392 - 4 8 + 30 42 52 GGACTTGGGGGAATGCATTTGGTGTTATTCAAGGTTGAAGGGA +ACCACCGGTTACCACCAACCACTCTTTCACCTGGAGACATGGTTTGTGTGAGAACATATGACAGCATGG +GTGCAATTACAACTTCTTGCATACAAGGATTTGTGAACAGTTTTGGGGATGATGGCTATAGTATTACAG +TTGCTTTAGAGTCACGCCATGGTGATCCTACATTCTCTAAACTATTTGGGAAGAGTGTGCGCATTGACC +GTATTCAAGGACTGGCTGATACACTTACTTATGA Gm10 Glyma10g00210.1 CDS_2 24354 24662 - 9 13 + 32 54 66 GCTCAGCATAGGGCGGTTGTGAGAAAGATAACGCAACCCAAGA +GTGTCCAAGGCGTTTTAGGAATGGACTTCGAAAAGGTCAAAGCATTACAGCACAGGATTGACGAATTCA +CCACCCATATGTCAGAACTACTCCGTATTGAAAGGGATGCTGAGTTGGAGTTTACTCAGGAGGAATTGG +ATGCTGTTCCTAAACCAGATGATACTTCTGATTCTTCAAAAACGATTGATTTCTTGGTTAGCCATAGCC +AGCCTCAACAAGAACTCTGCGACACCATTTGTAATTTAAACGCTATCAGTACCTCTACA
Here's the second input file, as given in the OP:
assembly position strand class mc h 10 8691 + CG 12 12 10 8692 - CG 3 3 10 8705 + CHG 7 18 10 8707 - CHG 2 4 10 8717 + CG 25 25 10 8718 - CG 6 6 10 8984 + CG 7 7 10 8991 + CG 6 9
And finally, here's the output based on all the above:
Chromosome Gene Feature Start END CG_count_by_C CHG_co +unt_by_C CHH_count_by_C Gm10 Glyma10g00200.1 CDS_11 8569 8705 0.143 0.071 0.000 Gm10 Glyma10g00200.1 CDS_9 8885 9079 0.040 0.000 0.000 Gm10 Glyma10g0test.1 CDS_11 8569 8705 0.143 0.071 0.000


Comment on Re^3: Comparing and getting information from two large files and appending it in a new file
Select or Download Code
Re^4: Comparing and getting information from two large files and appending it in a new file
by perlkhan77 (Acolyte) on Mar 31, 2012 at 23:37 UTC

    Thanks graff the code takes around :- 4115 wallclock secs (4112.37 usr + 0.28 sys = 4112.65 CPU) for running one methylome file that is similar to file2 in the above comments.

    I have been programming for quite some time but I have mostly used cookbook solutions to make things work and that seem to have stunted my learning in perl or programming as a whole of solutions like indexing. What would be a good way to learn such tricks as indexing e.t.c. Thanks again for all your help

      What would be a good way to learn such tricks as indexing e.t.c.

      in the context of setting up a table on a reasonably mature relational database engine (mysql, postgres, etc), creating an index on a given field is simply a matter of telling the database engine that you want that field to be indexed. The engine handles the rest for you -- that's one of the things that makes database engines so attractive. For example:

      CREATE TABLE genome ( id int AUTO_INCREMENT PRIMARY KEY, start int, end int, strand varchar(6), ... -- include other fields as needed INDEX start, INDEX end ... -- index other fields that are often useful in queries )
      By declaring that those two integer fields are to be indexed, mysql will take care of all the back-end work to make sure that queries like the following would execute with a minimum amount of time spent searching and comparing:
      -- assume your current input data record has a "position" value of 876 +5: select * from genome where start <= 8765 and end >= 8765
      As for your benchmark results that you reported, do you consider those to be good or bad? How much longer is that relative to the time it takes just to read the larger input file and do little else (e.g. how long does it take to run a simple one-liner, like:  perl -ne '$sz+=length(); END {print $sz,"\n"}' )?
Re^4: Comparing and getting information from two large files and appending it in a new file
by perlkhan77 (Acolyte) on Apr 01, 2012 at 01:08 UTC

    By indexing tricks I was referring to perl I am aware of the indexing that happens in mysql. Also the time I gave in the previous post was for your previous program. Your new program takes 4297 wallclock secs (4294.19 usr + 0.29 sys = 4294.48 CPU).

    As for your benchmark results that you reported, do you consider those to be good or bad? How much longer is that relative to the time it takes just to read the larger input file and do little else

    It does not take long to read the larger file what takes longer is the looping and in matching. Previously there were two loops (a for inside a while) but now there are three (a for inside a for inside a while) due to HOHOH. For a quickfix solution this is ok but I would have to eventually use mysql but even in mysql I would have to update a particular class match every time if a particular position falls in that range for that gene but hopefully it would be faster than this.

    Thanks again for all the help.

Re^4: Comparing and getting information from two large files and appending it in a new file
by perlkhan77 (Acolyte) on Apr 01, 2012 at 01:32 UTC

    Hi Graff one more thing about the result file being produced the number of lines having Gm10 assembly in Methylation.gtf are 26637 and thus the final result should have the same number with 0 value for those genes which have no CG CHG or CHH count while right now it only prints 8626 lines including the header. Sorry to trouble you about that but if you can let me know what changes should I make in the code to make it possible

    Thanks again

      To get one line of output for every line in your first input file, there's few changes.
      ... my %methrange; my %methhash; # this line had been further down in my prev.version, j +ust move it up ... if ( /^\s*Gm10/ ) { my ( $bgn, $end, $methcount ) = (split)[3,4,10]; $methrange{$bgn}{$end}{$_} = undef; $methhash{$_}(methcount} = $methcount; # moved up from below } ... for my $end ( keys %{$methrange{$bgn}} ) { if ( $position <= $end ) { for my $match ( keys %{$methrange{$bgn}{$end}} ) { $methhash{$match}{$class}++; # methcount was +moved from here } } } ...
      As for the benchmark, you said that your OP version "took forever". Was "forever" more than an hour and a half? (Did my version yield any improvement at all?) Do you have specific constraints about how much time can be taken up by a single run? If not, I'd say focus more on making sure the output is correct, rather than how long it takes to produce the output.
Re^4: Comparing and getting information from two large files and appending it in a new file
by perlkhan77 (Acolyte) on Apr 01, 2012 at 03:14 UTC

    Your's is a lot more faster than my code mine did not yield result even after running for six hours. Thanks

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://962807]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (14)
As of 2014-12-22 18:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (126 votes), past polls