Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Re: Comparing and getting information from two large files and appending it in a new file

by graff (Chancellor)
on Mar 31, 2012 at 03:05 UTC ( #962711=note: print w/ replies, xml ) Need Help??


in reply to Comparing and getting information from two large files and appending it in a new file

Well, your first problem is that you are doing a lot of the same work over and over again on the @genome array every time you read a line from INPUT (the "mc_11268_10.txt" file). You should build a data structure for the "genome" information as you read the data from the first file, so that you can use the data more efficiently while reading the second file.

Since you seem to be using whole lines of the first file as hash keys, you could just load that hash while reading the first file. Since most of the effort when reading the second file is to locate a matching range for each line of data, the crux of the problem is to figure out how to do this as quickly as possible.

And, why do you use Tie::Autotie "Tie::IxHash"? Since you sort the keys to do your output, I think you could use just a plain old hash.

Given the sample data you've shown (which I think has a mistake: did you mean to put '#' at the start of both lines in the first file?), it's hard to tell what proper output should look like, because the posted data produces no output with the posted code. So, I can't really be sure whether the alternative below would do the right thing -- I can only tell that it produces the same result as the OP code when using the OP data samples.

I've taken the trouble to make it work with "use strict", just to show you that doing this really doesn't involve a lot of effort, and is well worth it. The logic is very different: build a couple hashes while reading the first file, and also create a sorted array of the "Start" values, to reduce the amount of work that needs to be done while reading the second file. Overall, this uses "split" a lot less often, and when looking to see which range a given INPUT line falls into, it stops looping over ranges as soon as the right one is found.

(UPDATE: Shortly after posting, I realized that my range-matching logic might produce lower counts in its output than yours -- in particular, if different "genome" records in the first file have overlapping ranges, and if a given input from the second file could match multiple ranges that happen to have different starting values, then my code will only increment the first matching range. It wouldn't be hard to fix, but if that condition never comes up, there's no need to fix it.)

#!/usr/bin/perl use strict; use Benchmark; my %methhash; my %methrange; my $t0 = Benchmark->new; open(IN1,"Methylation.gtf"); while (<IN1>) { chomp; if ( /^Gm10/ ) { my ( $bgn, $end, $methcount ) = (split /\t/ )[3,4,10]; $methhash{$_}{methcount} = $methcount; $methrange{$bgn}{$end} = $_; } } my @sorted_meth_starts = sort {$a<=>$b} keys %methrange; open(INPUT, "mc_11268_10.txt"); while(<INPUT>) { next unless /^(\d+)\t(\d+)\t\S\t(\w+)/; # skip header line and an +y blank lines chomp; my ( $meth, $position, $strand ) = ( "Gm$1", $2, $3 ); my $indx = $#sorted_meth_starts; while ( $indx >= 0 and $position < $sorted_meth_starts[$indx] ) { $indx--; } next if ( $indx < 0 ); # skip if $position is lower than start of +lowest range my $bgn = $sorted_meth_starts[$indx]; for my $end ( keys %{$methrange{$bgn}} ) { next if ( $position > $end ); # skip if position is outside th +is range my $match = $methrange{$bgn}{$end}; next if ( $match !~ /^$meth/ ); # skip if assembly doesn't mat +ch if ( $strand =~ /^(CH?G|CHH)$/ ) { $methhash{$match}{$strand}++; } } } open(FILEIT,">res_11268_10.txt"); print FILEIT join("\t",qw(Chromosome Gene Feature Start END CG_count_b +y_C CHG_count_by_C CHH_count_by_C)),"\n"; for my $ge (sort keys %methhash) { my ( $chh, $chg, $cg, $total_count ); if ( $methhash{$ge}{methcount} == 0 ) { $chh = $chg = $cg = "INF"; } else { my $denom = $methhash{$ge}{methcount}; $chh = sprintf( "%.3f", $methhash{$ge}{CHH} / $denom ); $chg = sprintf( "%.3f", $methhash{$ge}{CHG} / $denom ); $cg = sprintf( "%.3f", $methhash{$ge}{CG} / $denom ); $total_count = sprintf( "%.3f", ( $methhash{$ge}{CHH}+$methhas +h{$ge}{CHG}+$methhash{$ge}{CG}) / $denom ); } print FILEIT join( "\t", ( split /\t/, $ge )[0..4], $cg, $chg, $ch +h ),"\n"; } my $t1 = Benchmark->new; my $td = timediff($t1, $t0); print "the code took:",timestr($td),"\n";
If that produces output that isn't right, and if you can't grok how it works (or how it fails), you'll need to post some relevant data samples that produce output (both good and bad).


Comment on Re: Comparing and getting information from two large files and appending it in a new file
Select or Download Code
Re^2: Comparing and getting information from two large files and appending it in a new file
by perlkhan77 (Acolyte) on Mar 31, 2012 at 04:11 UTC

    The lines of the First file are long so they are being printed in multiple lines. Previously I only gave one line now I am including many lines in the following file sample for file 1 I hope file2 format is understandable.

    Gm10 Glyma10g00200.1 CDS_11 8569 8705 - 2 2 6 + 10 14 GAAGAACCAAAATCGCTGATTTTGAAAAATAGGGGGACTGAATGTCTTGATTT +GGAAATAGGGGGACCAAAATTGCGGATTGAGCATTATAGGGGACCAATAGTGTATTTTAGCCTTTATTT +TATTTTGAATGTGTG Gm10 Glyma10g00200.1 CDS_10 8822 8822 - 0 0 0 + 0 0 G Gm10 Glyma10g00200.1 CDS_9 8885 9079 - 2 13 2 +3 38 50 GTATTTGTGTATTGCTCCTGGATCTAGACCTTCCCCTGTTGCTGCTGCTGCT +GCAAGTCACAAATTAATCACATGTATTTCATGCTTCGATGAGCGTTCACTTGCATATCATGCTGTTGGA +TATGGAAGAGGATCTCACATTCCAGCAGTAGCCATTACATCATCAGGCACTGCAGTTTCCAACCTTCTT +CCTGC Gm10 Glyma10g00200.1 CDS_8 9151 9255 - 2 2 11 + 15 19 GTTGGATCATGCCAATGAATTATCAAACTCCCTGAAAGAAAGTGCCAATATTA +ACACTGTTCGGGCATCGCTTATTGTTGAAGAATGCACAAGACTTGGTTTGAT Gm10 Glyma10g00200.1 CDS_7 11534 11698 - 1 7 +16 24 33 GGAATATACTGATCCTCACAAAGGTTCTCTTAACCTTGCTGTGAAGTTGCC +TAATGGTTCTCTAAAGCCAGACCTGGGGCCAAAAACATATATTGCTTATGGATTTCTTCAGGAGCTTGG +ACGTGGTGATTCAGTGACTAAGCTCCATTGTGATATGTCTGATGC Gm10 Glyma10g00200.1 CDS_6 12777 12823 - 0 0 +8 8 11 GCTTTTTACACAAATTTCATTCTTCACCTTCATTCTCATGAAGATTA Gm10 Glyma10g00200.1 CDS_5 12958 13022 - 1 4 +4 9 10 GCTGTAGATCTTCAGTACAAGTATTTAAGGCATTTTCAGTGGCATTGGGAAAA +GGGGGAACCGCT Gm10 Glyma10g00200.1 CDS_4 13348 13502 - 0 4 +20 24 33 GAGGATCCTTATCTTCTTCTTGGGTTCTGAAAAAATAAAAACATCAACTCC +CTCCTGTATTCATGCTTGGTGTTGGTTATGTGTCATGTCTGGTCTACTATTGACACATGTAAATCCAAA +TGCACTAGAGCTGCCCTCAAAAAACTCAATTTGGT Gm10 Glyma10g00200.1 CDS_3 13713 13829 - 0 2 +13 15 25 GATTGAGTTAGATGAGCTGGAAAGTGTTTCCATTCTGTCAATGACCCTTGC +ATGGGATGAATTTTCCTTCTCTACTTTTCAAGAAGCCCATTATTCACTTCAAGATTCTCTAGACCA Gm10 Glyma10g00200.1 CDS_2 13915 14209 - 8 11 + 40 59 88 GCTGCTCTTCTCCCTCAGCTGCTCTGTCTCCGGCGTTGACATTGGAGGAG +GGACTTGAGAAACTGAAGGAGGCTCTCCAAATCTTGAATTCTCCTTCCCCTTCTTCCCCTACTGGATTC +CTTAGGTTTCAGGTGGCGCTCCCTCCCAGTCCTAAGACCTTCACTTTGTTTTGCTCCCAACCCCACTCC +TCCTCGGTCTTTCCTCTCATTTATGTTTCCAAGAACGACGCCGACTCTAAATCACTCTATGTCAATTTG +GATGATAATGTGAGTCACCGGAAAGAAAGGTTCTTTTT Gm10 Glyma10g00200.1 CDS_1 14242 14286 - 2 1 +9 12 21 GATCTCGTTCCCCCTCACCACATAACACCCTGTCGTTCCCTTCAC Gm10 Glyma10g00210.1 CDS_7 15480 15722 - 4 7 +26 37 47 GGTACGATCAAACACATTGGGAGCTGTTGGATTCCTTGGGGATAGCAGAAG +AATCAATGTTGCCATCACAAGAGCACGCAAACATTTAGCTTTGGTCTGTGACAGCTCGACTATATGCCA +CAATACCTTCTTAGCAAGGCTTCTGCGTCATATTAGACACTTTGGTAGGGTGAAGCATGCAGAACCAGG +TAGTTTTGGAGGATATGGACTTGGGATGAATCCAATATTACCTTCCATTAATTA Gm10 Glyma10g00210.1 CDS_6 16625 16782 - 1 9 +15 25 32 GGTGTTAGCCCAACAGCTATTGCAGTGCAATCCCCTTATGTTGCTCAAGTA +CAACTTTTGAGGGACAAGCTTGATGAATTTCCAGAAGCAGCAGGTACTGAGGTTGCAACCATTGACAGT +TTTCAAGGTCGGGAAGCTGATGCAGTAATTTTATCCAT Gm10 Glyma10g00210.1 CDS_5 17595 17763 - 2 9 +16 27 34 GCCTACTTGGATAACACAATGCCCGCTGCTATTGCTAGATACTAGAATGCC +ATATGGAAGTCTGTCAGTTGGTTGTGAAGAGCATCTAGACCCGGCTGGAACAGGCTCACTTTATAATGA +AGGAGAAGCTGAGATAGTTTTGCAGCATGTATTTTCCTTAATCTATGCC Gm10 Glyma10g00210.1 CDS_4 18046 19077 - 12 42 + 91 145 173 GCGCAACTGTGAAGCTTTAATGCTGCTTCAGAAGAATGGTTTACGAA +AGAAGAATCCTTCAATTTCTGTTGTTGCTACACTGTTTGGAGATGGGGAAGATGTTGCATGGCTTGAGA +AAAATCATTTGGCTGACTGGGCAGAAGAAAAATTGGATGGAAGATTAGGAAATGAAACCTTTGATGATT +CTCAGTGGAGAGCAATTGCAATGGGTTTGAATAAAAAGAGGCCTGTATTGGTTATCCAAGGCCCTCCTG +GTACAGGCAAGACTGGTTTGCTCAAGCAACTTATAGCATGTGCTGTTCAGCAAGGTGAAAGGGTTCTTG +TTACAGCACCTACTAATGCAGCTGTTGATAACATGGTAGAAAAGCTTTCAAATGTTGGATTAAATATAG +TGCGGGTTGGAAATCCAGCTCGTATATCAAAAACAGTGGGATCAAAGTCTTTGGAAGAAATTGTAAATG +CTAAGCTTGCAAGTTTTCGAGAAGAGTATGAGAGGAAGAAGTCAGATCTAAGAAAAGATCTAAGACATT +GTTTAAGGGATGATTCACTAGCTTCAGGCATACGCCAACTTCTGAAGCAACTGGGAAGGTCACTGAAGA +AAAAGGAAAAGCAGACCGTAATTGAAGTTCTGTCTAGTGCTCAAGTTGTGGTTGCCACTAATACTGGAG +CAGCTGACCCTTTGGTTCGAAGGCTAGATACCTTTGATTTGGTTGTCATAGATGAAGCGGGACAGGCAA +TTGAACCCTCTTGCTGGATTCCTATATTGCAGGGAAAGCGCTGCATTCTTGCTGGTGATCAATGCCAAC +TTGCTCCTGTCATATTATCTAGAAAGGCCTTAGAAGTTGGTCTAGGAATATCTCTACTGGAGAGAGCTG +CAACTTTGCATGAAGGGATTCTCACCACTAGGTTAACAACACAATACCGTATGAATGATGCAATTGCTA +GTTGGGCTTCAAAGGAGATGTACGGAGGATTATTGAAGTCCTCTGAGACTGTCTTTTCTCATCTTCTAG +TAGACTCTCCTTTTGTTAA Gm10 Glyma10g00210.1 CDS_3 21109 21392 - 4 8 +30 42 52 GGACTTGGGGGAATGCATTTGGTGTTATTCAAGGTTGAAGGGAACCACCGG +TTACCACCAACCACTCTTTCACCTGGAGACATGGTTTGTGTGAGAACATATGACAGCATGGGTGCAATT +ACAACTTCTTGCATACAAGGATTTGTGAACAGTTTTGGGGATGATGGCTATAGTATTACAGTTGCTTTA +GAGTCACGCCATGGTGATCCTACATTCTCTAAACTATTTGGGAAGAGTGTGCGCATTGACCGTATTCAA +GGACTGGCTGATACACTTACTTATGA Gm10 Glyma10g00210.1 CDS_2 24354 24662 - 9 13 + 32 54 66 GCTCAGCATAGGGCGGTTGTGAGAAAGATAACGCAACCCAAGAGTGTCCA +AGGCGTTTTAGGAATGGACTTCGAAAAGGTCAAAGCATTACAGCACAGGATTGACGAATTCACCACCCA +TATGTCAGAACTACTCCGTATTGAAAGGGATGCTGAGTTGGAGTTTACTCAGGAGGAATTGGATGCTGT +TCCTAAACCAGATGATACTTCTGATTCTTCAAAAACGATTGATTTCTTGGTTAGCCATAGCCAGCCTCA +ACAAGAACTCTGCGACACCATTTGTAATTTAAACGCTATCAGTACCTCTACA

    There are a total of 11 tab separated values in each line 1.) Chromosome (in this case Gm10) 2.) gene id(or Gene) 3.) Feature( which could be of many types CDS_\d or 5'UTR_\d or 3'UTR_\d e.t.c 4.) Start position of the gene 5.) End position 6.) Strand type + or - 7 to 11 ) different counts of different types 12) the Sequence.

    Thanks graff for doing all the work. I'll try to see how your code works. But my other issue is I have 80 such files to compare with Methylation.gtf. I hope it can be done more efficiently. Thanks btw for all help appreciate it

Re^2: Comparing and getting information from two large files and appending it in a new file
by perlkhan77 (Acolyte) on Mar 31, 2012 at 05:01 UTC

    Hi Graff I was going over your code and I think there are few issues in the begining that you might suggest a way out of.

    $methrange{$bgn}{$end} = $_;

    Here you are using begin and end as keys of the HOH %methrange but there are some CDS in the file which can have same start and end but belong to different gene ids how can we account for such anomalies using your code. Also I tried running your code for about half an hour it did not produce any result thus efficiency is still an issue here. What'd you guys say about trying DBDsqlite or mysql for this task would the overhead be more or less??

      I tried running your code for about half an hour it did not produce any result thus efficiency is still an issue here.

      Given the larger sample of data that you gave for the first file, it's clear that the code in my initial reply wasn't working as intended. Sorry about that. I'll give it one more try and let you know how that comes out, but in the meantime...

      What'd you guys say about trying DBDsqlite or mysql for this task would the overhead be more or less??

      Actually, when you've got a clear layout of the logical structure of the inputs, the decision process and the desired outputs, it's reasonably likely that a relational-table/SQL-based solution will help a lot. It has the same prerequisites as writing a perl script in order to get the right answers: having the right way to describe the task. Once you have that, it'll probably be a lot easier to write a suitable SQL statement to express the conditions and accomplish the operations that need to be done. If nothing else, just being able to deal with relevant subsets of the data at a time, rather than having to slog through one huge, monolithic stream, would be a win.

      I don't know enough about SQLite to say how good it would be with optimizing queries, but Mysql is easy to set up, easy to use, and quite effective for very large data sets (and it's very well documented). So long as you make sure to apply indexing on the table fields that get used a lot in queries, you should see pretty zippy results. You get a lot of built-in efficiency for free.

      ... you are using begin and end as keys of the HOH %methrange but there are some CDS in the file which can have same start and end but belong to different gene ids how can we account for such anomalies using your code.

      Here's a version that will keep track of all possible matches, even when two entries from the first file happen to have identical "Start - End" ranges. I tested that by adding a "dummy" entry to your larger data sample for the "Methylation.gtf" file, copying one of the lines that would match, and changing the "Gene" field to make it distinguishable.

      Below are the updated code, the two input files, and the resulting output (file name "res_11268_10.txt" as per the OP code). If I remove my "test" line from the gtf file, I get just two lines of output instead of three. (I also ran the OP code on the same data, and got the same result, once I handled the initial spaces in the gtf data.)

      I'd be curious how the timing looks on a larger data set (but I expect a proper mysql solution would run faster).

        Thanks graff the code takes around :- 4115 wallclock secs (4112.37 usr + 0.28 sys = 4112.65 CPU) for running one methylome file that is similar to file2 in the above comments.

        I have been programming for quite some time but I have mostly used cookbook solutions to make things work and that seem to have stunted my learning in perl or programming as a whole of solutions like indexing. What would be a good way to learn such tricks as indexing e.t.c. Thanks again for all your help

        By indexing tricks I was referring to perl I am aware of the indexing that happens in mysql. Also the time I gave in the previous post was for your previous program. Your new program takes 4297 wallclock secs (4294.19 usr + 0.29 sys = 4294.48 CPU).

        As for your benchmark results that you reported, do you consider those to be good or bad? How much longer is that relative to the time it takes just to read the larger input file and do little else

        It does not take long to read the larger file what takes longer is the looping and in matching. Previously there were two loops (a for inside a while) but now there are three (a for inside a for inside a while) due to HOHOH. For a quickfix solution this is ok but I would have to eventually use mysql but even in mysql I would have to update a particular class match every time if a particular position falls in that range for that gene but hopefully it would be faster than this.

        Thanks again for all the help.

        Hi Graff one more thing about the result file being produced the number of lines having Gm10 assembly in Methylation.gtf are 26637 and thus the final result should have the same number with 0 value for those genes which have no CG CHG or CHH count while right now it only prints 8626 lines including the header. Sorry to trouble you about that but if you can let me know what changes should I make in the code to make it possible

        Thanks again

        Your's is a lot more faster than my code mine did not yield result even after running for six hours. Thanks

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://962711]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (7)
As of 2014-09-24 02:02 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (244 votes), past polls