in reply to Comparing and getting information from two large files and appending it in a new file
Since you seem to be using whole lines of the first file as hash keys, you could just load that hash while reading the first file. Since most of the effort when reading the second file is to locate a matching range for each line of data, the crux of the problem is to figure out how to do this as quickly as possible.
And, why do you use Tie::Autotie "Tie::IxHash"? Since you sort the keys to do your output, I think you could use just a plain old hash.
Given the sample data you've shown (which I think has a mistake: did you mean to put '#' at the start of both lines in the first file?), it's hard to tell what proper output should look like, because the posted data produces no output with the posted code. So, I can't really be sure whether the alternative below would do the right thing -- I can only tell that it produces the same result as the OP code when using the OP data samples.
I've taken the trouble to make it work with "use strict", just to show you that doing this really doesn't involve a lot of effort, and is well worth it. The logic is very different: build a couple hashes while reading the first file, and also create a sorted array of the "Start" values, to reduce the amount of work that needs to be done while reading the second file. Overall, this uses "split" a lot less often, and when looking to see which range a given INPUT line falls into, it stops looping over ranges as soon as the right one is found.
(UPDATE: Shortly after posting, I realized that my range-matching logic might produce lower counts in its output than yours -- in particular, if different "genome" records in the first file have overlapping ranges, and if a given input from the second file could match multiple ranges that happen to have different starting values, then my code will only increment the first matching range. It wouldn't be hard to fix, but if that condition never comes up, there's no need to fix it.)
If that produces output that isn't right, and if you can't grok how it works (or how it fails), you'll need to post some relevant data samples that produce output (both good and bad).#!/usr/bin/perl use strict; use Benchmark; my %methhash; my %methrange; my $t0 = Benchmark->new; open(IN1,"Methylation.gtf"); while (<IN1>) { chomp; if ( /^Gm10/ ) { my ( $bgn, $end, $methcount ) = (split /\t/ )[3,4,10]; $methhash{$_}{methcount} = $methcount; $methrange{$bgn}{$end} = $_; } } my @sorted_meth_starts = sort {$a<=>$b} keys %methrange; open(INPUT, "mc_11268_10.txt"); while(<INPUT>) { next unless /^(\d+)\t(\d+)\t\S\t(\w+)/; # skip header line and an +y blank lines chomp; my ( $meth, $position, $strand ) = ( "Gm$1", $2, $3 ); my $indx = $#sorted_meth_starts; while ( $indx >= 0 and $position < $sorted_meth_starts[$indx] ) { $indx--; } next if ( $indx < 0 ); # skip if $position is lower than start of +lowest range my $bgn = $sorted_meth_starts[$indx]; for my $end ( keys %{$methrange{$bgn}} ) { next if ( $position > $end ); # skip if position is outside th +is range my $match = $methrange{$bgn}{$end}; next if ( $match !~ /^$meth/ ); # skip if assembly doesn't mat +ch if ( $strand =~ /^(CH?G|CHH)$/ ) { $methhash{$match}{$strand}++; } } } open(FILEIT,">res_11268_10.txt"); print FILEIT join("\t",qw(Chromosome Gene Feature Start END CG_count_b +y_C CHG_count_by_C CHH_count_by_C)),"\n"; for my $ge (sort keys %methhash) { my ( $chh, $chg, $cg, $total_count ); if ( $methhash{$ge}{methcount} == 0 ) { $chh = $chg = $cg = "INF"; } else { my $denom = $methhash{$ge}{methcount}; $chh = sprintf( "%.3f", $methhash{$ge}{CHH} / $denom ); $chg = sprintf( "%.3f", $methhash{$ge}{CHG} / $denom ); $cg = sprintf( "%.3f", $methhash{$ge}{CG} / $denom ); $total_count = sprintf( "%.3f", ( $methhash{$ge}{CHH}+$methhas +h{$ge}{CHG}+$methhash{$ge}{CG}) / $denom ); } print FILEIT join( "\t", ( split /\t/, $ge )[0..4], $cg, $chg, $ch +h ),"\n"; } my $t1 = Benchmark->new; my $td = timediff($t1, $t0); print "the code took:",timestr($td),"\n";
|
---|