Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
Well, your first problem is that you are doing a lot of the same work over and over again on the @genome array every time you read a line from INPUT (the "mc_11268_10.txt" file). You should build a data structure for the "genome" information as you read the data from the first file, so that you can use the data more efficiently while reading the second file.

Since you seem to be using whole lines of the first file as hash keys, you could just load that hash while reading the first file. Since most of the effort when reading the second file is to locate a matching range for each line of data, the crux of the problem is to figure out how to do this as quickly as possible.

And, why do you use Tie::Autotie "Tie::IxHash"? Since you sort the keys to do your output, I think you could use just a plain old hash.

Given the sample data you've shown (which I think has a mistake: did you mean to put '#' at the start of both lines in the first file?), it's hard to tell what proper output should look like, because the posted data produces no output with the posted code. So, I can't really be sure whether the alternative below would do the right thing -- I can only tell that it produces the same result as the OP code when using the OP data samples.

I've taken the trouble to make it work with "use strict", just to show you that doing this really doesn't involve a lot of effort, and is well worth it. The logic is very different: build a couple hashes while reading the first file, and also create a sorted array of the "Start" values, to reduce the amount of work that needs to be done while reading the second file. Overall, this uses "split" a lot less often, and when looking to see which range a given INPUT line falls into, it stops looping over ranges as soon as the right one is found.

(UPDATE: Shortly after posting, I realized that my range-matching logic might produce lower counts in its output than yours -- in particular, if different "genome" records in the first file have overlapping ranges, and if a given input from the second file could match multiple ranges that happen to have different starting values, then my code will only increment the first matching range. It wouldn't be hard to fix, but if that condition never comes up, there's no need to fix it.)

#!/usr/bin/perl use strict; use Benchmark; my %methhash; my %methrange; my $t0 = Benchmark->new; open(IN1,"Methylation.gtf"); while (<IN1>) { chomp; if ( /^Gm10/ ) { my ( $bgn, $end, $methcount ) = (split /\t/ )[3,4,10]; $methhash{$_}{methcount} = $methcount; $methrange{$bgn}{$end} = $_; } } my @sorted_meth_starts = sort {$a<=>$b} keys %methrange; open(INPUT, "mc_11268_10.txt"); while(<INPUT>) { next unless /^(\d+)\t(\d+)\t\S\t(\w+)/; # skip header line and an +y blank lines chomp; my ( $meth, $position, $strand ) = ( "Gm$1", $2, $3 ); my $indx = $#sorted_meth_starts; while ( $indx >= 0 and $position < $sorted_meth_starts[$indx] ) { $indx--; } next if ( $indx < 0 ); # skip if $position is lower than start of +lowest range my $bgn = $sorted_meth_starts[$indx]; for my $end ( keys %{$methrange{$bgn}} ) { next if ( $position > $end ); # skip if position is outside th +is range my $match = $methrange{$bgn}{$end}; next if ( $match !~ /^$meth/ ); # skip if assembly doesn't mat +ch if ( $strand =~ /^(CH?G|CHH)$/ ) { $methhash{$match}{$strand}++; } } } open(FILEIT,">res_11268_10.txt"); print FILEIT join("\t",qw(Chromosome Gene Feature Start END CG_count_b +y_C CHG_count_by_C CHH_count_by_C)),"\n"; for my $ge (sort keys %methhash) { my ( $chh, $chg, $cg, $total_count ); if ( $methhash{$ge}{methcount} == 0 ) { $chh = $chg = $cg = "INF"; } else { my $denom = $methhash{$ge}{methcount}; $chh = sprintf( "%.3f", $methhash{$ge}{CHH} / $denom ); $chg = sprintf( "%.3f", $methhash{$ge}{CHG} / $denom ); $cg = sprintf( "%.3f", $methhash{$ge}{CG} / $denom ); $total_count = sprintf( "%.3f", ( $methhash{$ge}{CHH}+$methhas +h{$ge}{CHG}+$methhash{$ge}{CG}) / $denom ); } print FILEIT join( "\t", ( split /\t/, $ge )[0..4], $cg, $chg, $ch +h ),"\n"; } my $t1 = Benchmark->new; my $td = timediff($t1, $t0); print "the code took:",timestr($td),"\n";
If that produces output that isn't right, and if you can't grok how it works (or how it fails), you'll need to post some relevant data samples that produce output (both good and bad).

In reply to Re: Comparing and getting information from two large files and appending it in a new file by graff
in thread Comparing and getting information from two large files and appending it in a new file by perlkhan77

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and a log crumbles through the grate...

    How do I use this? | Other CB clients
    Other Users?
    Others romping around the Monastery: (9)
    As of 2018-03-18 21:14 GMT
    Find Nodes?
      Voting Booth?
      When I think of a mole I think of:

      Results (230 votes). Check out past polls.