help speeding up my perl code

a217 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I need help speeding up my perl code and was hoping that someone could give me a few pointers.

My goal of this code is to print out lines from chr1.txt (see below) only when the ID (first column) is shared by chr1.coverage.txt as well.

Hence, I decided to save each unique ID from the "coverage" file as a hash key so that I could then sort through them and print out the corresponding lines from chr1.txt

The problem with this method I currently use is the speed of the program. I am using a nested foreach loop within a while loop, and when there are thousands of unique keys for the "coverage" file the process can take a long time.

Given that my data is well-sorted (there is only 1 chr for each file, in this case chr1, and the numbers are numerically sorted) I know that there should be a faster way to achieve the same results, but I have been struggling to find a working solution. Any help you can provide would be greatly appreciated.

Below is my current perl code

#!/usr/bin/perl -w
# perl.ID.match.pl
use strict; use warnings;

open(KEY, "<chr1.coverage.txt") or die "error reading file";
my %Chr;
my (@Key,@line);
while (my $key = <KEY>) {
    chomp $key;
    @Key=split("\t",$key);
    $Chr{$Key[0]} = undef; 
}


open(FULL, "<chr1.txt") or die "error reading file";
open(OUT, ">chr1.match.txt") or die "error reading file";

while (my $line=<FULL>)
{
    
    chomp $line;
    @line=split("\t",$line);
    foreach my $key (sort keys %Chr) 
    {
    
        if ($line[0] eq $key)
        {
        print OUT "$line\n";
        }
    }
}

close KEY;
close FULL;
close OUT;
exit;
[download]

This is chr1.coverage.txt sample data

chr1:496-588    44    0
chr1:496-588    44    0
chr1:496-588    44    0
chr1:496-588    44    0
chr1:496-588    44    0
chr1:496-588    44    0
chr1:496-588    43    0
chr1:496-588    43    0
chr1:496-588    39    0
chr1:496-588    39    0
chr1:496-588    38    0
chr1:496-588    37    0
chr1:126982-127030    56    0
chr1:126982-127030    56    0
chr1:126982-127030    56    0
chr1:126982-127030    56    0.0892857
chr1:126982-127030    56    0
chr1:126982-127030    56    0
chr1:126982-127030    56    0
chr1:126982-127030    56    0
[download]

This is chr1.txt sample data

chr1:496-588    0    0
chr1:496-588    0    0
chr1:496-588    0    0
chr1:496-588    0    0
chr1:496-588    0    0
chr1:496-588    0    0
chr1:496-588    43    0
chr1:496-588    43    0
chr1:496-588    39    0
chr1:496-588    39    0
chr1:496-588    38    0
chr1:496-588    0    0
chr1:496-588    0    0
chr1:496-588    39    0
chr1:496-588    0    0
chr1:648-719    0    0
chr1:648-719    0    0
chr1:648-719    0    0
chr1:648-719    0    0
chr1:648-719    0    0
chr1:648-719    0    0
chr1:648-719    0    0
chr1:648-719    0    0
chr1:648-719    0    0
chr1:720-795    0    0
chr1:720-795    0    0
chr1:720-795    0    0
chr1:720-795    0    0
chr1:720-795    0    0
chr1:720-795    0    0
chr1:720-795    0    0
chr1:720-795    0    0
chr1:720-795    0    0
chr1:720-795    0    0
chr1:720-795    0    0
chr1:126982-127030    0    0
chr1:126982-127030    56    0
chr1:126982-127030    56    0
chr1:126982-127030    56    0.0892857
chr1:126982-127030    56    0
chr1:126982-127030    56    0
chr1:126982-127030    56    0
chr1:126982-127030    56    0
chr1:126982-127030    0    0
chr1:126982-127030    0    0
chr1:126982-127030    0    0
chr1:126982-127030    0    0
chr1:126982-127030    0    0
chr1:126982-127030    0    0
[download]

This is the sample output chr1.match.txt

chr1:496-588    0    0
chr1:496-588    0    0
chr1:496-588    0    0
chr1:496-588    0    0
chr1:496-588    0    0
chr1:496-588    0    0
chr1:496-588    43    0
chr1:496-588    43    0
chr1:496-588    39    0
chr1:496-588    39    0
chr1:496-588    38    0
chr1:496-588    0    0
chr1:496-588    0    0
chr1:496-588    39    0
chr1:496-588    0    0
chr1:126982-127030    0    0
chr1:126982-127030    56    0
chr1:126982-127030    56    0
chr1:126982-127030    56    0.0892857
chr1:126982-127030    56    0
chr1:126982-127030    56    0
chr1:126982-127030    56    0
chr1:126982-127030    56    0
chr1:126982-127030    0    0
chr1:126982-127030    0    0
chr1:126982-127030    0    0
chr1:126982-127030    0    0
chr1:126982-127030    0    0
chr1:126982-127030    0    0
[download]

Also note that although the data may look unnecessarily repetitious, it is actually a cropped version of another file with more columns so that each specific line represents a different data point.

Comment on help speeding up my perl code Select or Download Code

Replies are listed 'Best First'.
Re: help speeding up my perl code by BrowserUk (Patriarch) on Aug 31, 2011 at 20:12 UTC
You've completely missed the point of using a hash. You do not need to iterate every key to see if it contains the value you want, you just use the value as the key. Try this. It should (untested) produce the same output and run much, much (much:), faster: #!/usr/bin/perl -w # perl.ID.match.pl use strict; use warnings; open(KEY, "<chr1.coverage.txt") or die "error reading file"; my %Chr; my (@Key,@line); while (my $key = <KEY>) { chomp $key; @Key=split("\t",$key); $Chr{$Key[0]} = undef; } open(FULL, "<chr1.txt") or die "error reading file"; open(OUT, ">chr1.match.txt") or die "error reading file"; while (my $line=<FULL>) { chomp $line; @line=split("\t",$line); if( exists $Chr{ $line[0] } ) { print OUT "$line\n"; } } close KEY; close FULL; close OUT; exit; [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re: help speeding up my perl code by MidLifeXis (Monsignor) on Aug 31, 2011 at 20:12 UTC
Instead of setting `$Chr{$key} = undef`, set it to `1`. Then, instead of looping through each key in the hash to find a match, you can just do a `if ($Chr{$line[0]})`. I only glanced over your code, so I may have missed something that would stop this from working, but there is a definite red flag when you are looping over the keys in a hash and comparing each one to some value. --MidLifeXis	[reply] [d/l] [select]
Re^2: help speeding up my perl code by SuicideJunkie (Vicar) on Sep 01, 2011 at 13:10 UTC
Or, keep setting it to undef, but test with an `if (exists $hash{$key})`	[reply] [d/l]
Re: help speeding up my perl code by jwkrahn (Abbot) on Aug 31, 2011 at 20:27 UTC
Try something like this (UNTESTED): `#!/usr/bin/perl # perl.ID.match.pl use strict; use warnings; open my $KEY, '<', 'chr1.coverage.txt' or die "error reading 'chr1.cov +erage.txt' because: $!"; my $pattern = join '\|', map /^([^\t]+)/ ? "\Q$1\E" : (), <$KEY>; close $KEY; open my $FULL, '<', 'chr1.txt' or die "error reading 'chr1.txt' becaus +e: $!"; open my $OUT, '>', 'chr1.match.txt' or die "error reading 'chr1.match. +txt' because: $!"; while ( <$FULL> ) { print $OUT $_ if /^(?:$pattern)\t/; } close $FULL; close $OUT; exit 0;` [download]	[reply] [d/l]
Re: help speeding up my perl code by thezip (Vicar) on Aug 31, 2011 at 20:10 UTC
This looks like a job for a profiler! Try Devel::NYTProf on for size... What can be asserted without proof can be dismissed without proof. - Christopher Hitchens	[reply]
Re^2: help speeding up my perl code by a217 (Novice) on Aug 31, 2011 at 20:18 UTC
That seems like an interesting package, but I already know that the slow part is the foreach loop within the while loop.	[reply]
Re^2: help speeding up my perl code by BrowserUk (Patriarch) on Aug 31, 2011 at 20:13 UTC
You do not need an over-engineered profiler to see the obvious. Try using your eyes first!	[reply]
Re^3: help speeding up my perl code by thezip (Vicar) on Aug 31, 2011 at 20:21 UTC
BrowserUK, What may be obvious to your well-experienced eyesight might not be within the realm of experience of a relatively new Perl programmer. It was with humble intentions that I offered a vehicle to broaden their Perl exposure... Fair enough? What can be asserted without proof can be dismissed without proof. - Christopher Hitchens	[reply]


Keep It Simple, Stupid
	PerlMonks