Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

help speeding up my perl code

by a217 (Novice)
on Aug 31, 2011 at 20:01 UTC ( [id://923494]=perlquestion: print w/replies, xml ) Need Help??

a217 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I need help speeding up my perl code and was hoping that someone could give me a few pointers.

My goal of this code is to print out lines from chr1.txt (see below) only when the ID (first column) is shared by chr1.coverage.txt as well.

Hence, I decided to save each unique ID from the "coverage" file as a hash key so that I could then sort through them and print out the corresponding lines from chr1.txt

The problem with this method I currently use is the speed of the program. I am using a nested foreach loop within a while loop, and when there are thousands of unique keys for the "coverage" file the process can take a long time.

Given that my data is well-sorted (there is only 1 chr for each file, in this case chr1, and the numbers are numerically sorted) I know that there should be a faster way to achieve the same results, but I have been struggling to find a working solution. Any help you can provide would be greatly appreciated.

Below is my current perl code

#!/usr/bin/perl -w # perl.ID.match.pl use strict; use warnings; open(KEY, "<chr1.coverage.txt") or die "error reading file"; my %Chr; my (@Key,@line); while (my $key = <KEY>) { chomp $key; @Key=split("\t",$key); $Chr{$Key[0]} = undef; } open(FULL, "<chr1.txt") or die "error reading file"; open(OUT, ">chr1.match.txt") or die "error reading file"; while (my $line=<FULL>) { chomp $line; @line=split("\t",$line); foreach my $key (sort keys %Chr) { if ($line[0] eq $key) { print OUT "$line\n"; } } } close KEY; close FULL; close OUT; exit;

This is chr1.coverage.txt sample data

chr1:496-588 44 0 chr1:496-588 44 0 chr1:496-588 44 0 chr1:496-588 44 0 chr1:496-588 44 0 chr1:496-588 44 0 chr1:496-588 43 0 chr1:496-588 43 0 chr1:496-588 39 0 chr1:496-588 39 0 chr1:496-588 38 0 chr1:496-588 37 0 chr1:126982-127030 56 0 chr1:126982-127030 56 0 chr1:126982-127030 56 0 chr1:126982-127030 56 0.0892857 chr1:126982-127030 56 0 chr1:126982-127030 56 0 chr1:126982-127030 56 0 chr1:126982-127030 56 0

This is chr1.txt sample data

chr1:496-588 0 0 chr1:496-588 0 0 chr1:496-588 0 0 chr1:496-588 0 0 chr1:496-588 0 0 chr1:496-588 0 0 chr1:496-588 43 0 chr1:496-588 43 0 chr1:496-588 39 0 chr1:496-588 39 0 chr1:496-588 38 0 chr1:496-588 0 0 chr1:496-588 0 0 chr1:496-588 39 0 chr1:496-588 0 0 chr1:648-719 0 0 chr1:648-719 0 0 chr1:648-719 0 0 chr1:648-719 0 0 chr1:648-719 0 0 chr1:648-719 0 0 chr1:648-719 0 0 chr1:648-719 0 0 chr1:648-719 0 0 chr1:720-795 0 0 chr1:720-795 0 0 chr1:720-795 0 0 chr1:720-795 0 0 chr1:720-795 0 0 chr1:720-795 0 0 chr1:720-795 0 0 chr1:720-795 0 0 chr1:720-795 0 0 chr1:720-795 0 0 chr1:720-795 0 0 chr1:126982-127030 0 0 chr1:126982-127030 56 0 chr1:126982-127030 56 0 chr1:126982-127030 56 0.0892857 chr1:126982-127030 56 0 chr1:126982-127030 56 0 chr1:126982-127030 56 0 chr1:126982-127030 56 0 chr1:126982-127030 0 0 chr1:126982-127030 0 0 chr1:126982-127030 0 0 chr1:126982-127030 0 0 chr1:126982-127030 0 0 chr1:126982-127030 0 0

This is the sample output chr1.match.txt

chr1:496-588 0 0 chr1:496-588 0 0 chr1:496-588 0 0 chr1:496-588 0 0 chr1:496-588 0 0 chr1:496-588 0 0 chr1:496-588 43 0 chr1:496-588 43 0 chr1:496-588 39 0 chr1:496-588 39 0 chr1:496-588 38 0 chr1:496-588 0 0 chr1:496-588 0 0 chr1:496-588 39 0 chr1:496-588 0 0 chr1:126982-127030 0 0 chr1:126982-127030 56 0 chr1:126982-127030 56 0 chr1:126982-127030 56 0.0892857 chr1:126982-127030 56 0 chr1:126982-127030 56 0 chr1:126982-127030 56 0 chr1:126982-127030 56 0 chr1:126982-127030 0 0 chr1:126982-127030 0 0 chr1:126982-127030 0 0 chr1:126982-127030 0 0 chr1:126982-127030 0 0 chr1:126982-127030 0 0

Also note that although the data may look unnecessarily repetitious, it is actually a cropped version of another file with more columns so that each specific line represents a different data point.

Replies are listed 'Best First'.
Re: help speeding up my perl code
by BrowserUk (Patriarch) on Aug 31, 2011 at 20:12 UTC

    You've completely missed the point of using a hash. You do not need to iterate every key to see if it contains the value you want, you just use the value as the key.

    Try this. It should (untested) produce the same output and run much, much (much:), faster:

    #!/usr/bin/perl -w # perl.ID.match.pl use strict; use warnings; open(KEY, "<chr1.coverage.txt") or die "error reading file"; my %Chr; my (@Key,@line); while (my $key = <KEY>) { chomp $key; @Key=split("\t",$key); $Chr{$Key[0]} = undef; } open(FULL, "<chr1.txt") or die "error reading file"; open(OUT, ">chr1.match.txt") or die "error reading file"; while (my $line=<FULL>) { chomp $line; @line=split("\t",$line); if( exists $Chr{ $line[0] } ) { print OUT "$line\n"; } } close KEY; close FULL; close OUT; exit;

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: help speeding up my perl code
by MidLifeXis (Monsignor) on Aug 31, 2011 at 20:12 UTC

    Instead of setting $Chr{$key} = undef, set it to 1. Then, instead of looping through each key in the hash to find a match, you can just do a if ($Chr{$line[0]}). I only glanced over your code, so I may have missed something that would stop this from working, but there is a definite red flag when you are looping over the keys in a hash and comparing each one to some value.

    --MidLifeXis

      Or, keep setting it to undef, but test with an if (exists $hash{$key})

Re: help speeding up my perl code
by jwkrahn (Abbot) on Aug 31, 2011 at 20:27 UTC

    Try something like this (UNTESTED):

    #!/usr/bin/perl # perl.ID.match.pl use strict; use warnings; open my $KEY, '<', 'chr1.coverage.txt' or die "error reading 'chr1.cov +erage.txt' because: $!"; my $pattern = join '|', map /^([^\t]+)/ ? "\Q$1\E" : (), <$KEY>; close $KEY; open my $FULL, '<', 'chr1.txt' or die "error reading 'chr1.txt' becaus +e: $!"; open my $OUT, '>', 'chr1.match.txt' or die "error reading 'chr1.match. +txt' because: $!"; while ( <$FULL> ) { print $OUT $_ if /^(?:$pattern)\t/; } close $FULL; close $OUT; exit 0;
Re: help speeding up my perl code
by thezip (Vicar) on Aug 31, 2011 at 20:10 UTC

    This looks like a job for a profiler!

    Try Devel::NYTProf on for size...


    What can be asserted without proof can be dismissed without proof. - Christopher Hitchens

      That seems like an interesting package, but I already know that the slow part is the foreach loop within the while loop.

      You do not need an over-engineered profiler to see the obvious. Try using your eyes first!

        BrowserUK,

        What may be obvious to your well-experienced eyesight might not be within the realm of experience of a relatively new Perl programmer.

        It was with humble intentions that I offered a vehicle to broaden their Perl exposure...

        Fair enough?


        What can be asserted without proof can be dismissed without proof. - Christopher Hitchens

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://923494]
Approved by davies
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others rifling through the Monastery: (3)
As of 2024-04-24 02:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found