http://www.perlmonks.org?node_id=1050422


in reply to Searching array against hash

I feel that there should be a more efficient way

There is. Your current process does roughly 900,000 * 60,000 = 54 billion comparisons.

If you reverse the logic of your lookups, reading the list file into a hash first, you can then read the fasta file line by line and look up each header in the hash. That turns it into an O( 900,000 + 60,000 ) process: under a million steps instead of 54 billion.

Try this (untested) version which should run close to 60,000 times faster:

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;

use FAlite;

die "usage: $0 <fasta> <list>\n" unless @ARGV == 2;

# Read the list file into a hash keyed by header.
my %list;
open(LIST, $ARGV[1]) or die;
while (<LIST>) {
    next if /^#/;
    my ($header, $score) = split;
    $list{ $header } = $score;
}
close LIST;

# Scan the fasta file once, printing entries whose header is in the hash.
open(FASTA, $ARGV[0]) or die;
my $fasta = FAlite->new(\*FASTA);
while (my $entry = $fasta->nextEntry) {
    if( exists $list{ $entry->def } ) {
        print $entry->def;
        print $entry->seq;
    }
}
close FASTA;
__END__

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: Searching array against hash
by drhicks (Novice) on Aug 21, 2013 at 21:54 UTC

    Wow, thanks, it only takes a couple of seconds to finish! I had attempted to do the same thing, but could never get it working, and wasn't sure if/how it would actually increase the speed. Thanks again

      The "if/how" is this:

      Your first solution had an outer loop that runs 900,000 times. Inside that outer loop, there's an inner loop that runs 60,000 times. 60k * 900k is 54 billion total iterations inside the inner loop.

      The proposed solution creates a hash of 60,000 elements. Then your 900,000-line file is read line by line. Inside that loop, which iterates 900,000 times, there's a single hash lookup, which is almost free. So it takes 60,000 iterations to build the hash, plus 900,000 iterations to test each line of the FASTA file. The total amount of work is, therefore, 960,000 iterations.

      Think of loops inside of loops as doing n*m amount of work, whereas loops followed by loops (no nesting) do n+m amount of work. Anytime you have the choice of an algorithm where the order of growth is the mathematical product of two large numbers, or an algorithm where the growth rate is the mathematical sum of the same two numbers, your sense of economic utility should be telling you that the latter will scale up better.
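      The contrast between the two shapes can be sketched in a few lines of Perl. This is a toy illustration with made-up data (5 wanted headers, 20 lines to scan), not the thread's actual script; at the OP's scale the nested version does n*m comparisons while the hash version does n+m steps:

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Hypothetical data standing in for the list file and the fasta headers.
      my @wanted = map { "id$_" } 1 .. 5;
      my @lines  = map { "id$_" } 1 .. 20;

      # Nested loops: n * m comparisons (the 54-billion shape).
      my $found_nested = 0;
      for my $line (@lines) {
          for my $want (@wanted) {
              $found_nested++ if $line eq $want;
          }
      }

      # Loop after loop: m steps to build the hash, then n O(1) lookups.
      my %want = map { $_ => 1 } @wanted;
      my $found_hash = 0;
      for my $line (@lines) {
          $found_hash++ if exists $want{$line};
      }

      print "nested: $found_nested, hash: $found_hash\n";   # nested: 5, hash: 5

      Both versions find the same 5 matches; only the amount of work done to find them differs.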


      Dave