http://www.perlmonks.org?node_id=922520

morio56 has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks! I am having a problem which looks simple on paper but is proving difficult than I'd wish it to. I have a fasta file (with ids and sequences e.g >123 656 nnnnnnnnnnnnnnn) and another file with just the fasta ids (e.g 123, 234, etc) I want to search in for the ids in the fasta file and return the corresponding sequences. I was hoping to read in the the fasta file and store it in hash with keys being the ids and values the sequences then search for the keys (ids from the id files) and return the sequences. However I've not been able to achieve this as the print out only gives me the last value but all the existing keys. Something tells me I need to not only use "if exist" but also ask perl to print the corresponding value for the existing key, I haven't been able to do this. Here is my code

if(@ARGV < 2){ die "Not enough arguments\n"; } $sequence=""; $fastaID=""; open(FILE1,"$ARGV[0]") or die "No fasta file provided in command line: + $!\n"; while ($line=<FILE1>){ chomp($line); if ($line=~/^\s*$/){ next; }elsif ($line=~/^>/){ @data=split(" ",$line); $fastahash{$fastaID}=$sequence; $sequence=""; $fastaID=$data[1]; }else{ $sequence .=$line; $sequence=~s/\s*//g; } $fastahash{$fastaID}=$sequence; } open(OUT,"$ARGV[1]") or die "No output file:$!\n"; open(FILE2,"$ARGV[2]") or die "No fasta file provided in command line: + $!\n"; while($line2=<FILE2>){ chomp($line2); if (exists $fastahash{$line2}){ print "$line2\t $sequence\n"; } } exit;
and here is an example of a fasta file am using
> 2056360012 1047627436237 yyyacgagchagshgashcgahcgac acsasasasacsacsasasacaca ascassacsaascascascascac > 2056360013 1047627436238 xxxxcgagchagshgashcgahcgac acsasasasacsacsasasacaca ascassacsaascascascascac

And a sample of the fasta ids file

2056360012 2056360013

Replies are listed 'Best First'.
Re: fasta hash
by moritz (Cardinal) on Aug 26, 2011 at 06:32 UTC

    It's usually more efficient to do it the other way round: First read the file with the IDs, store the IDs in a hash, and then go through the fasta file, and print each line if its ID appears in the hash. That way you have to store less data in memory.

    Regarding your code: Use strict and warnings, and indent the code properly, for example 4 characters for each opening bracket. It actually makes code readable. See perlstyle.

    @data=split(" ",$line); $fastahash{$fastaID}=$sequence;

    This is almost certainly wrong: the hash key ($fastaID) doesn't depend on $line, so whatever it is, it's not the current ID. First assign to $fastaID, then use it as a hash key.

      Thanks. But then what will be the value to the id key since te ids file only contain ids and nothing else?

        As moritz said, 1 (or ++) is a common choice of value, but it's not necessarily the best one. It doesn't usually matter these days for 100k elements, but it's better (thanks, Liz! ;) for memory size to do something like this:

        my ($undef); while (whatever...) { .... $hash{$key} = $undef; }

        This way each element points to the same $undef value. Otherwise, each element would point to a different copy of the value 1. That's a kind of "poor man's aliasing". For bonus points, you might look at Array::RefElem or Data::Alias.

A reply falls below the community's threshold of quality. You may see it by logging in.