Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid

fasta hash

by morio56 (Initiate)
on Aug 26, 2011 at 05:56 UTC ( #922520=perlquestion: print w/replies, xml ) Need Help??
morio56 has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks! I am having a problem which looks simple on paper but is proving difficult than I'd wish it to. I have a fasta file (with ids and sequences e.g >123 656 nnnnnnnnnnnnnnn) and another file with just the fasta ids (e.g 123, 234, etc) I want to search in for the ids in the fasta file and return the corresponding sequences. I was hoping to read in the the fasta file and store it in hash with keys being the ids and values the sequences then search for the keys (ids from the id files) and return the sequences. However I've not been able to achieve this as the print out only gives me the last value but all the existing keys. Something tells me I need to not only use "if exist" but also ask perl to print the corresponding value for the existing key, I haven't been able to do this. Here is my code

if(@ARGV < 2){ die "Not enough arguments\n"; } $sequence=""; $fastaID=""; open(FILE1,"$ARGV[0]") or die "No fasta file provided in command line: + $!\n"; while ($line=<FILE1>){ chomp($line); if ($line=~/^\s*$/){ next; }elsif ($line=~/^>/){ @data=split(" ",$line); $fastahash{$fastaID}=$sequence; $sequence=""; $fastaID=$data[1]; }else{ $sequence .=$line; $sequence=~s/\s*//g; } $fastahash{$fastaID}=$sequence; } open(OUT,"$ARGV[1]") or die "No output file:$!\n"; open(FILE2,"$ARGV[2]") or die "No fasta file provided in command line: + $!\n"; while($line2=<FILE2>){ chomp($line2); if (exists $fastahash{$line2}){ print "$line2\t $sequence\n"; } } exit;
and here is an example of a fasta file am using
> 2056360012 1047627436237 yyyacgagchagshgashcgahcgac acsasasasacsacsasasacaca ascassacsaascascascascac > 2056360013 1047627436238 xxxxcgagchagshgashcgahcgac acsasasasacsacsasasacaca ascassacsaascascascascac

And a sample of the fasta ids file

2056360012 2056360013

Replies are listed 'Best First'.
Re: fasta hash
by moritz (Cardinal) on Aug 26, 2011 at 06:32 UTC

    It's usually more efficient to do it the other way round: First read the file with the IDs, store the IDs in a hash, and then go through the fasta file, and print each line if its ID appears in the hash. That way you have to store less data in memory.

    Regarding your code: Use strict and warnings, and indent the code properly, for example 4 characters for each opening bracket. It actually makes code readable. See perlstyle.

    @data=split(" ",$line); $fastahash{$fastaID}=$sequence;

    This is almost certainly wrong: the hash key ($fastaID) doesn't depend on $line, so whatever it is, it's not the current ID. First assign to $fastaID, then use it as a hash key.

      Thanks. But then what will be the value to the id key since te ids file only contain ids and nothing else?

        As moritz said, 1 (or ++) is a common choice of value, but it's not necessarily the best one. It doesn't usually matter these days for 100k elements, but it's better (thanks, Liz! ;) for memory size to do something like this:

        my ($undef); while (whatever...) { .... $hash{$key} = $undef; }

        This way each element points to the same $undef value. Otherwise, each element would point to a different copy of the value 1. That's a kind of "poor man's aliasing". For bonus points, you might look at Array::RefElem or Data::Alias.

Re: fasta hash
by sundialsvc4 (Abbot) on Aug 26, 2011 at 13:42 UTC

    It is also advantageous to “sort something.”   In other words, for example, use a disk-based sort to place the file of keys that you’ll be searching for into ascending order.   This greatly improves the odds that the next search will be able to exploit cached data from the previous one, perhaps avoiding many disk I/O’s altogether.

    I/O avoidance is the name of the game here.   Disk I/O is what’s slowing you down, specifically disk-drive seeks, where the read/write head assembly must be moved from one cylinder to another.   That takes many milliseconds, and it slows your entire algorithm down to “disk drive speeds.”   This is what you, perhaps, cannot circumvent completely, but are trying to minimize.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://922520]
Approved by moritz
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (6)
As of 2018-02-17 23:43 GMT
Find Nodes?
    Voting Booth?
    When it is dark outside I am happiest to see ...

    Results (250 votes). Check out past polls.