PerlMonks  

renaming 1000's of FASTA files

by garyboyd (Acolyte)
on Jul 11, 2011 at 11:20 UTC
garyboyd has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I have a list of 121,000 FASTA headers in a .txt file, and I want to compare it against a multiple-FASTA file containing 132,000 sequences. I have written a script to do this, but it is far too slow. The script puts all the lines from the .txt file into a hash, then compares the header of each sequence in the multiple-FASTA file against the keys of the hash; if they match, it prints a new FASTA record with the renamed header from the .txt file. Is there anything sensible I can do to speed this up?

Input txt file looks like:

F0Z7V0F01A03EB_210
F0Z7V0F01A03EB_180
F0Z7V0F01A03EB_136
F0Z7V0F01A03EB_362
F0Z7V0F01A03GP_298
F0Z7V0F01A03GP_205
etc...........

Input FASTA looks like:

>GJKKTUG01DYDGC
GGGTATTCCTTCTCCACCTTGCAGCTAACATCAGTGTTTCGTCTACTCAAGCACGCCAAC
ACGCCCTAGAGCGCCCTGTCCAGGGGATGGCAACCAACTCTGACCCTGCAAGTGCAGCAG
ACATGAGGAATACAAACTACAATCTTTTACTTGATGATGCAATGCCGGACAAACTCTAGA
>F0Z7V0F01EDB3V
AAGGCGAGNGGTATCACGCAGTAAGTTACGGTTTTCGGGTAACGCGTCNGNGGNACTAAC
CCACGGNGGGTAACCCGTCNCTACCGGTATAGGACTAAGGTTACCGGAACGTCGTGGGGT
ACCCCCCGGACGGGGACCGTCCCCTCATANAGTCAACNGTNTGAGATGGACTAACTCAAA
CCTAGTTTCAAGTACTATTTAACTTACTTACGTTACCCGTAATTTCGGCGTTTAGAGGCG
etc....................

My script which is working VERY slowly is :

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
use Data::Dumper;

my %seq_id;
my $fasta_id;

open HEADER, "<FASTA.headers" or die $!;

while (<HEADER>) {
    chomp $_;
    #print $_."\n";
    $fasta_id = $_;
    $fasta_id =~ s/_.*//g;
    #print $fasta_id."\n";
    %seq_id = ("$fasta_id" => "$_");
    # print Dumper (\%seq_id);

    my $infile = $ARGV[0] || die("Please give me an input fasta file\n");
    my $inseq  = new Bio::SeqIO(-format => 'fasta', -file => $infile);

    while (my $seq_obj = $inseq->next_seq) {
        my $id = $seq_obj->id;
        chomp $id;
        # print $id."\n";
        my $seq = $seq_obj->seq;

        if (exists($seq_id{$id})) {
            print ">";
            print $seq_id{$fasta_id};
            print "\n" . $seq . "\n";
        }
    }
}

Re: renaming 1000's of FASTA files
by moritz (Cardinal) on Jul 11, 2011 at 11:43 UTC
    Is there anything sensible I can do to speed this up?

    Stop reading the same file over and over again. Maybe something like this can help you (untested):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Bio::SeqIO;
    use Data::Dumper;

    my %seq_id;
    open HEADER, "<FASTA.headers" or die $!;
    while (<HEADER>) {
        chomp $_;
        my $fasta_id = $_;
        $fasta_id =~ s/_.*//g;
        $seq_id{$fasta_id} => $_;
    }

    my $infile = $ARGV[0] || die("Please give me an input fasta file\n");
    my $inseq  = new Bio::SeqIO(-format => 'fasta', -file => $infile);
    while (my $seq_obj = $inseq->next_seq) {
        my $id = $seq_obj->id;
        chomp $id;
        my $seq = $seq_obj->seq;
        if (exists($seq_id{$id})) {
            print ">";
            print $seq_id{$fasta_id};
            print "\n" . $seq . "\n";
        }
    }
      Shouldn't this code:

      if (exists ($seq_id{$id})) {
          print ">";
          print $seq_id{$fasta_id};
          print "\n".$seq."\n";
      }

      be

      if (exists ($seq_id{$id})) {
          print ">";
          print $seq_id{$id};
          print "\n".$seq."\n";
      }
      and

      $seq_id{$fasta_id} => $_;

      be

      $seq_id{$fasta_id} = $_;

      Update: Corrected print $seq_id{$id}; from print $id
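Putting those two fixes together, the hash-building and lookup logic might look like this. This is only a sketch: it uses a few hypothetical in-memory header lines in place of the real 121,000-line file, and omits Bio::SeqIO so that just the hash logic is visible.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical header lines, standing in for the real FASTA.headers file.
my @header_lines = ('F0Z7V0F01A03EB_210', 'GJKKTUG01DYDGC_5');

# Build the lookup hash ONCE: bare ID => full renamed header.
my %seq_id;
for my $line (@header_lines) {
    my $fasta_id = $line;
    $fasta_id =~ s/_.*//;            # strip the _NNN suffix
    $seq_id{$fasta_id} = $line;      # '=' assigns; a bare '=>' does nothing useful
}

# Look up each sequence ID against the hash.
for my $id ('F0Z7V0F01A03EB', 'UNKNOWN0000000') {
    if (exists $seq_id{$id}) {       # use $id here, not $fasta_id
        print ">$seq_id{$id}\n";     # prints >F0Z7V0F01A03EB_210
    }
}
```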

Re: renaming 1000's of FASTA files
by Anonymous Monk on Jul 11, 2011 at 12:05 UTC
    If the total data size will sensibly fit in memory at the same time, then you can use a hash as you are doing now. Otherwise, sort the files identically and write code that compares the two sorted files. Or ... put the data into an SQLite database file, which is a flat file requiring no server at all, and use queries.
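The sorted-files idea above can be sketched as a single-pass merge over two sorted ID lists. The IDs here are made up for illustration, and the lists are assumed to be pre-sorted (e.g. with the Unix sort utility).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two sorted ID lists: one from the .txt headers, one from the FASTA file.
my @headers   = sort qw(A0001 B0002 C0003);
my @sequences = sort qw(B0002 C0003 D0004);

# Walk both lists together, advancing whichever side is "behind".
my ($i, $j) = (0, 0);
my @matches;
while ($i < @headers && $j < @sequences) {
    my $cmp = $headers[$i] cmp $sequences[$j];
    if    ($cmp < 0) { $i++ }                       # header ID not in FASTA
    elsif ($cmp > 0) { $j++ }                       # FASTA ID not in headers
    else             { push @matches, $headers[$i]; $i++; $j++ }
}
print "$_\n" for @matches;                          # B0002, C0003
```

This touches each list once, so memory stays constant no matter how large the files are.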

      Thanks anonymous monk, would using index instead of the regex speed things up noticeably? I decided to split the input file and run multiple processes.
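For what it's worth, the suffix-stripping can be done either way; a small sketch comparing the regex with an index/substr version (the sample header is hypothetical). With only ~121,000 lines, though, neither is likely to be the real bottleneck compared to re-reading the FASTA file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $header = 'F0Z7V0F01A03EB_210';

# Regex version (as in the original script):
(my $key_re = $header) =~ s/_.*//;

# index/substr version: find the first '_' and keep everything before it.
my $pos    = index $header, '_';
my $key_ix = $pos >= 0 ? substr($header, 0, $pos) : $header;

print "$key_re $key_ix\n";   # both print F0Z7V0F01A03EB
```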

Re: renaming 1000's of FASTA files
by osbosb (Monk) on Jul 11, 2011 at 12:50 UTC
    One thing you should always make sure of: for every file handle you open, you should have a corresponding close. It's good practice. Sorry I couldn't be more helpful.

      How is this good practice in Perl?

      I'm genuinely asking this. I have found some good practices of my own; for example, every if block should also have an else block, even if that block just dies, because in the long run I will need code in that block anyway. But I don't see what bugs closing a file prevents, or what diagnostics it helps to create. I neglect checking the result of close(), but that's about the only "benefit" I see.

        But I don't see what bugs closing a file prevents

        Running out of file descriptors?

        $ cat /proc/sys/fs/file-max
        386943

        citromatik

        I neglect checking the result of close(), but that's about the only "benefit" I see.

        close() is where you get the most useful information error-checking wise, especially on writing. Some people I've worked with have even gone as far as suggesting not to error check the opens or writes, just checking close() for efficiency and completeness.

        --Dave
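The point about close() is that output is buffered, so a failed write (disk full, broken NFS mount) may only surface when the buffer is flushed at close time. A minimal sketch, using the core File::Temp module for a throwaway output file:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write one FASTA record and check BOTH the write and the close.
my ($fh, $filename) = tempfile();
print {$fh} ">F0Z7V0F01A03EB_210\nACGT\n"
    or die "write to $filename failed: $!";
close $fh
    or die "close of $filename failed: $!";   # catches deferred write errors
print "wrote $filename OK\n";
```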

Re: renaming 1000's of FASTA files
by Cristoforo (Deacon) on Jul 11, 2011 at 16:53 UTC
    The script puts all the lines from the .txt file in a hash

    It just keeps creating a new one-element hash:

    %seq_id = ("$fasta_id" => "$_");

    you would need to say instead
    $seq_id{ $fasta_id } = $_;

    However, moritz has shown the better solution (instead of reopening the file over and over).

Re: renaming 1000's of FASTA files
by tokpela (Chaplain) on Jul 11, 2011 at 17:29 UTC

    I would also add to the above posts that if you start to run into memory issues with the hash you are using, you could use DBM::Deep.

Re: renaming 1000's of FASTA files
by sundialsvc4 (Monsignor) on Jul 11, 2011 at 20:30 UTC

    I agree that SQLite is quite probably “the thing to do” here. I would put each of the sequences into a table, along with (in a separate column) whatever key value you happen to be looking for ... whatever (I’m not a biologist...) “identifies” the sequence, whether or not it is unique.

    If there are many keys that might identify a particular sequence, “okay, big deal, you have a many-to-one relationship.”

    Anyhow ... SQLite gives you a quite robust SQL implementation, in a single file, without a server. (Really, the only “gotcha” is transactions: when you are changing data, you need to be inside a transaction while doing so, or you’ll really regret it speed-wise.)

    Now, build non-unique indexes.

    Once you have the data in this database form, I suspect that a lot of your “difficult processing” will be reduced to SQL queries that you might not even need a Perl program to run.

    SQLite is quite the tool ... I really beat the heck out of it once, or tried to. It didn’t flinch. 31 million rows at one point, and it still wasn’t flinching.
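A sketch of that approach, using DBI with the DBD::SQLite driver (both are non-core CPAN modules); the table and column names are made up for illustration, and an in-memory database stands in for a real file:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;   # requires DBI and DBD::SQLite from CPAN

# Connect; use a file path instead of :memory: for a persistent database.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do('CREATE TABLE headers (bare_id TEXT, full_id TEXT)');
$dbh->do('CREATE INDEX idx_bare ON headers (bare_id)');   # non-unique index

# One transaction around the bulk insert, or per-row commits will crawl.
$dbh->begin_work;
my $sth = $dbh->prepare('INSERT INTO headers (bare_id, full_id) VALUES (?, ?)');
for my $line ('F0Z7V0F01A03EB_210', 'F0Z7V0F01A03EB_180') {
    (my $bare = $line) =~ s/_.*//;
    $sth->execute($bare, $line);
}
$dbh->commit;

# The matching then becomes a plain indexed query.
my $rows = $dbh->selectall_arrayref(
    'SELECT full_id FROM headers WHERE bare_id = ?',
    undef, 'F0Z7V0F01A03EB');
print "$_->[0]\n" for @$rows;   # both suffixed IDs come back
```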

Re: renaming 1000's of FASTA files
by Cristoforo (Deacon) on Jul 11, 2011 at 21:13 UTC
    F0Z7V0F01A03EB_210 F0Z7V0F01A03EB_180 F0Z7V0F01A03EB_136 F0Z7V0F01A03EB_362
    After closer examination of the text file with IDs, I see that the same ID has 4 different endings: _210, _180, _136, _362. Is this so, or is it only how you provided the sample input file? If it is this way, how would you choose which one to use as the ID for F0Z7V0F01A03EB?

      Yes, that's correct: in the .txt file there are IDs with 4 different endings, so the script would scan the multiple FASTA file and generate (in this example) 4 files with headers:

      F0Z7V0F01A03EB_210
      F0Z7V0F01A03EB_180
      F0Z7V0F01A03EB_136
      F0Z7V0F01A03EB_362

      and the sequence would be the same in all of them, i.e. the sequence for F0Z7V0F01A03EB from the FASTA file.

        You realize you have a problem here?

        You keep a hash whose key is the ID without the "_xxx" ending and whose value is the ID with the ending. But a hash element can hold only ONE value, so you overwrite the older entries with the newer ones, unless you take care to store the IDs in an array and store the arrayref as the value of the hash element.

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
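The arrayref approach described above can be sketched like this (the IDs are the ones from the sample data):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Store ALL suffixed IDs per bare ID: push each full header onto an
# array reference keyed by the bare ID, instead of overwriting a scalar.
my %seq_id;
for my $line (qw(F0Z7V0F01A03EB_210 F0Z7V0F01A03EB_180
                 F0Z7V0F01A03EB_136 F0Z7V0F01A03EB_362)) {
    (my $bare = $line) =~ s/_.*//;
    push @{ $seq_id{$bare} }, $line;   # autovivifies the arrayref
}

# Later, one sequence can be written out once per stored header:
for my $full (@{ $seq_id{'F0Z7V0F01A03EB'} }) {
    print ">$full\n";                  # four headers, same sequence
}
```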

Node Type: perlquestion [id://913678]
Approved by Corion
Front-paged by Corion