Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

printing fasta sequences from Hash

by Bijgom (Initiate)
on Feb 24, 2015 at 16:05 UTC ( [id://1117673]=perlquestion: print w/replies, xml ) Need Help??

Bijgom has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I have a problem and I would greatly appreciate your wisdom to solve it. So, I am reading from a fasta file (containing protein seqeunces) while looping over and trying to print protein sequences. My problem is: When trying to print each protein sequence that I have in my hash, the sequences are broken down as follows:
Protein sequence:STARTXXXXX Protein Sequence: XXXXXXXXX Protein sequence: XXXXXXEND
Question: How could I avoid this? Here is my script:
use warnings; my %database; my $protein_id = ""; my $line = ""; my $datbase_filename=""; my $start=">DDB"; my $end=""; #input1: open the file containg the protein sequences. print"Please enter the path to the fasta file of protein sequences \n" +; #put the file name into $filenam $database_filename=<STDIN>; #remove newline chomp($database_filename); #open the file containing protein sequences in fasta format open(MYFILE1,"<$database_filename") or die "Unable to open $database_f +ilename: $!\n"; while(<MYFILE1>) { #remove newline chomp; $line=$_; #print "Line i: $_\n"; #test if (($line=~m/^>DDB/) and ($line=~ m/to\s\d+$/)) { $protein_id = $line; #print "protein ID: $protein_id\n";#test } else { $proteins{$protein_id} = $line;# associating the prote +in sequence in my %proteins to its ID print "sequence: $line\n";#test } } exit;
Many thanks in advance.

Replies are listed 'Best First'.
Re: printing fasta sequences from Hash
by biohisham (Priest) on Feb 25, 2015 at 03:14 UTC

    Your problem is that each fasta sequence spans multiple lines. Knowing that the fasta header is ">" you can use that to your advantage by telling your program to split around the fasta separator and slurp all that is in between. Another neater solution is achievable through Bio::SeqIO which requires BioPerl installation on your system. So for the first case, slurping each fasta record the following code gets you there with few modifications, improtantly, note the default separator variable $/was locally modified to ">"

    use strict; use warnings; use Data::Dumper; local $/=">"; while(my $fasta_record=<DATA>){ chomp $fasta_record; $fasta_record =~ s/(^>*.+\n)/$1: /; #add a space after the fasta +header $fasta_record =~ s/\n//g; # remove endlines print $fasta_record,"\n"; } __DATA__ >protein1 WWWWTCTG TTWTTTCT TTWWWC >protein2 WWWWTCTG TTWWWC TTWTTTCT >protein3 TTWWWC WWWWTCTG TTWTTTCT

    I prefer using a BioPerl module to do this kind of tasks since I can get more time to focus on the more important stuff of doing something or the other with the sequences rather than grapple with how to read them in

    Assuming that your sequences are in a fasta file called database_filename.fa

    use strict; use warnings; use Bio::SeqIO; #initialize a bioperl object instance my $in = Bio::SeqIO->new(-file=> "database_filename.fa", -format=>"fas +ta"); while(my $seq = $in->next_seq){ #read the sequences in the bioperl +object print $seq->id," ",$seq->seq; #output,ID space-separated from sequ +ences print "\n"; }


    A 4 year old monk

      Many thanks bioisham for kindly replying to my post. I will indeed install BioPerl. Hope it will make things smother. Regards, B

Re: printing fasta sequences from Hash
by Laurent_R (Canon) on Feb 24, 2015 at 18:46 UTC
    Please modify your post to put your code and data samples between opening  <code> and closing  </code> tags. This will make your post readable to human eye.

    Je suis Charlie.
Re: printing fasta sequences from Hash
by kroach (Pilgrim) on Feb 24, 2015 at 21:04 UTC
    A more readable version of the question:

    Hello Monks, I have a problem and I would greatly appreciate your wisdom to solve it. So, I am reading from a fasta file (containing protein seqeunces) while looping over and trying to print protein sequences.

    My problem is: When trying to print each protein sequence that I have in my hash, the sequences are broken down as follows:

    Protein sequence:STARTXXXXX Protein Sequence: XXXXXXXXX Protein sequence: XXXXXXEND

    Question: How could I avoid this?

    Here is my script:

    use warnings; my %database; my $protein_id = ""; my $line = ""; my $datbase_filename = ""; my $start = ">DDB"; my $end = ""; #input1: open the file containg the protein sequences. print "Please enter the path to the fasta file of protein sequences\n" +; #put the file name into $filenam $database_filename = <STDIN>; chomp($database_filename); #remove newline #open the file containing protein sequences in fasta format open(MYFILE1, "<$database_filename") or die "Unable to open $database_filename: $!\n"; while (<MYFILE1>) { chomp; #remove newline $line = $_; print "Line i: $_\n"; #test if (($line =~ m/^>DDB/) and ($line =~ m/to\s\d+$/)) { $protein_id = $line; print "protein ID: $protein_id\n"; #test } else { # associating the protein sequence in my %proteins to its ID $proteins{$protein_id} = $line; print "sequence: $line\n"; #test } } exit;
    Many thanks in advance.

      kroach: I appreciate you're trying to be helpful, but I think it would have been better if Bijgom had re-formatted the OP him- or herself.


      Give a man a fish:  <%-(-(-(-<

      Sorry guys for my mistake...And many thanks Kroach for editing my post...

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1117673]
Approved by NetWallah
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (3)
As of 2024-04-19 02:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found