PerlMonks  

PROTEIN FILE help me pleaseee

by serafinososi
on Jan 30, 2013 at 10:52 UTC ( #1016025=perlquestion )
serafinososi has asked for the wisdom of the Perl Monks concerning the following question:

I need some help pleaseeeee: the question is...

Develop a Perl program that asks the user to type the name of a file in the working directory (containing one protein sequence in FASTA format) and prints the number of Phenylalanine aminoacids in the sequence.

- It prints a message, prompting the user to insert the file name.
- It reads the file name, assigning it to a variable.
- It opens the file with that name for reading. If the file does not exist, it gives an error message.
- A scalar variable that will contain the sequence is initialized to the empty string “”.
- The program reads the lines of the file one after the other (with a while loop), assigning each line read to a variable.
- If the line starts with “>” (it is the first line of a FASTA file) the line is not considered. Else all the white spaces are removed from the line and the line is appended to the sequence.
- When all the lines have been read, the while loop terminates and the program continues by closing the file.
- Using the translate command, the program counts the number of F in the sequence, assigning it to a variable.
- It prints the message: “The aminoacid sequence ” followed by the variable containing the read sequence, “ contains ” followed by the variable with the number of F, followed by “Phenylalanine aminoacids” followed by a new line.

This is the "protein file":

>gi|403369491|gb|EJY84591.1| Transcriptional regulator, Sir2 family protein Oxytricha trifallax
MMKQLIKHNKNTPLFNFLRVKFSSTAATIQTQQTVNKPIESKFKEEKLDNYHDIYEKSKRLAEQISQSKS
FICFTGAGLSTSTGIPDYRSTSNTLAQTGAGAYELEISEEDKKSKTRQIRSQVQRAKPSISHMALHALME
NGYLKHLISQNTDGLHLKSGIPYQNLTELHGNTTVEYCKSCSKIYFRDFRCRSSEDPYHHLTGRQCEDLK
CGGELADEIVHFGESIPKDKLVEALTAASQSDLCLTMGTSLRVKPANQIPIQTIKNKGQLAIVNLQYTPF
DEIAQIRMHSFTDQVLEIVCQELNIKIPEYQMKRRIHIIRNAETNEIVVYGSYGNHKNIKLSFMQRMEYI
DNKNHVYLALDKEPFHIIPDYFNFQNINTDQEEVEFRIHFYGHNSEPYFQLTLPRQSILELQAGEHLICD
ITFDYDKLEWK

I wrote this but it doesn't work:

#!/usr/bin/perl -w
print "Please type the file name of the protein sequence data: ";
$proteinfilename = <STDIN>;
unless ( open (PROTEINFILE, $proteinfilename) ) {
print "File \"$proteinfilename\" doesn\'t seem to exist!!\n";
}
$protein = <PROTEINFILE> ;
$empty = " " , (<PROTEINFILE>);
while ( $protein = <PROTEINFILE> ) {
if ( $protein =~ /^>/ ) {
next $protein;
} else {
$protein =~ s/$empty//g ;
$protein = join ($protein, @protein);
print $protein;
}
}

Update: Thx

Re: PROTEIN FILE help me pleaseee (homework)
by Anonymous Monk on Jan 30, 2013 at 11:08 UTC
Re: PROTEIN FILE help me pleaseee
by choroba (Abbot) on Jan 30, 2013 at 12:19 UTC
    What do you mean by "doesn't work"? Is the number of F's too big? Does the program output any error messages?

    When glaring at your script, I can see that

    • you do not chomp the $proteinfilename after reading it from STDIN. The real file name probably does not contain a newline at the end.
    • if the open is not successful, your program continues. Are you really interested in the results, if the file cannot be found? Use the idiom
      open my $FH, '<', $filename or die "Cannot open: $!";
    • next takes a label as a parameter. There are no labels in your code.
    • you do not indent the code. Once you are advanced enough to use loops, please also learn to indent the code. Without indentation, your code becomes a write-only, unreadable mess for humans.
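
Taken together, the first three points might look like the sketch below. It is only one way to arrange things: the sub name `read_fasta_sequence` is made up for illustration and does not come from the original script, and it assumes a file with a single FASTA record.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): read a single-record
# FASTA file and return its sequence as one string.
sub read_fasta_sequence {
    my ($filename) = @_;
    chomp $filename;    # input typed at a prompt still carries its newline

    # Three-argument open with a lexical handle; stop right away on failure.
    open my $FH, '<', $filename or die "Cannot open '$filename': $!";

    my $sequence = '';
    while ( my $line = <$FH> ) {
        next if $line =~ /^>/;    # header line: a plain next, no label needed
        $line =~ s/\s+//g;        # drop all whitespace, including the newline
        $sequence .= $line;       # append to what has been read so far
    }
    close $FH;
    return $sequence;
}
```

After printing the prompt, it could be called as `my $sequence = read_fasta_sequence( scalar <STDIN> );`.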
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: PROTEIN FILE help me pleaseee
by 2teez (Priest) on Jan 30, 2013 at 14:58 UTC

    Hi serafinososi,

    ...If the line starts with “>” (it is the first line of a FASTA file) the line is not considered...

    What happens if more than one line in the file starts with ">"?

    If I understand the OP's question correctly, then, using the data provided and adding to what others have said, I would suggest Perl's split function, like so:

    use warnings;
    use strict;

    my $protein = '';
    while (<DATA>) {
        if (/^>/) {
            next;
        }
        else {
            # append each cleaned line; a plain = would keep only the last one
            $protein .= join '', split;
        }
    }
    my $number_of_F = grep { /F/ } split //, $protein;
    print "The aminoacid sequence: ", $protein, " contains ", $number_of_F,
        " Phenylalanine aminoacids", $/;

    __DATA__
    >gi|403369491|gb|EJY84591.1| Transcriptional regulator, Sir2 family protein Oxytricha trifallax
    MMKQLIKHNKNTPLFNFLRVKFSSTAATIQTQQTVNKPIESKFKEEKLDNYHDIYEKSKRLAEQISQSKS
    FICFTGAGLSTSTGIPDYRSTSNTLAQTGAGAYELEISEEDKKSKTRQIRSQVQRAKPSISHMALHALME
    NGYLKHLISQNTDGLHLKSGIPYQNLTELHGNTTVEYCKSCSKIYFRDFRCRSSEDPYHHLTGRQCEDLK
    CGGELADEIVHFGESIPKDKLVEALTAASQSDLCLTMGTSLRVKPANQIPIQTIKNKGQLAIVNLQYTPF
    DEIAQIRMHSFTDQVLEIVCQELNIKIPEYQMKRRIHIIRNAETNEIVVYGSYGNHKNIKLSFMQRMEYI
    DNKNHVYLALDKEPFHIIPDYFNFQNINTDQEEVEFRIHFYGHNSEPYFQLTLPRQSILELQAGEHLICD
    ITFDYDKLEWK
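
On the question of several ">" lines: one possible approach, sketched here under the assumption that each ">" line starts a new record, is to keep one sequence per header. The sub name `parse_fasta_records` and the sample data are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every record of a FASTA text into a hash
# keyed by the header line (without the leading ">").
sub parse_fasta_records {
    my ($text) = @_;
    my ( %sequence_for, $current_header );
    for my $line ( split /\n/, $text ) {
        if ( $line =~ /^>(.*)/ ) {
            $current_header = $1;                  # a new record starts here
            $sequence_for{$current_header} = '';
        }
        elsif ( defined $current_header ) {
            $line =~ s/\s+//g;                     # strip whitespace
            $sequence_for{$current_header} .= $line;
        }
    }
    return \%sequence_for;
}

# Made-up two-record input, counted per record.
my $records = parse_fasta_records(">first\nMKF\nFA\n>second\nGGF\n");
for my $header ( sort keys %$records ) {
    my $count = ( $records->{$header} =~ tr/F// );
    print "$header: $count\n";
}
```

With the sample input this reports 2 for "first" and 1 for "second".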

    If you tell me, I'll forget.
    If you show me, I'll remember.
    if you involve me, I'll understand.
    --- Author unknown to me
Re: PROTEIN FILE help me pleaseee
by pvaldes (Chaplain) on Jan 30, 2013 at 15:13 UTC

    Deliberately incomplete, after all this is (your) homework, but you have some of your problems solved here... only because Oxytricha is a cute green thing

    #!/usr/bin/perl -w
    use strict;
    my @protein = ();
    my $phenylcounter = 0;
    open (my $PROTEFILE, '<', $ARGV[0]) or die $!; # If the file is not existing it gives an error message
    # yup, I avoid the <STDIN> idea, solved by you.
    while (my $line = <$PROTEFILE>) { # We read the lines of the file
        next if $line =~ m/^>/; # If the line starts with ">" is not considered.
        $line =~ s/\s//g; # all white spaces removed from the line
        # Using the translate command, the program counts the number of F in the sequence, assigning it to a variable.
        # left for you... tr/F//... $phenylcounter++}
    } # the while loop terminates
    close $PROTEFILE; # and the program continues closing the file.
    print "The aminoacid sequence ", $ARGV[0], " contains ", $phenylcounter, " Phenylalanine aminoacids";
    print "\n"; # followed by new line

    Updated: ($phenilcounter != $phenylcounter), fixed now
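
For the counting step hinted at above, a minimal sketch follows. It assumes the whole sequence has already been collected into one string; the sub name `count_phenylalanine` is hypothetical. Note that tr/F// is case-sensitive, so a lower-case "f" would not be counted.

```perl
use strict;
use warnings;

# Hypothetical helper: count the phenylalanine (F) residues in a sequence.
sub count_phenylalanine {
    my ($sequence) = @_;
    # tr/F// with an empty replacement list leaves the string unchanged
    # and returns the number of F characters it found.
    my $count = ( $sequence =~ tr/F// );
    return $count;
}

print count_phenylalanine('MMKQLIKHNKNTPLFNFLRVKFSS'), "\n";    # prints 3
```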

Re: PROTEIN FILE help me pleaseee
by Kenosis (Priest) on Jan 30, 2013 at 16:59 UTC

    You've been given excellent suggestions (!):

    • choroba shows the three-argument open, including using lexically-scoped variables (don't use global variables for file opens).
    • 2teez shows a crucial first step: always use strict; use warnings;
    • pvaldes imparted a hint on counting the number of "F"s in the sequence

    Not to confuse matters here, but for your future reference (since you appear to be on a bioinformatics path), consider becoming well acquainted with Bio::SeqIO and its set of related modules.

    Just as there are well-developed modules to parse HTML, XML, and CSV files, Bio::SeqIO lives to do the same for Fasta and other such formats.

    For example, to retrieve and process each sequence within a Fasta file, you can do the following:

    use strict;
    use warnings;
    use Bio::SeqIO;

    print "Please type the file name of the protein sequence data: ";
    chomp( my $proteinfilename = <STDIN> );
    print "\n";

    my $fastaIN = Bio::SeqIO->new( -file => $proteinfilename, -format => 'Fasta' );

    while ( my $seq = $fastaIN->next_seq() ) {
        print $seq->seq, "\n";
    }

    Each sequence in the Fasta file is accessible using the $seq->seq notation. The first part before the arrow operator is the sequence object; the part after the arrow operator is the method. These methods are covered in detail in the Bio::Seq module's documentation. In the example above, the sequence is printed, but a character count could be done there, too.
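
As a rough illustration of doing the character count inside that loop, the sketch below returns the number of "F" residues per record. It assumes BioPerl is installed; the sub name `count_f_per_record` is made up for this example, while `next_seq`, `seq` and `display_id` are the BioPerl methods it relies on.

```perl
use strict;
use warnings;

# Hypothetical helper: return [display_id, F count] pairs, one per record.
# Assumes BioPerl (Bio::SeqIO) is installed.
sub count_f_per_record {
    my ($fasta_file) = @_;
    require Bio::SeqIO;    # loaded on first use
    my $fasta_in = Bio::SeqIO->new( -file => $fasta_file, -format => 'Fasta' );
    my @counts;
    while ( my $seq = $fasta_in->next_seq() ) {
        my $residues = $seq->seq;               # the sequence as a plain string
        my $count = ( $residues =~ tr/F// );    # number of F characters
        push @counts, [ $seq->display_id, $count ];
    }
    return @counts;
}
```

Each returned pair can then be printed, for example as `print "$_->[0]: $_->[1]\n" for count_f_per_record($proteinfilename);`.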

    Hope this helps!

Re: PROTEIN FILE help me pleaseee
by Anonymous Monk on Jan 30, 2013 at 23:06 UTC
    Please don't remove the original text of a comment.

      Please don't remove the original text of a comment.

      But if he keeps it, then teacher will know he cheated

        hi! I have the same problem and i'm going crazy. HELP ME! My perl program is:

        #!/usr/bin/perl -w
        print "Please type the file name of the protein sequence data: ";
        $proteinfilename = <STDIN>;
        chomp $proteinfilename;
        unless ( open (PROTEINFILE, $proteinfilename) ) {
        print "File \"$proteinfilename\" doesn\'t seem to exist!!\n";
        }
        $protein = <PROTEINFILE> ;
        $empty = " " , (<PROTEINFILE>);
        while ( $protein = <PROTEINFILE> ) {
        if ( $protein =~ /^>/ ) {
        next $protein;
        } else {
        $protein =~ s/\s//g ;
        $union = join ($empty, $protein);
        }
        };
        close PROTEINFILE ;
        $quantif = $union;
        $count = ($quantif =~ tr/F//);
        print "The aminoacid sequence:\n$union\ncontains $count Tryptophan aminoacids\n\n";

        the problem is the count!!!! thank you Alessandra
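
A likely cause of the wrong count, going only by the code as posted: join with a single item returns that item unchanged, so $union is replaced on every pass through the loop and ends up holding just the last line of the file. The sketch below contrasts that with appending. The sample lines are made up, apart from the last one, which is the final line of the protein file above.

```perl
use strict;
use warnings;

my @lines = ( "MKFF\n", "AFG\n", "ITFDYDKLEWK\n" );    # made-up sample lines

# As in the posted loop: join() over a one-element list returns that
# element, so $union is overwritten on every iteration.
my $union;
for my $protein (@lines) {
    ( my $clean = $protein ) =~ s/\s//g;
    $union = join( " ", $clean );
}
my $count_last_only = ( $union =~ tr/F// );

# Appending with .= keeps every line instead.
my $sequence = '';
for my $protein (@lines) {
    ( my $clean = $protein ) =~ s/\s//g;
    $sequence .= $clean;
}
my $count_all = ( $sequence =~ tr/F// );

print "overwritten: $union ($count_last_only F)\n";
print "appended:    $sequence ($count_all F)\n";
```

Here the overwritten version sees 1 F and the appended version sees 4. Separately, the printed message says "Tryptophan" although F is Phenylalanine.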

Node Type: perlquestion [id://1016025]
Approved by 2teez