serafinososi has asked for the
wisdom of the Perl Monks concerning the following question:
I need some help pleaseeeee: the question is...
Develop a Perl program that asks the user to type the name of a file in the working directory (containing one protein sequence in FASTA
format) and prints the number of Phenylalanine amino acids in the
sequence.
-It prints a message, prompting the user to insert the file name.
-It reads the file name assigning it to one variable.
-It opens the file with that name for reading. If the file does not
exist, it gives an error message.
-A scalar variable that will contain the sequence is initialized to
the empty string “”.
-The program reads the lines of the file one after the other (with a
while loop), assigning each line read to a variable.
- If the line starts with “>” (it is the first line of a FASTA file)
the line is skipped.
Otherwise all white space is removed from the line and the line is
appended to the sequence.
-When all the lines have been read, the while loop terminates and the
program continues by closing the file.
-Using the translate command, the program counts the number of F in
the sequence, assigning it to a variable.
-It prints the message: “The aminoacid sequence ” followed by the
variable containing the read sequence “ contains “ followed by the
variable with the number of F, followed by “Phenylalanine aminoacids”
followed by new line.
This is the "protein file":
>gi|403369491|gb|EJY84591.1| Transcriptional regulator, Sir2 family protein Oxytricha trifallax
MMKQLIKHNKNTPLFNFLRVKFSSTAATIQTQQTVNKPIESKFKEEKLDNYHDIYEKSKRLAEQISQSKS
FICFTGAGLSTSTGIPDYRSTSNTLAQTGAGAYELEISEEDKKSKTRQIRSQVQRAKPSISHMALHALME
NGYLKHLISQNTDGLHLKSGIPYQNLTELHGNTTVEYCKSCSKIYFRDFRCRSSEDPYHHLTGRQCEDLK
CGGELADEIVHFGESIPKDKLVEALTAASQSDLCLTMGTSLRVKPANQIPIQTIKNKGQLAIVNLQYTPF
DEIAQIRMHSFTDQVLEIVCQELNIKIPEYQMKRRIHIIRNAETNEIVVYGSYGNHKNIKLSFMQRMEYI
DNKNHVYLALDKEPFHIIPDYFNFQNINTDQEEVEFRIHFYGHNSEPYFQLTLPRQSILELQAGEHLICD
ITFDYDKLEWK
I wrote this, but it doesn't work:
#!/usr/bin/perl -w
print "Please type the file name of the protein sequence data: ";
$proteinfilename = <STDIN>;
unless ( open (PROTEINFILE, $proteinfilename) ) {
print "File \"$proteinfilename\" doesn\'t seem to exist!!\n";
}
$protein = <PROTEINFILE> ;
$empty = " " , (<PROTEINFILE>);
while ( $protein = <PROTEINFILE> ) {
if ( $protein =~ /^>/ ) {
next $protein;
} else {
$protein =~ s/$empty//g ;
$protein = join ($protein, @protein);
print $protein;
}
}
Update: Thx
Re: PROTEIN FILE help me pleaseee (homework) by Anonymous Monk on Jan 30, 2013 at 11:08 UTC |
Re: PROTEIN FILE help me pleaseee by choroba (Prior) on Jan 30, 2013 at 12:19 UTC |
What do you mean by "doesn't work"? Is the number of F's too big? Does the program output any error messages?
When glaring at your script, I can see that
Re: PROTEIN FILE help me pleaseee by 2teez (Chaplain) on Jan 30, 2013 at 14:58 UTC |
Hi serafinososi,
...If the line starts with “>” (it is the first line of a FASTA file) the line is not considered...
What happens if more than one line in the file starts with ">"?
If I understand the OP's question correctly, then with the data provided (and adding to what others have said), Perl's split function can do the job, like so:
use warnings;
use strict;
my $protein = '';
while (<DATA>) {
    if (/^>/) {
        next;    # skip the FASTA header line
    }
    else {
        # split on white space, rejoin, and append to the sequence so far
        $protein .= join '', split;
    }
}
# grep in scalar context returns the number of matching characters
my $number_of_F = grep { /F/ } split //, $protein;
print "The aminoacid sequence: ", $protein, " contains ", $number_of_F,
    " Phenylalanine aminoacids", $/;
__DATA__
>gi|403369491|gb|EJY84591.1| Transcriptional regulator, Sir2 family protein Oxytricha trifallax
MMKQLIKHNKNTPLFNFLRVKFSSTAATIQTQQTVNKPIESKFKEEKLDNYHDIYEKSKRLAEQISQSKS
FICFTGAGLSTSTGIPDYRSTSNTLAQTGAGAYELEISEEDKKSKTRQIRSQVQRAKPSISHMALHALME
NGYLKHLISQNTDGLHLKSGIPYQNLTELHGNTTVEYCKSCSKIYFRDFRCRSSEDPYHHLTGRQCEDLK
CGGELADEIVHFGESIPKDKLVEALTAASQSDLCLTMGTSLRVKPANQIPIQTIKNKGQLAIVNLQYTPF
DEIAQIRMHSFTDQVLEIVCQELNIKIPEYQMKRRIHIIRNAETNEIVVYGSYGNHKNIKLSFMQRMEYI
DNKNHVYLALDKEPFHIIPDYFNFQNINTDQEEVEFRIHFYGHNSEPYFQLTLPRQSILELQAGEHLICD
ITFDYDKLEWK
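On the question of a file with more than one ">" line: a minimal sketch, assuming each ">" line starts a new record and every following line belongs to it until the next ">". The helper name read_fasta_records and the two-record sample data are made up for illustration; they are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA text from a filehandle and return a list
# of [header, sequence] pairs, one pair per ">" line.
sub read_fasta_records {
    my ($fh) = @_;
    my @records;
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^>(.*)/ ) {
            push @records, [ $1, '' ];    # start a new record
        }
        elsif (@records) {
            $line =~ s/\s+//g;            # drop all white space
            $records[-1][1] .= $line;     # append to the current sequence
        }
    }
    return @records;
}

# Made-up two-record input, read through an in-memory filehandle.
my $fasta = ">seq1 example\nMFKF\nFA\n>seq2 example\nMKV L\n";
open( my $fh, '<', \$fasta ) or die "Cannot open in-memory file: $!";
for my $record ( read_fasta_records($fh) ) {
    my ( $header, $sequence ) = @$record;
    my $count = ( $sequence =~ tr/F// );    # count F without changing the string
    print "$header: $sequence contains $count F\n";
}
close $fh;
```

With one sequence per header, each record gets its own count instead of all sequences being merged into one string.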
If you tell me, I'll forget.
If you show me, I'll remember.
if you involve me, I'll understand.
--- Author unknown to me
Re: PROTEIN FILE help me pleaseee by pvaldes (Hermit) on Jan 30, 2013 at 15:13 UTC |
Deliberately incomplete, after all this is (your) homework, but you have some of your problems solved here... only because Oxytricha is a cute green thing
#!/usr/bin/perl -w
use strict;
my @protein = ();
my $phenylcounter = 0;
open (my $PROTEFILE, '<', $ARGV[0])
    or die $!;    # If the file is not existing it gives an error message
# yup, I avoid the <STDIN> idea, solved by you.
while (my $line = <$PROTEFILE>) {    # We read the lines of the file
    next if $line =~ m/^>/;    # If the line starts with ">" is not considered.
    $line =~ s/\s//g;    # all white spaces removed from the line
    # Using the translate command, the program counts the number of F in
    # the sequence, assigning it to a variable.
    # left for you... tr/F//...
    $phenylcounter++;    # placeholder: this counts lines, not F's
}    # the while loop terminates
close $PROTEFILE;    # and the program continues closing the file.
print "The aminoacid sequence in ", $ARGV[0], " contains ", $phenylcounter,
    " Phenylalanine aminoacids";
print "\n";    # followed by new line
Updated: ($phenilcounter != $phenylcounter), fixed now
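The tr/F// hint can be sketched on its own, away from the file handling. This is only an illustration: the helper name count_phenylalanine and the short sample string are made up, not part of the code in this post.

```perl
use strict;
use warnings;

# Hypothetical helper: count the F (phenylalanine) residues in a sequence.
sub count_phenylalanine {
    my ($sequence) = @_;
    # With an empty replacement list and no /d modifier, tr/// leaves the
    # string unchanged and returns the number of characters it matched.
    return ( $sequence =~ tr/F// );
}

my $sequence = 'MFKFFA';    # made-up example sequence
my $count    = count_phenylalanine($sequence);
print "The aminoacid sequence $sequence contains $count Phenylalanine aminoacids\n";
```

Inside a line-by-line loop the same expression can be added to a running total, for example $phenylcounter += ( $line =~ tr/F// ), rather than incrementing once per line.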
Re: PROTEIN FILE help me pleaseee by Kenosis (Deacon) on Jan 30, 2013 at 16:59 UTC |
You've been given excellent suggestions (!):
- choroba shows the three-argument open, including using lexically-scoped variables (don't use global variables for file opens).
- 2teez shows a crucial first step: always use strict; use warnings;
- pvaldes imparted a hint on counting the number of "F"s in the sequence
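The first point can be sketched roughly as follows. This is an illustration under assumptions, not choroba's actual code: the helper name read_lines is made up, and a temporary file stands in for the protein file so the sketch runs anywhere.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: open a file for reading and return all of its lines.
sub read_lines {
    my ($filename) = @_;
    # Three-argument open with a lexical filehandle; die if the open fails.
    open( my $fh, '<', $filename )
        or die "Cannot open '$filename': $!\n";
    my @lines = <$fh>;
    close $fh;
    return @lines;
}

# Demo data written to a temporary file (removed again at exit).
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
print {$tmp_fh} ">demo\nMFKF\n";
close $tmp_fh;

# A name read from <STDIN> keeps its trailing newline until chomp removes it.
my $typed = "$tmp_name\n";
chomp $typed;
print read_lines($typed);
```

Without the chomp, the newline stays part of the file name and the open is likely to fail even when the file exists.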
Not to confuse matters here, but for your future reference (since you appear to be on a bioinformatics path), consider becoming well acquainted with Bio::SeqIO and its set of related modules.
Just as there are well-developed modules to parse HTML, XML, and CSV files, Bio::SeqIO lives to do the same for Fasta and other such formats.
For example, to retrieve and process each sequence within a Fasta file, you can do the following:
use strict;
use warnings;
use Bio::SeqIO;
print "Please type the file name of the protein sequence data: ";
chomp( my $proteinfilename = <STDIN> );
print "\n";
my $fastaIN = Bio::SeqIO->new( -file => $proteinfilename, -format => 'Fasta' );
while ( my $seq = $fastaIN->next_seq() ) {
print $seq->seq, "\n";
}
Each sequence in the Fasta file is accessible using the $seq->seq notation. The first part before the arrow operator is the sequence object; the part after the arrow operator is the method. These methods are covered in detail in the Bio::Seq module's documentation. In the example above, the sequence is printed, but a character count could be done there, too.
Hope this helps!
Re: PROTEIN FILE help me pleaseee by Anonymous Monk on Jan 30, 2013 at 23:06 UTC |
Please don't remove the original text of a comment.
|
Please don't remove the original text of a comment. But if he keeps it, then teacher will know he cheated
|
hi!
I have the same problem and i'm going crazy. HELP ME!
My perl program is:
#!/usr/bin/perl -w
print "Please type the file name of the protein sequence data: ";
$proteinfilename = <STDIN>;
chomp $proteinfilename;
unless ( open (PROTEINFILE, $proteinfilename) ) {
print "File \"$proteinfilename\" doesn\'t seem to exist!!\n";
}
$protein = <PROTEINFILE> ;
$empty = " " , (<PROTEINFILE>);
while ( $protein = <PROTEINFILE> ) {
if ( $protein =~ /^>/ ) {
next $protein;
} else {
$protein =~ s/\s//g ;
$union = join ($empty, $protein);
}
};
close PROTEINFILE ;
$quantif = $union;
$count = ($quantif =~ tr/F//);
print "The aminoacid sequence:\n$union\ncontains $count Tryptophan aminoacids\n\n";
the problem is the count!!!!
thank you
Alessandra
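A possible reason for the wrong count: join with a single string just returns that string, so $union is overwritten on every pass through the loop and ends up holding only the last line of the file. The reads placed before the loop also consume input before the loop starts. Below is a minimal sketch of the accumulation step, assuming the sequence should be built up with the .= operator. The helper name read_sequence and the sample data are made up, and an in-memory filehandle stands in for the real file. Note that F is the one-letter code for phenylalanine (tryptophan is W), so the message here says Phenylalanine.

```perl
use strict;
use warnings;

# Hypothetical helper: build one sequence string from the FASTA lines on $fh.
sub read_sequence {
    my ($fh) = @_;
    my $sequence = '';             # initialised to the empty string
    while ( my $line = <$fh> ) {
        next if $line =~ /^>/;     # skip the header line
        $line =~ s/\s//g;          # remove all white space, newline included
        $sequence .= $line;        # append to the sequence, do not overwrite
    }
    return $sequence;
}

my $fasta = ">demo header\nMFKF\nFAW\n";    # made-up data
open( my $fh, '<', \$fasta ) or die "Cannot open in-memory file: $!";
my $union = read_sequence($fh);
close $fh;

my $count = ( $union =~ tr/F// );
print "The aminoacid sequence:\n$union\ncontains $count Phenylalanine aminoacids\n";
```

The while loop is the only place that reads from the filehandle, so no lines are lost before the loop begins.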