Nickmofoe has asked for the wisdom of the Perl Monks concerning the following question:
Greetings fellow monks. I have picked up perl for my bioinformatics class and im struggling with an assignment. So the gist is I need to write a program that reads a pdb flat file (https://files.rcsb.org/view/6U9D.pdb) and I want to essentially turn it into a FASTA file as well as calculate the geometric center of the atoms using the xyz coordinates the file provides. Now the first part, I have no trouble with. This is the code I have tried so far:
$k=0;
open (IN, '6U9D.pdb.txt');
%amino_acid_conversion = (ALA=>'A',TYR=>'Y',MET=>'M',LEU=>'L',CYS=>'C'
+,GLY=>'G', ARG=>'R',ASN=>'N',ASP=>'D',GLN=>'Q',GLU=>'E',HIS=>'H',TRP=
+>'W',LYS=>'K',PHE=>'F',PRO=>'P',SER=>'S',THR=>'T',ILE=>'I',VAL=>'V');
while (<IN>) {
if ($_=~m/HEADER\s+(.*)/){
print ">$1\n"; }
if ($_=~m/^SEQRES\s+\d+\s+\w+\s+\d+\s+(.*)/){
$seq.=$1;
$seq=~s/ //g;}
}
#THIS IS THE PART I CANT SEEM TO FIGURE OUT
for ($i=0;$i<=length $seq; $i+=3) {
print "$amino_acid_conversion{substr($seq,$i,3)}"; #All seems well wit
+h the sequence I get the desired output
$k++;}
print "\n";
if ($_=~m/^ATOM\s+\d+\s+\w+\s+\w+\s+\w+\d+\s+(\S+)\s+(\S+)\s+(\S+)/){
+ #the parentheses are the xyz coordinates respectively
$x+=$1;
$y+=$2;
$z+=$3;
}
$xgk=($x/$k); $ygk=($y/$k); $zgk=($z/$k);
print "$xgk $ygk $zgk \n";
The total outpud I get is the sequence and three zeroes. It seems like its not even reading the coordinates? Not sure there... If any of you can share your wisdom I'd appreciate it! Oh and since this is an assignment please use the commands I have here, since this is what I have been taught so far.
Re: Issues regarding for loops and recursion
by choroba (Cardinal) on Jan 09, 2023 at 12:44 UTC
|
You are missing a \s+ between a \w+ and \d+. Also, the whole part needs to be moved inside the loop.
#! /usr/bin/perl
use warnings;
use strict;
open my $in, '<', '6U9D.pdb' or die $!;
my %amino_acid_conversion = (ALA => 'A', TYR => 'Y', MET => 'M', LEU =
+> 'L',
CYS => 'C', GLY => 'G', ARG => 'R', ASN =
+> 'N',
ASP => 'D', GLN => 'Q', GLU => 'E', HIS =
+> 'H',
TRP => 'W', LYS => 'K', PHE => 'F', PRO =
+> 'P',
SER => 'S', THR => 'T', ILE => 'I', VAL =
+> 'V');
my ($x, $y, $z) = (0, 0, 0);
my $seq;
while (<$in>) {
if (/HEADER\s+(.*)/) {
print ">$1\n";
}
if (/^SEQRES\s+\d+\s+\w+\s+\d+\s+(.*)/) {
$seq .= $1;
}
if (/^ATOM\s+\d+\s+\w+\s+\w+\s+\w+\s+\d+\s+(\S+)\s+(\S+)\s+(\S+)/)
+ {
# ~~~
$x += $1;
$y += $2;
$z += $3;
}
}
$seq =~ s/ //g;
my $k = 0;
for (my $i = 0; $i < length $seq; $i += 3) {
my $triplet = substr $seq, $i, 3;
if (exists $amino_acid_conversion{$triplet}) {
print "$amino_acid_conversion{$triplet}";
} else {
warn "Don't know how to convert '$triplet'.\n";
}
$k++;
}
print "\n";
my $xgk = $x / $k;
my $ygk = $y / $k;
my $zgk = $z / $k;
print "$xgk $ygk $zgk \n";
Note that I also turned strict and warnings on, used the 3-argument version of open without bareword filehandles, and checked its return value.
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
| [reply] [d/l] [select] |
Re: Issues regarding for loops and recursion
by Corion (Patriarch) on Jan 09, 2023 at 12:44 UTC
|
if ($_=~m/^ATOM\s+\d+\s+\w+\s+\w+\s+\w+\d+\s+(\S+)\s+(\S+)\s+(\S+)/){
Here you match against $_, but it was never set. Did you want to match against $seq? I'm not sure.
Some general stylistic things:
open (IN, '6U9D.pdb.txt');
You should check whether opening the file actually succeeded:
my $filename = '6U9D.pdb.txt';
open (IN, $filename)
or die "Couldn't open '$filename': $!";
You should use the strict pragma. This requires you to declare all variables but on the upside, it allows Perl to tell you about typos in your variable names.
When assigning the %amino_acid_conversion, you could use newlines to make the code a bit more readable:
%amino_acid_conversion = (
ALA=>'A',
TYR=>'Y',
MET=>'M',
LEU=>'L',
CYS=>'C',
GLY=>'G',
ARG=>'R',
ASN=>'N',
ASP=>'D',
GLN=>'Q',
GLU=>'E',
HIS=>'H',
TRP=>'W',
LYS=>'K',
PHE=>'F',
PRO=>'P',
SER=>'S',
THR=>'T',
ILE=>'I',
VAL=>'V'
);
| [reply] [d/l] [select] |
|
%amino_acid_conversion = (
ALA => 'A',
ARG => 'R',
ASN => 'N',
ASP => 'D',
CYS => 'C',
GLN => 'Q',
GLU => 'E',
GLY => 'G',
HIS => 'H',
ILE => 'I',
LEU => 'L',
LYS => 'K',
MET => 'M',
PHE => 'F',
PRO => 'P',
SER => 'S',
THR => 'T',
TRP => 'W',
TYR => 'Y',
VAL => 'V'
);
| [reply] [d/l] [select] |
|
use YAML::XS qw( Load );
my %conversions = %{Load(<<'EOT')};
---
ALA: A
TYR: Y
MET: M
...
EOT
The cake is a lie.
The cake is a lie.
The cake is a lie.
| [reply] [d/l] |
Re: Issues regarding for loops and recursion
by Marshall (Canon) on Jan 12, 2023 at 12:38 UTC
|
I see some trouble with your code to parse out the sequences from SEQRES lines. There is some redundant information in the .pdb format. I used the sequential letter (A,B,C,D,E,etc) to delineate start/end of the various sequences. Number of 3 letter groups in the sequence would be another way. I am not sure what you are trying to do. At the end of code, see a couple of example FASTA segments I cut n' pasted from the program output.
I have no ide what to do with the x,y,z atom coordinates or even what they mean. I averaged them for fun. I changed your regex a bit to make it easier to pick out the +-floating point numbers.
To run code, download the .pdb file and name it "6U9D.pdb". In production code, I probably would not use so much memory for temporary results, preferring to calculate and output results as they occur rather than save a bunch of data and process it at the end. However for these prototype things, having the data in intermediate forms can useful for debugging. Have fun.
use strict;
use warnings;
open my $pdb_fh, '<', "6U9D.pdb" or die "unable to open 6U9D.pdb for r
+eading $!";
my %amino_acid_conversion = (
ALA => 'A',
ARG => 'R',
ASN => 'N',
ASP => 'D',
CYS => 'C',
GLN => 'Q',
GLU => 'E',
GLY => 'G',
HIS => 'H',
ILE => 'I',
LEU => 'L',
LYS => 'K',
MET => 'M',
PHE => 'F',
PRO => 'P',
SER => 'S',
THR => 'T',
TRP => 'W',
TYR => 'Y',
VAL => 'V');
my @atoms; #2D array of x,y,z of all atoms (many thousands)
my @seqs; #raw sequences with the 3 letter designations
#convert later to FASTA format
my $cur_seq = '';
my $cur_ltr = '';
# parse out data of interest from file
#
while (<$pdb_fh>)
{
if (my ($ltr, $seq) = (/^SEQRES\s+\d+\s+(\w+)\s+\d+\s+([A-Z ]+)$/)
+)
{
$seq =~ s/\s+$//;
if ($ltr eq $cur_ltr)
{
$cur_seq .= " $seq";
}
else
{
push @seqs, $cur_seq if $cur_seq ne ''; # end of current seq
+uence
$cur_seq = $seq; # begin of the next
+sequence
$cur_ltr = $ltr;
}
}
elsif (my ($x,$y,$z) = (/^ATOM\s+.*?([\d.-]+)\s+([\d.-]+)\s+([\d.-]
++)/) )
{
push @atoms, [$x,$y,$z];
}
}
push @seqs, $cur_seq; # don't forget to finish the last seq!
#### output collected data ###
#make a fasta sequence segments
foreach my $seq (@seqs)
{
# my $fasta = join '',map{$amino_acid_conversion{$_}}split ' ',$seq
+;
# without using a map:
#
my $fasta ='';
foreach my $char3 (split ' ',$seq)
{
$fasta.= $amino_acid_conversion{$char3}
}
print ">Some Fasta Description Line\n"; #use 60 char lines
while ($fasta) #fasta suggested max is 80
{
print substr($fasta,0,60,''),"\n";
}
}
#print the data points
# I am not sure what needs to be done with them
# average of each coordinate?
#foreach my $row_ref (@atoms) #uncomment to print
#{
# print @$row_ref,"\n";
#}
my $xsum; my $ysum; my $zsum;
foreach my $row_ref (@atoms) # @atoms is a 2D array
{
my ($x, $y , $z ) = @$row_ref;
$xsum+=$x;
$ysum+=$y;
$zsum+=$z;
}
print "avg x = ",$xsum/@atoms,"\n";
print "avg y = ",$ysum/@atoms,"\n";
print "avg z = ",$zsum/@atoms,"\n";
__END__
These are 2 examples:
You will have to figure out what goes in the FASTA description line
And perhaps not all of these sequences are relevant? Looks like
a lot are duplicates.
>Some Fasta Description Line
MHHHHHHENLYFQGAPSFNVDPLEQPAEPSKLAKKLRAEPDMDTSFVGLTGGQIFNEMMS
RQNVDTVFGYPGGAILPVYDAIHNSDKFNFVLPKHEQGAGHMAEGYARASGKPGVVLVTS
GPGATNVVTPMADAFADGIPMVVFTGQVPTSAIGTDAFQEADVVGISRSCTKWNVMVKSV
EELPLRINEAFEIATSGRPGPVLVDLPKDVTAAILRNPIPTKTTLPSNALNQLTSRAQDE
FVMQSINKAADLINLAKKPVLYVGAGILNHADGPRLLKELSDRAQIPVTTTLQGLGSFDQ
EDPKSLDMLGMHGCATANLAVQNADLIIAVGARFDDRVTGNISKFAPEARRAAAEGRGGI
IHFEVSPKNINKVVQTQIAVEGDATTNLGKMMSKIFPVKERSEWFAQINKWKKEYPYAYM
EETPGSKIKPQTVIKKLSKVANDTGRHVIVTTGVGQHQMWAAQHWTWRNPHTFITSGGLG
TMGYGLPAAIGAQVAKPESLVIDIDGDASFNMTLTELSSAVQAGTPVKILILNNEEQGMV
TQWQSLFYEHRYSHTHQLNPDFIKLAEAMGLKGLRVKKQEELDAKLKEFVSTKGPVLLEV
EVDKKVPVLPMVAGGSGLDEFINFDPEVERQQTELRHKRTGGKH
>Some Fasta Description Line
MGSSHHHHHHSSGLVPRGSHMENLYFQGATRPPLPTLDTPSWNANSAVSSIIYETPAPSR
QPRKQHVLNCLVQNEPGVLSRVSGTLAARGFNIDSLVVCNTEVKDLSRMTIVLQGQDGVI
EQARRQIEDLVPVYAVLDYTNSEIIKRELVMARISLLGTEYFEDLLLHHHTSTNAGAADS
QELVAEIREKQFHPANLPASEVLRLKHEHLNDITNLTNNFGGRVVDISETSCIVELSAKP
TRISAFLKLVEPFGVLECARSGMMALPRTPLKTSTEEAADEDEKISEIVDISQLPPG
I have no idea what these numbers would mean?
avg x = 321.013155298296
avg y = 290.744642162734
avg z = 69.196842162731
| [reply] [d/l] |
|
|