Mean calculation problem

by hellworld (Novice)
 on Jan 13, 2010 at 05:33 UTC Need Help??

hellworld has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I have a problem about calculating the mean of B-factors in a PDB File. Each aminoacid group should have the mean values of their own before I can push them into an array but I got stuck in the regex and couldn't create the correct loop since it gets mixed up with the regex of reading the ATOM lines. By aminoacids I mean that I should calculate the B-factor (numbers next to the 1.00's) mean of all lines of "LYS 2"(e.g. (4*28.52+5*92.54)/9 for LYS 2 line) and "GLN 3"s and all the others that follow seperately. Also, since each PDB file differs in format, I should make this calculation applied to all pdb files since the first ATOM line could also be something like "GLU 66", 67 and so on. Any ideas about how I could do this ? Thanks.
ATOM      1  N   LYS A   2     -10.231   0.078  18.128  1.00 28.52
+       N
ATOM      2  CA  LYS A   2      -9.472  -0.327  19.297  1.00 28.52
+       C
ATOM      3  C   LYS A   2     -10.308  -0.875  20.459  1.00 28.52
+       C
ATOM      4  O   LYS A   2     -10.180  -2.054  20.801  1.00 28.52
+       O
ATOM      5  CB  LYS A   2      -8.592   0.833  19.765  1.00 92.54
+       C
ATOM      6  CG  LYS A   2      -7.346   0.373  20.498  1.00 92.54
+       C
ATOM      7  CD  LYS A   2      -6.207   1.375  20.382  1.00 92.54
+       C
ATOM      8  CE  LYS A   2      -4.929   0.810  21.002  1.00 92.54
+       C
ATOM      9  NZ  LYS A   2      -3.756   1.724  20.882  1.00 92.54
+       N
ATOM     10  N   GLN A   3     -11.162  -0.038  21.053  1.00 28.24
+       N
ATOM     11  CA  GLN A   3     -11.998  -0.465  22.181  1.00 28.24
+       C
ATOM     12  C   GLN A   3     -12.749  -1.750  21.849  1.00 28.24
+       C
ATOM     13  O   GLN A   3     -12.882  -2.639  22.694  1.00 28.24
+       O
ATOM     14  CB  GLN A   3     -13.019   0.618  22.556  1.00 42.14
+       C
ATOM     15  CG  GLN A   3     -12.450   2.018  22.718  1.00 42.14
+       C
ATOM     16  CD  GLN A   3     -12.541   2.849  21.437  1.00 42.14
+       C
ATOM     17  OE1 GLN A   3     -12.042   2.453  20.377  1.00 42.14
+       O
ATOM     18  NE2 GLN A   3     -13.180   4.012  21.536  1.00 42.14
+       N
[download]
or another random example, for which the mean code should apply for all situations:
ATOM      1  N   LEU A  62      32.071  22.080  63.175  1.00 25.25
+       N
ATOM      2  CA  LEU A  62      32.171  21.102  62.056  1.00 25.25
+       C
ATOM      3  C   LEU A  62      31.367  19.843  62.260  1.00 25.25
+       C
ATOM      4  O   LEU A  62      31.676  18.842  61.650  1.00 25.25
+       O
ATOM      5  CB  LEU A  62      33.640  20.698  61.795  1.00 25.25
+       C
ATOM      6  CG  LEU A  62      34.589  21.606  60.965  1.00 25.25
+       C
ATOM      7  CD1 LEU A  62      35.877  20.829  60.616  1.00 25.25
+       C
ATOM      8  CD2 LEU A  62      33.903  22.137  59.675  1.00 25.25
+       C
ATOM      9  N   LEU A  63      30.358  19.879  63.124  1.00 25.25
+       N
ATOM     10  CA  LEU A  63      29.497  18.715  63.396  1.00 25.25
+       C
ATOM     11  C   LEU A  63      28.036  19.160  63.458  1.00 25.25
+       C
ATOM     12  O   LEU A  63      27.574  19.700  64.489  1.00 25.25
+       O
ATOM     13  CB  LEU A  63      29.869  17.995  64.717  1.00 25.25
+       C
ATOM     14  CG  LEU A  63      29.125  16.690  65.118  1.00 25.25
+       C
ATOM     15  CD1 LEU A  63      29.725  15.515  64.383  1.00 25.25
+       C
ATOM     16  CD2 LEU A  63      29.208  16.398  66.606  1.00 25.25
+       C
[download]

Replies are listed 'Best First'.
Re: Mean calculation problem
by BrowserUk (Pope) on Jan 13, 2010 at 07:53 UTC

If you're only interested in processing the ATOM records, then this may be good enough:

#! perl -slw
use strict;
use List::Util qw[ sum ];

my %stats;

while( <DATA> ) {
next unless m[^ATOM];
my @fields = unpack 'a6 a5 a3 a3 a4 x2 a3 x4 a8 a8 a7 a7 a6 a12',
+$_; # print join'|', @fields; push @{$stats{ $fields[ 4 ] } },$fields[ 10 ];
}

printf "%3s : %.3f%%\n", $_, sum( @{$stats{ $_ } } ) / @{$stats{ $_ } } for sort keys %stats; =output C:\test>junk2 GLN : 35.962% LEU : 25.250% LYS : 64.087% =cut __DATA__ 1234567890123456789012345678901234567890123456789012345678901234567890 +1234567890 ATOM 1 N LYS A 2 -10.231 0.078 18.128 1.00 28.52 + N ATOM 2 CA LYS A 2 -9.472 -0.327 19.297 1.00 28.52 + C ATOM 3 C LYS A 2 -10.308 -0.875 20.459 1.00 28.52 + C ATOM 4 O LYS A 2 -10.180 -2.054 20.801 1.00 28.52 + O ATOM 5 CB LYS A 2 -8.592 0.833 19.765 1.00 92.54 + C ATOM 6 CG LYS A 2 -7.346 0.373 20.498 1.00 92.54 + C ATOM 7 CD LYS A 2 -6.207 1.375 20.382 1.00 92.54 + C ATOM 8 CE LYS A 2 -4.929 0.810 21.002 1.00 92.54 + C ATOM 9 NZ LYS A 2 -3.756 1.724 20.882 1.00 92.54 + N ATOM 10 N GLN A 3 -11.162 -0.038 21.053 1.00 28.24 + N ATOM 11 CA GLN A 3 -11.998 -0.465 22.181 1.00 28.24 + C ATOM 12 C GLN A 3 -12.749 -1.750 21.849 1.00 28.24 + C ATOM 13 O GLN A 3 -12.882 -2.639 22.694 1.00 28.24 + O ATOM 14 CB GLN A 3 -13.019 0.618 22.556 1.00 42.14 + C ATOM 15 CG GLN A 3 -12.450 2.018 22.718 1.00 42.14 + C ATOM 16 CD GLN A 3 -12.541 2.849 21.437 1.00 42.14 + C ATOM 17 OE1 GLN A 3 -12.042 2.453 20.377 1.00 42.14 + O ATOM 18 NE2 GLN A 3 -13.180 4.012 21.536 1.00 42.14 + N ATOM 1 N LEU A 62 32.071 22.080 63.175 1.00 25.25 + N ATOM 2 CA LEU A 62 32.171 21.102 62.056 1.00 25.25 + C ATOM 3 C LEU A 62 31.367 19.843 62.260 1.00 25.25 + C ATOM 4 O LEU A 62 31.676 18.842 61.650 1.00 25.25 + O ATOM 5 CB LEU A 62 33.640 20.698 61.795 1.00 25.25 + C ATOM 6 CG LEU A 62 34.589 21.606 60.965 1.00 25.25 + C ATOM 7 CD1 LEU A 62 35.877 20.829 60.616 1.00 25.25 + C ATOM 8 CD2 LEU A 62 33.903 22.137 59.675 1.00 25.25 + C ATOM 9 N LEU A 63 30.358 19.879 63.124 1.00 25.25 + N ATOM 10 CA LEU A 63 29.497 18.715 63.396 1.00 25.25 + C ATOM 11 C LEU A 63 28.036 19.160 63.458 1.00 25.25 + C ATOM 12 O LEU A 63 27.574 19.700 64.489 1.00 25.25 + O ATOM 13 CB LEU A 63 29.869 17.995 64.717 1.00 25.25 + C ATOM 14 CG LEU A 63 29.125 16.690 65.118 1.00 25.25 + C ATOM 15 CD1 LEU A 63 29.725 15.515 64.383 1.00 25.25 + C ATOM 16 CD2 LEU A 63 29.208 16.398 66.606 1.00 25.25 + C [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. Thanks for spending time to write the code, helped a lot. But is there a way for me to apply this to a big chunk of data ? I mean, the aminoacid groups go as long as 1,2,...............264 and there can be 2600something ATOM lines depending on the PDB file. Sure. If you supply this version: #! perl -slw use strict; use List::Util qw[ sum ]; BEGIN{ @ARGV = map glob($_ ), @ARGV }

my %stats;

while( <> ) {
next unless m[^ATOM];
my @fields = unpack 'a6 a5 a3 a3 a4 x2 a3 x4 a8 a8 a7 a7 a6 a12',
+$_; # print join'|', @fields; push @{$stats{ $fields[ 4 ] } },$fields[ 10 ];
}

printf "%3s : %.3f%%\n", $_, sum( @{$stats{ $_ } } ) / @{$stats{ \$_ } }
for sort keys %stats;
[download]

with a wildcard path/filename on the command line,

theScript.pl /path/to/*.pdb
[download]

it will perform the stats across all the matching files.

Alternatively, if you want the stats on a file-by-file basis, then supply the filenames one at a time:

## For windows; something similar is possible on *nix
for %i in (\path\to\*.pdb) do @theScript.pl \path\to\%i
[download]

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Mean calculation problem
by bobf (Monsignor) on Jan 13, 2010 at 06:52 UTC

I got stuck in the regex

I would suggest starting with CPAN rather than writing your own parser. Bio::Structure::IO, Chemistry::File::PDB, Chemistry::Mol all look relevant, and there are likely several more. If you use an existing module to do the low-level work, you can focus on the real reason you are writing the code (i.e., it isn't because you want to parse a PDB file, it is because you want to do something with the contents of the file).

If you insist on rolling your own parser, look at unpack. It is great for fixed-width data, and will likely be much faster and less buggy than any regex you assemble.

HTH

Create A New User
Node Status?
node history
Node Type: perlquestion [id://817108]
Approved by matze77
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (3)
As of 2021-05-17 00:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
Perl 7 will be out ...

Results (152 votes). Check out past polls.

Notices?