Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Mean calculation problem

by hellworld (Novice)
on Jan 13, 2010 at 05:33 UTC ( #817108=perlquestion: print w/replies, xml ) Need Help??

hellworld has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I have a problem about calculating the mean of B-factors in a PDB File. Each aminoacid group should have the mean values of their own before I can push them into an array but I got stuck in the regex and couldn't create the correct loop since it gets mixed up with the regex of reading the ATOM lines. By aminoacids I mean that I should calculate the B-factor (numbers next to the 1.00's) mean of all lines of "LYS 2"(e.g. (4*28.52+5*92.54)/9 for LYS 2 line) and "GLN 3"s and all the others that follow seperately. Also, since each PDB file differs in format, I should make this calculation applied to all pdb files since the first ATOM line could also be something like "GLU 66", 67 and so on. Any ideas about how I could do this ? Thanks.
ATOM 1 N LYS A 2 -10.231 0.078 18.128 1.00 28.52 + N ATOM 2 CA LYS A 2 -9.472 -0.327 19.297 1.00 28.52 + C ATOM 3 C LYS A 2 -10.308 -0.875 20.459 1.00 28.52 + C ATOM 4 O LYS A 2 -10.180 -2.054 20.801 1.00 28.52 + O ATOM 5 CB LYS A 2 -8.592 0.833 19.765 1.00 92.54 + C ATOM 6 CG LYS A 2 -7.346 0.373 20.498 1.00 92.54 + C ATOM 7 CD LYS A 2 -6.207 1.375 20.382 1.00 92.54 + C ATOM 8 CE LYS A 2 -4.929 0.810 21.002 1.00 92.54 + C ATOM 9 NZ LYS A 2 -3.756 1.724 20.882 1.00 92.54 + N ATOM 10 N GLN A 3 -11.162 -0.038 21.053 1.00 28.24 + N ATOM 11 CA GLN A 3 -11.998 -0.465 22.181 1.00 28.24 + C ATOM 12 C GLN A 3 -12.749 -1.750 21.849 1.00 28.24 + C ATOM 13 O GLN A 3 -12.882 -2.639 22.694 1.00 28.24 + O ATOM 14 CB GLN A 3 -13.019 0.618 22.556 1.00 42.14 + C ATOM 15 CG GLN A 3 -12.450 2.018 22.718 1.00 42.14 + C ATOM 16 CD GLN A 3 -12.541 2.849 21.437 1.00 42.14 + C ATOM 17 OE1 GLN A 3 -12.042 2.453 20.377 1.00 42.14 + O ATOM 18 NE2 GLN A 3 -13.180 4.012 21.536 1.00 42.14 + N
or another random example, for which the mean code should apply for all situations:
ATOM 1 N LEU A 62 32.071 22.080 63.175 1.00 25.25 + N ATOM 2 CA LEU A 62 32.171 21.102 62.056 1.00 25.25 + C ATOM 3 C LEU A 62 31.367 19.843 62.260 1.00 25.25 + C ATOM 4 O LEU A 62 31.676 18.842 61.650 1.00 25.25 + O ATOM 5 CB LEU A 62 33.640 20.698 61.795 1.00 25.25 + C ATOM 6 CG LEU A 62 34.589 21.606 60.965 1.00 25.25 + C ATOM 7 CD1 LEU A 62 35.877 20.829 60.616 1.00 25.25 + C ATOM 8 CD2 LEU A 62 33.903 22.137 59.675 1.00 25.25 + C ATOM 9 N LEU A 63 30.358 19.879 63.124 1.00 25.25 + N ATOM 10 CA LEU A 63 29.497 18.715 63.396 1.00 25.25 + C ATOM 11 C LEU A 63 28.036 19.160 63.458 1.00 25.25 + C ATOM 12 O LEU A 63 27.574 19.700 64.489 1.00 25.25 + O ATOM 13 CB LEU A 63 29.869 17.995 64.717 1.00 25.25 + C ATOM 14 CG LEU A 63 29.125 16.690 65.118 1.00 25.25 + C ATOM 15 CD1 LEU A 63 29.725 15.515 64.383 1.00 25.25 + C ATOM 16 CD2 LEU A 63 29.208 16.398 66.606 1.00 25.25 + C

Replies are listed 'Best First'.
Re: Mean calculation problem
by BrowserUk (Pope) on Jan 13, 2010 at 07:53 UTC

    If you're only interested in processing the ATOM records, then this may be good enough:

    #! perl -slw use strict; use List::Util qw[ sum ]; my %stats; while( <DATA> ) { next unless m[^ATOM]; my @fields = unpack 'a6 a5 a3 a3 a4 x2 a3 x4 a8 a8 a7 a7 a6 a12', +$_; # print join'|', @fields; push @{ $stats{ $fields[ 4 ] } }, $fields[ 10 ]; } printf "%3s : %.3f%%\n", $_, sum( @{ $stats{ $_ } } ) / @{ $stats{ $_ } } for sort keys %stats; =output C:\test>junk2 GLN : 35.962% LEU : 25.250% LYS : 64.087% =cut __DATA__ 1234567890123456789012345678901234567890123456789012345678901234567890 +1234567890 ATOM 1 N LYS A 2 -10.231 0.078 18.128 1.00 28.52 + N ATOM 2 CA LYS A 2 -9.472 -0.327 19.297 1.00 28.52 + C ATOM 3 C LYS A 2 -10.308 -0.875 20.459 1.00 28.52 + C ATOM 4 O LYS A 2 -10.180 -2.054 20.801 1.00 28.52 + O ATOM 5 CB LYS A 2 -8.592 0.833 19.765 1.00 92.54 + C ATOM 6 CG LYS A 2 -7.346 0.373 20.498 1.00 92.54 + C ATOM 7 CD LYS A 2 -6.207 1.375 20.382 1.00 92.54 + C ATOM 8 CE LYS A 2 -4.929 0.810 21.002 1.00 92.54 + C ATOM 9 NZ LYS A 2 -3.756 1.724 20.882 1.00 92.54 + N ATOM 10 N GLN A 3 -11.162 -0.038 21.053 1.00 28.24 + N ATOM 11 CA GLN A 3 -11.998 -0.465 22.181 1.00 28.24 + C ATOM 12 C GLN A 3 -12.749 -1.750 21.849 1.00 28.24 + C ATOM 13 O GLN A 3 -12.882 -2.639 22.694 1.00 28.24 + O ATOM 14 CB GLN A 3 -13.019 0.618 22.556 1.00 42.14 + C ATOM 15 CG GLN A 3 -12.450 2.018 22.718 1.00 42.14 + C ATOM 16 CD GLN A 3 -12.541 2.849 21.437 1.00 42.14 + C ATOM 17 OE1 GLN A 3 -12.042 2.453 20.377 1.00 42.14 + O ATOM 18 NE2 GLN A 3 -13.180 4.012 21.536 1.00 42.14 + N ATOM 1 N LEU A 62 32.071 22.080 63.175 1.00 25.25 + N ATOM 2 CA LEU A 62 32.171 21.102 62.056 1.00 25.25 + C ATOM 3 C LEU A 62 31.367 19.843 62.260 1.00 25.25 + C ATOM 4 O LEU A 62 31.676 18.842 61.650 1.00 25.25 + O ATOM 5 CB LEU A 62 33.640 20.698 61.795 1.00 25.25 + C ATOM 6 CG LEU A 62 34.589 21.606 60.965 1.00 25.25 + C ATOM 7 CD1 LEU A 62 35.877 20.829 60.616 1.00 25.25 + C ATOM 8 CD2 LEU A 62 33.903 22.137 59.675 1.00 25.25 + C ATOM 9 N LEU A 63 30.358 19.879 63.124 1.00 25.25 + N ATOM 10 CA LEU A 63 29.497 18.715 63.396 1.00 25.25 + C ATOM 11 C LEU A 63 28.036 19.160 63.458 1.00 25.25 + C ATOM 12 O LEU A 63 27.574 19.700 64.489 1.00 25.25 + O ATOM 13 CB LEU A 63 29.869 17.995 64.717 1.00 25.25 + C ATOM 14 CG LEU A 63 29.125 16.690 65.118 1.00 25.25 + C ATOM 15 CD1 LEU A 63 29.725 15.515 64.383 1.00 25.25 + C ATOM 16 CD2 LEU A 63 29.208 16.398 66.606 1.00 25.25 + C

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks for spending time to write the code, helped a lot. But is there a way for me to apply this to a big chunk of data ? I mean, the aminoacid groups go as long as 1,2,...............264 and there can be 2600something ATOM lines depending on the PDB file.

        Sure. If you supply this version:

        #! perl -slw use strict; use List::Util qw[ sum ]; BEGIN{ @ARGV = map glob( $_ ), @ARGV } my %stats; while( <> ) { next unless m[^ATOM]; my @fields = unpack 'a6 a5 a3 a3 a4 x2 a3 x4 a8 a8 a7 a7 a6 a12', +$_; # print join'|', @fields; push @{ $stats{ $fields[ 4 ] } }, $fields[ 10 ]; } printf "%3s : %.3f%%\n", $_, sum( @{ $stats{ $_ } } ) / @{ $stats{ $_ } } for sort keys %stats;

        with a wildcard path/filename on the command line,

        theScript.pl /path/to/*.pdb

        it will perform the stats across all the matching files.

        Alternatively, if you want the stats on a file-by-file basis, then supply the filenames one at a time:

        ## For windows; something similar is possible on *nix for %i in (\path\to\*.pdb) do @theScript.pl \path\to\%i

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Mean calculation problem
by bobf (Monsignor) on Jan 13, 2010 at 06:52 UTC

    I got stuck in the regex

    I would suggest starting with CPAN rather than writing your own parser. Bio::Structure::IO, Chemistry::File::PDB, Chemistry::Mol all look relevant, and there are likely several more. If you use an existing module to do the low-level work, you can focus on the real reason you are writing the code (i.e., it isn't because you want to parse a PDB file, it is because you want to do something with the contents of the file).

    If you insist on rolling your own parser, look at unpack. It is great for fixed-width data, and will likely be much faster and less buggy than any regex you assemble.

    HTH

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://817108]
Approved by matze77
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (3)
As of 2021-05-17 00:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Perl 7 will be out ...





    Results (152 votes). Check out past polls.

    Notices?