sedm1000 has asked for the wisdom of the Perl Monks concerning the following question:

I'm a little new to this, but would much appreciate any help. I have lines of DNA sequence, aligned by the identity of letters within them. I'd like to sum the % identity of each in the sequence, generating a matrix for the entire alignment. i.e.
```fred  ATGTTGTAT
fred1 ATCTTATAT
fred2 ATCTTATAT

A  3 0 0 0 0 2 0 3 0
T  0 3 0 3 3 0 3 0 3
G  0 0 1 0 0 1 0 0 0
C  0 0 2 0 0 0 0 0 0
[download]```
This will be for 10K+ sequences though, hence the script requirement. I figure that the best way is to count the incidence of each letter at each position within each line and sum them in the matrix, but I'm yet to work out how to do this. Thoughts would be much obliged... Thanks.

Replies are listed 'Best First'.
Re: To count letters (%identity) in DNA alignment
by GrandFather (Saint) on Jan 27, 2009 at 00:40 UTC

Build a hash of arrays - one array for each letter:

```use strict;
use warnings;

my %frequencies;
my \$maxCol = 0;

while (<DATA>) {
chomp;

my (\$name, \$seq) = split;

next unless defined \$seq;

my @letters = split '', \$seq;

++\$frequencies{\$letters[\$_]}[\$_] for 0 .. \$#letters;
\$maxCol = \$#letters if \$maxCol < \$#letters;
}

for my \$letter (qw"A T G C") {
\$frequencies{\$letter}[\$_] ||= 0 for 0 .. \$maxCol;
print "\$letter @{\$frequencies{\$letter}}\n";
}

__DATA__
fred  ATGTTGTAT
fred1 ATCTTATAT
fred2 ATCTTATAT
[download]```

Prints:

```A 3 0 0 0 0 2 0 3 0
T 0 3 0 3 3 0 3 0 3
G 0 0 1 0 0 1 0 0 0
C 0 0 2 0 0 0 0 0 0
[download]```

Perl's payment curve coincides with its learning curve.
Minor nit-pick -- I'd want to include some defensive programming, because with 10K+ records to go through, there's (almost) no such thing as being too careful.

And while we're at it, if we're ready to deal with unsuitable data, might as well, point out when it happens:

```...
my \$bad_count = 0;
while (<...>) {
chomp;
my ( \$name, \$seq ) = split;
unless ( \$seq and \$seq =~ /^[ACGT]+\$/ ) {
\$bad_count++;
next;
}
...
}
warn "Input file had \$bad_count unusable lines\n";
...
[download]```
For that matter, if all the records are supposed to have the same number of letters, add that as part of the conditional.
Re: To count letters (%identity) in DNA alignment
by bobf (Monsignor) on Jan 27, 2009 at 03:32 UTC

I've been known to reinvent wheels from time to time, but this particular wheel has been around the block a few times...

CPAN. Specifically Bio::Matrix::PSM::IO::masta. Per the docs, this module will

convert a set of aligned sequences:
```ACATGCAT
ACAGGGAT
ACAGGCAT
ACCGGCAT
[download]```
to a PFM (SiteMatrix object)

Given the number of PFM-related modules within Bioperl I would imagine anything you want to do is already written.

Unfortunately I don't have time to provide an example. However, I would be very interested to see what you come up with if you go this route. Please post your solution when you are done so others can learn from your experience.

HTH

Thanks for your input - very useful. I skipped the BioPerl modules as I wanted the script to be passed around non-BioPerl users. After further help and modification the final product was.
```use strict;
use warnings;

my \$in = \$ARGV[0];

my \$frequencies;
my \$maxCol = 0;

open (IN, "\$in") or die "cannot open \$in!\n";

while (my \$line = <IN>) {
chomp \$line;

my (\$name, \$seq) = split ("\t", \$line);
if (defined \$seq) {
my @bases = split '', \$seq;
\$maxCol=@bases;
for (my \$i=0; \$i<@bases; \$i++) {
++\$frequencies->{\$bases[\$i]}[\$i];
#        if (\$maxCol < @bases) {
#        \$maxCol = @bases;
#        }
}
}
}

print_hoa(\$frequencies, \$maxCol);

#foreach my \$base (keys %frequencies) {
#    \$frequencies{\$base}[\$_] ||= 0 for 0 .. \$maxCol;
#    print "\$base @{\$frequencies{\$base}}\n";
#}

##Subroutine that prints a hash fo arrays reference.
sub print_hoa {
my (\$hashref, \$len) = @_;

for my \$person (sort keys %{\$hashref}) {
my \$array = \$hashref->{\$person};
my @array_print = @\$array;
print "\$person\t";

foreach my \$num (@array_print) {
if (defined \$num) {
print "\$num ";
} else {
print "0 ";
}
}
my \$i =@array_print;
if (\$i<\$len) {
my \$z_pr = \$len-\$i;
print "0 " x \$z_pr, "\n";
} else {
print "\n";
}
}
}
[download]```
Re: To count letters (%identity) in DNA alignment
by toolic (Bishop) on Jan 27, 2009 at 02:05 UTC
My first thought was to use a hash-of-hashes structure. If your data is as sparse as your sample input looks, this may save a little memory; otherwise, GrandFather's hash-of-arrays is more straight-forward:
```use strict;
use warnings;

my %counts;
my \$max = 0;
while (<DATA>) {
chomp;
my \$code = (split)[-1];
my \$i = 0;
for my \$c (split //, \$code) {
\$counts{\$c}{\$i++}++;
}
\$max = \$i if \$i > \$max;
}
\$max--;

for my \$c (keys %counts) {
print "\$c ";
for my \$i (0 .. \$max) {
if (exists \$counts{\$c}{\$i}) {
print " \$counts{\$c}{\$i}";
}
else {
print ' 0';
}
}
print "\n";
}

__DATA__
fred  ATGTTGTAT
fred1 ATCTTATAT
fred2 ATCTTATAT
[download]```

prints:

```A  3 0 0 0 0 2 0 3 0
T  0 3 0 3 3 0 3 0 3
C  0 0 2 0 0 0 0 0 0
G  0 0 1 0 0 1 0 0 0
[download]```

"10K+ sequences" - it's probably not sparse. ;)

Perl's payment curve coincides with its learning curve.