### How can I count the number of substitutions of one letter in a string by another letter in the other string?

by supriyoch_2008 (Monk)
 on May 11, 2012 at 18:39 UTC Need Help??
supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:

I am beginner in perl programming. Suppose, I have two strings i.e. \$seq1="AAAGATCGTG"; and \$seq2="AGTACTTTCC"; I am interested in counting the number of substitutions of A in \$seq1 by G in \$seq2 i.e. \$A2G and similarly, \$G2A, \$C2T, \$T2C, \$A2T, \$A2C, \$G2T and \$G2C. Some sites like 0 and 5 are same in \$seq1 and \$seq2. Which perl code should I use to find out the number of such substitutions? I don't know the code that I should write in the places marked ?????? in the following program to find the correct results. Can any perlmonk help me to count the substitutions?

```#!usr/bin/perl-w
use strict;
my \$seq1="AAAGATCGTG";
my \$seq2="AGTACTTTCC";
my (\$A2G,\$G2A,\$C2T,\$T2C,\$A2T,\$A2C,\$G2T,\$G2C)
for (0..length(\$seq1)-1){
if (substr(\$seq1, ???) eq substr(\$seq2, ??))
{\$A2G++;\$G2A++;\$C2T++;\$T2C++;\$A2T++;\$A2C++;\$G2T++;\$G2C++;}
else { ?????}
}
print"\n No. of Substitutions:\n
A2G=\$A2G; G2A=\$G2A; C2T=\$C2T; T2C=\$T2C; A2T=\$A2T; A2C=\$A2C; G2T=\$G2T
+; G2C=\$G2C.\n";
exit;

The results should look like:

A2G=1; G2A=1; C2T=1; T2C=1; A2T=1; A2C=1; G2T=1; G2C=1.
• Comment on How can I count the number of substitutions of one letter in a string by another letter in the other string?

Replies are listed 'Best First'.
Re: How can I count the number of substitutions of one letter in a string by another letter in the other string?
by BrowserUk (Pope) on May 11, 2012 at 18:58 UTC

Using individual named vars makes this awkward. Using a hash, makes it simple:

```#! perl =slw
use strict;
use Data::Dump qw[ pp ];

my \$seq1="AAAGATCGTG";
my \$seq2="AGTACTTTCC";

my %subs;
for my \$p ( 0 .. length( \$seq1 ) - 1 ) {
my \$c1 = substr \$seq1, \$p, 1;
my \$c2 = substr \$seq2, \$p, 1;
next if \$c1 eq \$c2;
++\$subs{ \$c1 . '2' . \$c2 };
}

pp \%subs;

__END__
C:\test>junk.pl
{ A2C => 1, A2G => 1, A2T => 1, C2T => 1, G2A => 1, G2C => 1, G2T => 1
+, T2C => 1 }

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

ooh, very pretty.
Thanks, I was hoping to see something nice today.
Re: How can I count the number of substitutions of one letter in a string by another letter in the other string?
by Eliya (Vicar) on May 11, 2012 at 21:10 UTC

In case performance matters and your sequences are longer, you could also use XOR bit logic in combination with tr///:

```#!/usr/bin/perl -w
use strict;

my @cnt = qw(A2G G2A C2T T2C A2T A2C G2T G2C);
my %cnt;

for my \$change (@cnt) {
my (\$c1, \$c2) = split /2/, \$change;
\$c2 =~ tr/ATCG/HRDZ/;
my \$x = \$c1 ^ \$c2;
\$cnt{\$change} = eval "sub { \\$_[0] =~ tr/\$x/\$x/ }";
}

my \$seq1 = "AAAGATCGTG";
my \$seq2 = "AGTACTTTCC";

(my \$tmp = \$seq2) =~ tr/ATCG/HRDZ/;
my \$diff = \$seq1 ^ \$tmp;

print join("; ", map { \$_."=".\$cnt{\$_}->(\$diff) } @cnt ), ".";

__END__
A2G=1; G2A=1; C2T=1; T2C=1; A2T=1; A2C=1; G2T=1; G2C=1.

The idea here is that XORing two strings produces a different value X for each combination of characters, which you can then count with a simple tr/X/X/.  The additional prior transformation tr/ATCG/HRDZ/ on one of the strings is required because the XOR operation would otherwise produce symmetric/undirected results, i.e. A2G would be indistinguishable from G2A, etc.

(Just in case it isn't clear:  the %cnt hash doesn't hold the count results, but rather the subs doing the respective tr/X/X/ counting.)

Whilst orders of magnitude faster, as posted, your solution is missing some possibilities.

Given the input of:

```\$seq1 = 'AAAACCCCGGGGTTTT';
\$seq2 = 'ACGTACGTACGTACGT';

It should produce 12 counts:

```{
A2C => 1,  A2G => 1,  A2T => 1,  C2A => 1,  C2G => 1,  C2T => 1,
G2A => 1,  G2C => 1,  G2T => 1,  T2A => 1,  T2C => 1,  T2G => 1,
}

But only produces the following eight:

```A2G=1; G2A=1; C2T=1; T2C=1; A2T=1; A2C=1; G2T=1; G2C=1.

I'm sure that this is eminently fixable.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

your solution is missing some possibilities.

Well, I implemented (on purpose) exactly the list the OP asked for.  It can of course be extended, if so desired:

```my @cnt = qw(A2G G2A C2T T2C A2T A2C G2T G2C ...);

Alternatively, simply compute the full matrix of count routines:

```#!/usr/bin/perl -w
use strict;

my %cnt;
my @c = qw(A T C G);
for my \$c1 (@c) {
for my \$c2 (@c) {
next if \$c1 eq \$c2;
(my \$t = \$c2) =~ tr/ATCG/HRDZ/;
my \$x = \$c1 ^ \$t;
\$cnt{\$c1."2".\$c2} = eval "sub { \\$_[0] =~ tr/\$x/\$x/ }";
}
}

my \$seq1 = 'AAAACCCCGGGGTTTT';
my \$seq2 = 'ACGTACGTACGTACGT';

(my \$tmp = \$seq2) =~ tr/ATCG/HRDZ/;
my \$diff = \$seq1 ^ \$tmp;

print join("; ", map { \$_."=".\$cnt{\$_}->(\$diff) } sort keys %cnt ), ".
+";

__END__
A2C=1; A2G=1; A2T=1; C2A=1; C2G=1; C2T=1; G2A=1; G2C=1; G2T=1; T2A=1;
+T2C=1; T2G=1.

Create A New User
Node Status?
node history
Node Type: perlquestion [id://970077]
Approved by sauoq
Front-paged by sauoq
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (5)
As of 2018-05-21 20:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
World peace can best be achieved by:

Results (163 votes). Check out past polls.

Notices?