http://www.perlmonks.org?node_id=970077

supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:

I am beginner in perl programming. Suppose, I have two strings i.e. $seq1="AAAGATCGTG"; and $seq2="AGTACTTTCC"; I am interested in counting the number of substitutions of A in $seq1 by G in $seq2 i.e. $A2G and similarly, $G2A, $C2T, $T2C, $A2T, $A2C, $G2T and $G2C. Some sites like 0 and 5 are same in $seq1 and $seq2. Which perl code should I use to find out the number of such substitutions? I don't know the code that I should write in the places marked ?????? in the following program to find the correct results. Can any perlmonk help me to count the substitutions?

#!usr/bin/perl-w use strict; my $seq1="AAAGATCGTG"; my $seq2="AGTACTTTCC"; my ($A2G,$G2A,$C2T,$T2C,$A2T,$A2C,$G2T,$G2C) for (0..length($seq1)-1){ if (substr($seq1, ???) eq substr($seq2, ??)) {$A2G++;$G2A++;$C2T++;$T2C++;$A2T++;$A2C++;$G2T++;$G2C++;} else { ?????} } print"\n No. of Substitutions:\n A2G=$A2G; G2A=$G2A; C2T=$C2T; T2C=$T2C; A2T=$A2T; A2C=$A2C; G2T=$G2T +; G2C=$G2C.\n"; exit;

The results should look like:

A2G=1; G2A=1; C2T=1; T2C=1; A2T=1; A2C=1; G2T=1; G2C=1.
  • Comment on How can I count the number of substitutions of one letter in a string by another letter in the other string?
  • Select or Download Code

Replies are listed 'Best First'.
Re: How can I count the number of substitutions of one letter in a string by another letter in the other string?
by BrowserUk (Patriarch) on May 11, 2012 at 18:58 UTC

    Using individual named vars makes this awkward. Using a hash, makes it simple:

    #! perl =slw use strict; use Data::Dump qw[ pp ]; my $seq1="AAAGATCGTG"; my $seq2="AGTACTTTCC"; my %subs; for my $p ( 0 .. length( $seq1 ) - 1 ) { my $c1 = substr $seq1, $p, 1; my $c2 = substr $seq2, $p, 1; next if $c1 eq $c2; ++$subs{ $c1 . '2' . $c2 }; } pp \%subs; __END__ C:\test>junk.pl { A2C => 1, A2G => 1, A2T => 1, C2T => 1, G2A => 1, G2C => 1, G2T => 1 +, T2C => 1 }

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      ooh, very pretty.
      Thanks, I was hoping to see something nice today.
Re: How can I count the number of substitutions of one letter in a string by another letter in the other string?
by Eliya (Vicar) on May 11, 2012 at 21:10 UTC

    In case performance matters and your sequences are longer, you could also use XOR bit logic in combination with tr///:

    #!/usr/bin/perl -w use strict; my @cnt = qw(A2G G2A C2T T2C A2T A2C G2T G2C); my %cnt; for my $change (@cnt) { my ($c1, $c2) = split /2/, $change; $c2 =~ tr/ATCG/HRDZ/; my $x = $c1 ^ $c2; $cnt{$change} = eval "sub { \$_[0] =~ tr/$x/$x/ }"; } my $seq1 = "AAAGATCGTG"; my $seq2 = "AGTACTTTCC"; (my $tmp = $seq2) =~ tr/ATCG/HRDZ/; my $diff = $seq1 ^ $tmp; print join("; ", map { $_."=".$cnt{$_}->($diff) } @cnt ), "."; __END__ A2G=1; G2A=1; C2T=1; T2C=1; A2T=1; A2C=1; G2T=1; G2C=1.

    The idea here is that XORing two strings produces a different value X for each combination of characters, which you can then count with a simple tr/X/X/.  The additional prior transformation tr/ATCG/HRDZ/ on one of the strings is required because the XOR operation would otherwise produce symmetric/undirected results, i.e. A2G would be indistinguishable from G2A, etc.

    (Just in case it isn't clear:  the %cnt hash doesn't hold the count results, but rather the subs doing the respective tr/X/X/ counting.)

      Whilst orders of magnitude faster, as posted, your solution is missing some possibilities.

      Given the input of:

      $seq1 = 'AAAACCCCGGGGTTTT'; $seq2 = 'ACGTACGTACGTACGT';

      It should produce 12 counts:

      { A2C => 1, A2G => 1, A2T => 1, C2A => 1, C2G => 1, C2T => 1, G2A => 1, G2C => 1, G2T => 1, T2A => 1, T2C => 1, T2G => 1, }

      But only produces the following eight:

      A2G=1; G2A=1; C2T=1; T2C=1; A2T=1; A2C=1; G2T=1; G2C=1.

      I'm sure that this is eminently fixable.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

        your solution is missing some possibilities.

        Well, I implemented (on purpose) exactly the list the OP asked for.  It can of course be extended, if so desired:

        my @cnt = qw(A2G G2A C2T T2C A2T A2C G2T G2C ...);

        Alternatively, simply compute the full matrix of count routines:

        #!/usr/bin/perl -w use strict; my %cnt; my @c = qw(A T C G); for my $c1 (@c) { for my $c2 (@c) { next if $c1 eq $c2; (my $t = $c2) =~ tr/ATCG/HRDZ/; my $x = $c1 ^ $t; $cnt{$c1."2".$c2} = eval "sub { \$_[0] =~ tr/$x/$x/ }"; } } my $seq1 = 'AAAACCCCGGGGTTTT'; my $seq2 = 'ACGTACGTACGTACGT'; (my $tmp = $seq2) =~ tr/ATCG/HRDZ/; my $diff = $seq1 ^ $tmp; print join("; ", map { $_."=".$cnt{$_}->($diff) } sort keys %cnt ), ". +"; __END__ A2C=1; A2G=1; A2T=1; C2A=1; C2G=1; C2T=1; G2A=1; G2C=1; G2T=1; T2A=1; +T2C=1; T2G=1.