http://www.perlmonks.org?node_id=970102


in reply to How can I count the number of substitutions of one letter in a string by another letter in the other string?

In case performance matters and your sequences are longer, you could also use XOR bit logic in combination with tr///:

#!/usr/bin/perl -w use strict; my @cnt = qw(A2G G2A C2T T2C A2T A2C G2T G2C); my %cnt; for my $change (@cnt) { my ($c1, $c2) = split /2/, $change; $c2 =~ tr/ATCG/HRDZ/; my $x = $c1 ^ $c2; $cnt{$change} = eval "sub { \$_[0] =~ tr/$x/$x/ }"; } my $seq1 = "AAAGATCGTG"; my $seq2 = "AGTACTTTCC"; (my $tmp = $seq2) =~ tr/ATCG/HRDZ/; my $diff = $seq1 ^ $tmp; print join("; ", map { $_."=".$cnt{$_}->($diff) } @cnt ), "."; __END__ A2G=1; G2A=1; C2T=1; T2C=1; A2T=1; A2C=1; G2T=1; G2C=1.

The idea here is that XORing two strings produces a different value X for each combination of characters, which you can then count with a simple tr/X/X/.  The additional prior transformation tr/ATCG/HRDZ/ on one of the strings is required because the XOR operation would otherwise produce symmetric/undirected results, i.e. A2G would be indistinguishable from G2A, etc.

(Just in case it isn't clear:  the %cnt hash doesn't hold the count results, but rather the subs doing the respective tr/X/X/ counting.)

  • Comment on Re: How can I count the number of substitutions of one letter in a string by another letter in the other string?
  • Select or Download Code

Replies are listed 'Best First'.
Re^2: How can I count the number of substitutions of one letter in a string by another letter in the other string?
by BrowserUk (Patriarch) on May 11, 2012 at 22:19 UTC

    Whilst orders of magnitude faster, as posted, your solution is missing some possibilities.

    Given the input of:

    $seq1 = 'AAAACCCCGGGGTTTT'; $seq2 = 'ACGTACGTACGTACGT';

    It should produce 12 counts:

    { A2C => 1, A2G => 1, A2T => 1, C2A => 1, C2G => 1, C2T => 1, G2A => 1, G2C => 1, G2T => 1, T2A => 1, T2C => 1, T2G => 1, }

    But only produces the following eight:

    A2G=1; G2A=1; C2T=1; T2C=1; A2T=1; A2C=1; G2T=1; G2C=1.

    I'm sure that this is eminently fixable.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      your solution is missing some possibilities.

      Well, I implemented (on purpose) exactly the list the OP asked for.  It can of course be extended, if so desired:

      my @cnt = qw(A2G G2A C2T T2C A2T A2C G2T G2C ...);

      Alternatively, simply compute the full matrix of count routines:

      #!/usr/bin/perl -w use strict; my %cnt; my @c = qw(A T C G); for my $c1 (@c) { for my $c2 (@c) { next if $c1 eq $c2; (my $t = $c2) =~ tr/ATCG/HRDZ/; my $x = $c1 ^ $t; $cnt{$c1."2".$c2} = eval "sub { \$_[0] =~ tr/$x/$x/ }"; } } my $seq1 = 'AAAACCCCGGGGTTTT'; my $seq2 = 'ACGTACGTACGTACGT'; (my $tmp = $seq2) =~ tr/ATCG/HRDZ/; my $diff = $seq1 ^ $tmp; print join("; ", map { $_."=".$cnt{$_}->($diff) } sort keys %cnt ), ". +"; __END__ A2C=1; A2G=1; A2T=1; C2A=1; C2G=1; C2T=1; G2A=1; G2C=1; G2T=1; T2A=1; +T2C=1; T2G=1.