http://www.perlmonks.org?node_id=969839

supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:

I am a beginner in perl programming. I have three sequences like $a=AAATGCCTT, $b=AAAAGCGTC and $c=AAAGGCGTC, which differ at positions 4, 7 and 9 but the rest are alike. Is it possible to use perl code to find the total number of positions where they differ and where they are alike? In this case, the answer will be 3 (dissimilar) and 6 (similar), respectively. Can any perlmonk suggest me which perl code will compare these sequences for matching?

#!usr/bin/perl-w use strict; my $a=AAATGCCTT; my $b=AAAAGCGTC; my $c=AAAGGCGTC; $match=??? $nonmatch=??? perl code??? print"\n No. of matched positions=$match.\n No. of non-matched positions=$nonmatch.\n"; exit;

The answer should look like:

No. of matched positions=3. No. of non-matched positions=6.
  • Comment on Is it possible to find the number of matching and non-matching positions in strings using perl code?
  • Select or Download Code

Replies are listed 'Best First'.
Re: Is it possible to find the number of matching and non-matching positions in strings using perl code?
by sauoq (Abbot) on May 10, 2012 at 17:04 UTC
    Is it possible to use perl code to find the total number of positions where they differ and where they are alike?

    Of course.

    Here is one straight forward approach:

    #!/usr/bin/perl my ($a, $b, $c) = qw (AAATGCCTT AAAAGCGTC AAAGGCGTC); my ($similar, $dissimilar); for (0 .. length($a)-1) { if (substr($a,$_,1) eq substr($b, $_, 1) and substr($b, $_, 1) eq +substr($c, $_, 1)) { print "MATCH"; $similar ++; } else { print "NO MATCH"; $dissimilar ++; } print " at position $_\n"; } print "There were " . $similar . " similar and " . $dissimilar . " dis +similar.\n";
    -sauoq
    "My two cents aren't worth a dime.";
Re: Is it possible to find the number of matching and non-matching positions in strings using perl code?
by moritz (Cardinal) on May 11, 2012 at 06:46 UTC

    Finding characters where two string differs can be done with bitwise operations. If you binary XOR two strings, positions where both characters are the same come out as a null byte. When doing several comparisons, one can accumulate the differing positions using binary OR:

    use warnings; use strict; use 5.010; # for say() my $a='AAATGCCTT'; my $b='AAAAGCGTC'; my $c='AAAGGCGTC'; my $mask = chr(0) x length $a; for ($b, $c) { $mask |= $a ^ $_; } # just to illustrate what the mask looks like: use Data::Dumper; $Data::Dumper::Useqq = 1; # count number of 0-bytes my $matches =()= $mask =~ /\0/g; say "Matches: ", $matches; say "Non-matches: ", length($a) - $matches;

    This approach should scale well for longer strings, since the binary operations are faster than looping over all characters.

      Nice explanation++.

      One minor change: s/binary operations are faster than looping over all characters/looping over the characters in C is faster than looping over the characters in Perl/


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

        The question is whether or how much C-level looping is there actually. Some CISC processors include string operators so the XOR might actually be a single instruction.

        Jenda
        Enoch was right!
        Enjoy the last years of Rome.

        But the for loop just obfuscates things without adding anything at all.

        It adds generality beyond three strings to compare.

        Also, calling your variable $mask is questionable as you don't really intend to use it as a mask.

        So what do you suggest instead? Your usage of $bits isn't any better, because you don't care about bits, but bytes. But $bytes also wouldn't explain the purpose of the variable.

Re: Is it possible to find the number of matching and non-matching positions in strings using perl code?
by BillKSmith (Monsignor) on May 10, 2012 at 22:49 UTC

    I would split the strings into character arrays and then count the matches returned by the each_array function of the module List::MoreUtils. Nomatches is the length of the string minus number of matches.

    If efficiency is an issue, the substr method already posted is probably better. Profile to be sure.

      I would split the strings into character arrays and then count the matches returned by the each_array function of the module List::MoreUtils.

      Eeek! No... don't do that. And, for that matter, don't use the approach I gave above either. I just wanted to show how, yes, it could easily be done just by automating the way you might do it by hand.

      If you want efficiency, resort to bit twiddling!

      Like this:

      #!/usr/bin/perl my ($a, $b, $c) = qw (AAATGCCTT AAAAGCGTC AAAGGCGTC); my $bits = ($a ^ $b) | ($b ^ $c); my $c = $bits =~ tr/\0/\0/; print "Similar: $c\n";

      :-)

      -sauoq
      "My two cents aren't worth a dime.";