Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Mismatch Positions of Ambiguous String

by monkfan (Curate)
on Jan 25, 2006 at 09:32 UTC ( #525389=perlquestion: print w/replies, xml ) Need Help??

monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I have two strings that I want to compute the number of mismatches between them. These two strings are of the "same" size. Let's call them 'source' string and 'target' string. Now, the problem is that the 'source' string may come in ambiguous form, meaning that in one position they may contain more than 1 (upto 4) characters. The ambiguous position is marked with square bracketed [ATCG] region.

Is there an efficient way I can compute the number of mismatch between "source" and "target", of the above situation? The example is as follows:

Example 1 (where the source is ambiguous):
my $source1 = '[TCG]GGGG[AT]'; # ambiguous my $target1 = 'AGGGGC'; # No of mismatch = 2 on position 1 and 6 my $target2 = 'TGGGGC'; # No of mismatch = 1 on position 6 only
I have no problem when dealing when 'source' string is not ambigous like this
Example 2 (where the source is NOT ambiguous):
my $source2 = 'TGGGGT'; # not-ambiguous my $target1 = 'AGGGGC'; # No of mismatch = 2 on position 1 and 6 my $target3 = 'TGGGGT'; # No of mismatch = 0 all position matches
For example I can use bitwise operator to do it. But for ambigous case I'm lost.

PS: "target" string can be assumed to be never ambiguous.

Regards,
Edward

Replies are listed 'Best First'.
Re: Mismatch Positions of Ambiguous String
by GrandFather (Saint) on Jan 25, 2006 at 11:10 UTC

    I like bitwise operators :)

    use strict; use warnings; my @targets = qw(AGGGGC TGGGGC); my @sources = (<DATA>); chomp @sources; for my $source (@sources) { my @fields = grep length, split /(\[[^[\]]+\]|[^[\]]+)/, $source; my @cells = map {(index $_, '[') == 0 ? $_ : split ''} @fields; my $matchStr = join '', map {/\[/ ? "\x80" : $_} @fields; for my $target (@targets) { my @segments = grep length, split /(\x00+|.)/, $target ^ $matchStr +; my $index = 0; for my $field (@segments) { next if length ($field) > 1 or $field eq "\x00"; # Only here for an ambiguous char my $str = $cells[$index]; my $chr = substr $target, $index, 1; next if $chr =~ /$str/; print "Mismatch at index $index in $source against $target\n"; } continue { $index += length $field; } } } __DATA__ [TCG]GGGG[AT] TGGGGT [AC][TCG]GGG[AT]

    Prints:

    Mismatch at index 0 in [TCG]GGGG[AT] against AGGGGC Mismatch at index 5 in [TCG]GGGG[AT] against AGGGGC Mismatch at index 5 in [TCG]GGGG[AT] against TGGGGC Mismatch at index 0 in TGGGGT against AGGGGC Mismatch at index 5 in TGGGGT against AGGGGC Mismatch at index 5 in TGGGGT against TGGGGC Mismatch at index 5 in [AC][TCG]GGG[AT] against AGGGGC Mismatch at index 0 in [AC][TCG]GGG[AT] against TGGGGC Mismatch at index 5 in [AC][TCG]GGG[AT] against TGGGGC

    DWIM is Perl's answer to Gödel
Re: Mismatch Positions of Ambiguous String
by pKai (Priest) on Jan 25, 2006 at 13:09 UTC

    Since characters in square brackets happen to be character classes in regexes, why not use them as such!?

    use strict; use warnings; use Algorithm::Loops qw(MapCarE); MapCarE { my ($s, $t) = @_; my $x= $t =~ /$s/ ? '~' : '!'; print "$t $x~ $s\n"; } [shift =~ /\[\w+\]|\w/g], # split source into chars or char classes [split //, shift]; # split target into chars

    D:\temp>mm [TCG]GGGG[AT] AGGGGC A !~ [TCG] G ~~ G G ~~ G G ~~ G G ~~ G C !~ [AT] D:\temp>mm [TCG]GGGG[AT] TGGGGC T ~~ [TCG] G ~~ G G ~~ G G ~~ G G ~~ G C !~ [AT]

    Using Algorithm::Loops just for the convenience to iterate over two lists at once without distracting from the main point.

Re: Mismatch Positions of Ambiguous String
by hv (Parson) on Jan 25, 2006 at 10:15 UTC

    I was going to recommend String::Approx, but it turns out that it doesn't yet support regexp patterns. Here's a simple approach to doing it yourself:

    sub mismatches { my($source, $target) = @_; my @sparts = ($source =~ /(\[.*?\]|.)/g); scalar grep substr($target, $_, 1) !~ /^$sparts[$_]/, 0 .. $#spart +s; }

    This assumes source and target are the same length; if you're going to be searching for the closest match within a longer target you'll need a bit more work, and if you need decent performance on very large strings you'll almost certainly want to code it in C/XS.

    Hugo

      Dear hv,
      Sorry for having have to come back to you again. How can I extend your code above, such that it can handle the target which is also ambiguous like this:
      my $source1 = '[TCG]GGGG[AT]'; # ambiguous my $target1 = 'AGGGG[CT]'; # No of mismatch = 1 only at position 1
      Also the size of source and target will be the same. I tried this:
      sub mismatches { my($source, $target) = @_; my @sparts = ($source =~ /(\[.*?\]|.)/g); my @tparts = ($target =~ /(\[.*?\]|.)/g); scalar grep $tparts[$_] !~ /^$sparts[$_]/, 0 .. $#sparts; }
      But doesn't work. Where did I go wrong?

      Regards,
      Edward

        But doesn't work. Where did I go wrong?

        It isn't as simple as that: my original approach relied on treating each ambiguous [ACGT] in the source string as a regexp, but to do that it needs to know a) which is the fixed string and which the regexp, and b) that they are not both regexps.

        The simplest extension is as below, but this suffers further on speed - it will probably be too slow if you're dealing with strings a few thousand base pairs long:

        sub mismatches { my($source, $target) = @_; my @sparts = ($source =~ /(\[.*?\]|.)/g); my @tparts = ($target =~ /(\[.*?\]|.)/g); scalar grep { my($s, $t) = ($sparts[$_], $tparts[$_]); $s !~ /\[/ ? ($s !~ /$t/) : $t !~ /\[/ ? ($t !~ /$s/) : !intersect($s, $t) } 0 .. $#sparts; } sub intersect { my($s, $t) = @_; my %seen = map +($_ => 1), $s =~ /[^\[\]]/g; scalar grep $seen{$_}, $t =~ /[^\[\]]/g; }

        This says: if source is not ambiguous, treat the corresponding fragment of the target as a regexp; else if the target is not ambiguous, treat the source fragment as a regexp; else check a full intersection of the two.

        If your strings only include ACGT, a more efficient approach would be to transform each string into a bit vector that sets a bit for each base that may be present:

        my %bits = ('A' => 1, 'C' => 2, 'G' => 4, 'T' => 8); my $source1 = bitwise('[TCG]GGGG[AT]'); my $target1 = bitwise('AGGGG[CT]'); print mismatches($source, $target1), "\n"; sub mismatches { my($source, $target) = @_; ($source & $target) =~ tr/\0//; } sub bitwise { my $string = shift; join '', map { my $char = 0; $char |= $bits{$_} for /[ACGT]/g; chr($char) } $string =~ /(\[.*?\]|.)/g; }

        Once the strings are transformed into this bitwise representation, checking for mismatches is very fast even with long strings.

        Hugo

Re: Mismatch Positions of Ambiguous String
by prasadbabu (Prior) on Jan 25, 2006 at 10:17 UTC

    Here is my try for both ambiguous and not ambiguous strings. But this can be still simplified.

    use strict; use warnings; my $source= '[TCG]GGGG[AT]'; # ambiguous my $target1 = 'AGGGGC'; # No of mismatch = 2 on position 1 and 6 my $target2 = 'TGGGGC'; # No of mismatch = 1 on position 6 only my @tar = split //, $target1; my @t = @tar; for my $i (1..($#t+1)) { my $a =''; if ($source !~ /^\[/) #not ambiguous { ($a) = $source =~ /^(.)/s; $source =~ s/^.//s; my $s = shift @tar; ($s eq $a) ? print "$s matches position $i in target\n" : prin +t "$s not matches position $i in target\n"; } else #ambiguous { ($a) = $source =~ /^\[([^\]]*)/s; my @a = split //, $a; $source =~ s/^([^\]]*)\]//s; my $s = shift @tar; (my @m = grep /$s/, @a) ? print "[@a] matches position $i in t +arget\n" : print "[@a] not matches position $i in target\n"; } }

    Prasad

Re: Mismatch Positions of Ambiguous String
by tweetiepooh (Hermit) on Jan 25, 2006 at 11:14 UTC
    Maybe you need to think of the problem in reverse ie the target is the source and the source the target and see where that works.
    target = [ACT]GCTATC[GC] source = TGCTATCA
    Then you can apply each character or group in the target as a regex to the source. The answer should be the same if as in examples they are the same length.

    If not then you may need to keep some form of pointer to the actual place in the sequence (this looks like gene/DNA sequences).
Re: Mismatch Positions of Ambiguous String
by Samy_rio (Vicar) on Jan 25, 2006 at 10:52 UTC

    Hi monkfan, If i understood your question correctly, it helps you.

    use strict; my %compare = ('[TCG]GGGG[AT]'=>['AGGGGC', 'TGGGGC'], 'TGGGGT'=>['AGGG +GC','AGGGGC']); for my $ky (keys (%compare)) { my @source = split /GGGG/, $ky; my @arr = @{$compare{$ky}}; print "\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"; print "\nSource:\t", $ky; print "\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"; for my $t (0..$#arr){ my @tar = split /GGGG/,$arr[$t]; print "\nTarget:\t", $arr[$t], "\n"; my $mis = 0; for my $po (0..$#source){ if ($tar[$po] !~ m/$source[$po]/){ $mis++; print "\nPosition for Mismatch : ",$po + 1 if ($po == +0); print "\nPosition for Mismatch : ",$po + 5 if ($po == +1); } } print "\nNo of Mismatch = $mis\n"; } print "\n+++++++++++++++++++++++++++++\n"; } __END__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Source: [TCG]GGGG[AT] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Target: AGGGGC Position for Mismatch : 1 Position for Mismatch : 6 No of Mismatch = 2 Target: TGGGGC Position for Mismatch : 6 No of Mismatch = 1 +++++++++++++++++++++++++++++ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Source: TGGGGT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Target: AGGGGC Position for Mismatch : 1 Position for Mismatch : 6 No of Mismatch = 2 Target: AGGGGC Position for Mismatch : 1 Position for Mismatch : 6 No of Mismatch = 2 +++++++++++++++++++++++++++++

    Regards,
    Velusamy R.


    eval"print uc\"\\c$_\""for split'','j)@,/6%@0%2,`e@3!-9v2)/@|6%,53!-9@2~j';

Re: Mismatch Positions of Ambiguous String
by monkey_boy (Priest) on Jan 25, 2006 at 09:56 UTC
    As a first try, i would just permutate your ambiguose string producing all possible strings, then do your bitwise comparrison of all source strings vs the target string and take the lowest resulting score, this may not scale well , depending on the size of your ambigous string.



    This is not a Signature...
Re: Mismatch Positions of Ambiguous String
by murugu (Curate) on Jan 25, 2006 at 15:45 UTC

    Hi,

    Here is my substring approach.

    use strict; use warnings; my $str = 'TTCTTCT'; my $tar = 'ATGTTCT'; my $i =0; while (length($str)) { if (substr($str,0,1) eq '[') { my $index = index $str,']'; my $m = substr($str,0,$index+1); print "Mismatch at index $i\n" unless (substr($tar,$i,1)=~/$m/) +; substr($str,0,$index+1)=""; } else { print "Mismatch at index $i\n" unless (substr($str,0,1) eq subs +tr($tar,$i,1)); substr($str,0,1)=""; } $i++; }

    Regards,
    Murugesan Kandasamy
    use perl for(;;);

Re: Mismatch Positions of Ambiguous String
by Roy Johnson (Monsignor) on Jan 25, 2006 at 19:14 UTC
    A little different way:
    use strict; use warnings; my $source1 = '[TCG]GGGG[AT]'; # ambiguous my $target1 = 'AGGGGC'; # No of mismatch = 2 on position 1 and 6 my $target2 = 'TGGGGC'; # No of mismatch = 1 on position 6 only # Turn the source into a regexp that accepts any character at each # position, but only captures the matches (my $sourcepat = $source1) =~ s/(\[.*?\]|.)/(?:($1)|.)/g; for ($target1, $target2) { # Substitute _ for undef in the matches returned my @result = map defined()?$_:'_', /$sourcepat/; print "$_: @result\n"; } __DATA__ Output is: AGGGGC: _ G G G G _ TGGGGC: T G G G G _

    Caution: Contents may have been coded under pressure.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://525389]
Approved by Corion
Front-paged by jbrugger
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (3)
As of 2021-11-27 02:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?