http://www.perlmonks.org?node_id=812599

MaroonBalloon has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to actually iterate through my different nucleotide sequences searching for a query pattern to determine how many of these patterns I can find that are identical matches and which ones are slight mismatches. Here is my beginning. Say I am searching TGTGTGTGTGTG for all sequences matching the search TX Basically, if the program checks from left to right, the $x means that there will be 9 chances to check through this 10nt seequence. If my search was TXXX, then there would be 7 different sequences to check.
while (my ($key,$value) = each(%o)) { my $i; my $j; my $x = (length($value)-(length($search)-1));# x = the length of value + minus (the length of search minus 1) AKA how many times to iterate.. +. for (i=0; $x; $i++) { $l_seq = substr($value, $i, length($search)) for (j=0; j<(length($search)); $j++) { $match += substr($l_seq, $j, 1) eq substr($search,$j,1) } $mis_match (??)
It ends abruptly; I am not sure how to proceed in order to allow for mis-matching. Thanks, ER

Replies are listed 'Best First'.
Re: Pattern searching allowing for mis-matches...
by GrandFather (Saint) on Dec 13, 2009 at 19:51 UTC

    My favorite trick for this sort of fuzzy matching is to combine bit wise xor and the transliteration operator (used in this context for counting). Consider:

    use strict; use warnings; my $target = 'TGATTGAATCAAGGTGTTTT'; my $match = 'TGAT'; my $quality = 0.75; my $matchLen = length $match; my $matchNum = int ($matchLen * $quality); for my $offset (0 .. length ($target) - $matchLen) { my $test = substr $target, $offset, $matchLen; my $matched = ($test ^ $match) =~ tr/\x00//; next if $matched < $matchNum; print "Found <$test> at offset $offset which matches in $matched p +laces\n"; }

    Prints:

    Found <TGAT> at offset 0 which matches in 4 places Found <TGAA> at offset 4 which matches in 3 places Found <TGTT> at offset 14 which matches in 3 places

    True laziness is hard work
      The only problem with that is the threshold is not exact, for example, if the threshold is instead .5, and the length of the query is three characters, I still come up with all the matches that have at least (1/3). I have tried fixing this to no avail, here is my attempt:
      foreach my $key (keys %o) { my $keydisplay=1; my $target = $o{$key}; #my $threshold = 0.79; my $searchlength = length $search; my $truethreshold = int($searchlength * $threshold*100) /100; #(page46 +) if ($searchlength - $truethreshold >= .005) { $truethreshold += .01; } $searchlength = $truethreshold; for my $position (0 .. (length ($target) - $searchlength)) { my $test = substr $target, $position, $searchlength; my $matched = ($test ^ $search) =~ tr/\x00//; next if $matched < $truethreshold; print "\t\tFound <$test> at position $position which matches in $m +atched of $searchlength places\n"; } }

        Characters are discrete entities. What outcome do you expect from matching 1/2 a character?


        True laziness is hard work
Re: Pattern searching allowing for mis-matches...
by BrowserUk (Patriarch) on Dec 13, 2009 at 18:32 UTC

    Something like this?

    $seq = 'TGTGTGTGTGTG';; @tests = qw[ TX TXXX ];; for my $test ( @tests ) { ( my $re = $test ) =~ tr[X][.]; print "'$1'" while $seq =~ m[(?=($re))]g; };; 'TG' 'TG' 'TG' 'TG' 'TG' 'TG' 'TGTG' 'TGTG' 'TGTG' 'TGTG' 'TGTG'

    Replacing your 'X's with '.'s gives you a wildcarding ability.

    By placing the capture inside a lookahead, the matchpoint will only be advanced by one position after each match, rather than the length of the match. That allows overlapping matches to be found.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Pattern searching allowing for mis-matches...
by Albannach (Monsignor) on Dec 13, 2009 at 21:16 UTC
    Another option I often use for this sort of thing is Text::Levenshtein. You should use the XS version if available for your platform though as the speed difference is significant. Below I've assumed the string must have the same length, but that is not necessary as an insertion or deletion also counts as distance. Accommodating variable length strings is left as an exercise for the reader. ;-)
    use strict; use warnings; use Text::Levenshtein qw(distance); my $text = 'TGATTGAA'; my $search = 'TGAT'; my $fuzz = 1; # how far off a match can we be for my $start (0..(length($text) - length($search)) ) { my $chunk = substr($text,$start,length $search); print "checking for $search from position $start: $chunk: "; my $dist = distance($search, substr($text, $start, length $search)); if($dist == 0) { print "Match!\n"; }elsif($dist <= $fuzz) { print "Close enough\n"; }else{ print "nope\n"; } }

    checking for TGAT from position 0: TGAT: Match!
    checking for TGAT from position 1: GATT: nope
    checking for TGAT from position 2: ATTG: nope
    checking for TGAT from position 3: TTGA: nope
    checking for TGAT from position 4: TGAA: Close enough

    --
    I'd like to be able to assign to an luser

Re: Pattern searching allowing for mis-matches...
by bv (Friar) on Dec 13, 2009 at 17:33 UTC

    You could probably do better using regular expressions with Global matching. Without completely understanding what you mean by "slight mismatches," (please explain!) I think something like this would accomplish your exact matches:

    $matches = ($value =~ /$search/g);

    UPDATE: using scalar where I did was wrong.


    @_=qw; Just another Perl hacker,; ;$_=q=print "@_"= and eval;
      Yes, the direct matches are not a problem. I am glad you could confirm the code. For the mismatch code I mean: I have a hash value which is a DNA sequence, example:
      TGATTGAA

      If my % threshold was .75 for example, and if I was searching for TGAT, I would like for the program to tell me that there able for the program to find the identical match at position 1, AND identify the second match at position 5 as an "acceptable mismatch".

      With respect to your suggestion of regexp...Is it possible to type those in on the command line? Currently I use @ARGV[0] as my $search query, typically something like "ATC".
      Thanks! ER

        Your second question is much easier: Yes. Regular expressions can be built from any string, including those supplied by users. Generally, you should use quotemeta or the \Q and \E markers to make sure the string is free from regular expression meta characters like *, ., and more evil eval-type expressions. In your case, you could also check that the string is a valid nt sequence:

        my $string = quotemeta shift; die "Not a valid nucleotide sequence" if $string =~ /[^AGTC]/;

        As for the first question, one way would be to build a regex for each possibility. An example:

        my $string = "TGAT"; my @nts = map { my $tmp = $string; substr $tmp, $_, 1, '.'; $tmp; } (0 .. length ($string) -1); my $groupings = join '|', @nts; my $sample = "TGATTGGAATGTTAGAT"; while ( $sample =~ /($groupings)/go ) { print "Matched $1 ending at position ", pos $sample, "\n"; }

        @_=qw; Just another Perl hacker,; ;$_=q=print "@_"= and eval;
Re: Pattern searching allowing for mis-matches...
by erix (Prior) on Dec 13, 2009 at 17:24 UTC

    It's not entirely clear to me what you are trying to do, but in case you're looking for motifs: MEME may be more efficient. Both binaries and webinterface are available here:

    MEME website