Pattern searching allowing for mis-matches...

MaroonBalloon has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Pattern searching allowing for mis-matches... by GrandFather (Saint) on Dec 13, 2009 at 19:51 UTC
My favorite trick for this sort of fuzzy matching is to combine bit wise xor and the transliteration operator (used in this context for counting). Consider: `use strict; use warnings; my $target = 'TGATTGAATCAAGGTGTTTT'; my $match = 'TGAT'; my $quality = 0.75; my $matchLen = length $match; my $matchNum = int ($matchLen * $quality); for my $offset (0 .. length ($target) - $matchLen) { my $test = substr $target, $offset, $matchLen; my $matched = ($test ^ $match) =~ tr/\x00//; next if $matched < $matchNum; print "Found <$test> at offset $offset which matches in $matched p +laces\n"; }` [download] Prints: `Found <TGAT> at offset 0 which matches in 4 places Found <TGAA> at offset 4 which matches in 3 places Found <TGTT> at offset 14 which matches in 3 places` [download] True laziness is hard work	[reply] [d/l] [select]
Re^2: Pattern searching allowing for mis-matches... by MaroonBalloon (Acolyte) on Dec 15, 2009 at 07:53 UTC
The only problem with that is the threshold is not exact, for example, if the threshold is instead .5, and the length of the query is three characters, I still come up with all the matches that have at least (1/3). I have tried fixing this to no avail, here is my attempt: foreach my $key (keys %o) { my $keydisplay=1; my $target = $o{$key}; #my $threshold = 0.79; my $searchlength = length $search; my $truethreshold = int($searchlength * $threshold*100) /100; #(page46 +) if ($searchlength - $truethreshold >= .005) { $truethreshold += .01; } $searchlength = $truethreshold; for my $position (0 .. (length ($target) - $searchlength)) { my $test = substr $target, $position, $searchlength; my $matched = ($test ^ $search) =~ tr/\x00//; next if $matched < $truethreshold; print "\t\tFound <$test> at position $position which matches in $m +atched of $searchlength places\n"; } } [download]	[reply] [d/l]
Re^3: Pattern searching allowing for mis-matches... by GrandFather (Saint) on Dec 15, 2009 at 19:07 UTC
Characters are discrete entities. What outcome do you expect from matching 1/2 a character? True laziness is hard work	[reply]
Re^4: Pattern searching allowing for mis-matches... by MaroonBalloon (Acolyte) on Dec 15, 2009 at 20:24 UTC
Re: Pattern searching allowing for mis-matches... by BrowserUk (Patriarch) on Dec 13, 2009 at 18:32 UTC
Something like this? `$seq = 'TGTGTGTGTGTG';; @tests = qw[ TX TXXX ];; for my $test ( @tests ) { ( my $re = $test ) =~ tr[X][.]; print "'$1'" while $seq =~ m[(?=($re))]g; };; 'TG' 'TG' 'TG' 'TG' 'TG' 'TG' 'TGTG' 'TGTG' 'TGTG' 'TGTG' 'TGTG'` [download] Replacing your 'X's with '.'s gives you a wildcarding ability. By placing the capture inside a lookahead, the matchpoint will only be advanced by one position after each match, rather than the length of the match. That allows overlapping matches to be found. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "I'd rather go naked than blow up my ass"	[reply] [d/l]
Re: Pattern searching allowing for mis-matches... by Albannach (Monsignor) on Dec 13, 2009 at 21:16 UTC
Another option I often use for this sort of thing is Text::Levenshtein. You should use the XS version if available for your platform though as the speed difference is significant. Below I've assumed the string must have the same length, but that is not necessary as an insertion or deletion also counts as distance. Accommodating variable length strings is left as an exercise for the reader. ;-) `use strict; use warnings; use Text::Levenshtein qw(distance); my $text = 'TGATTGAA'; my $search = 'TGAT'; my $fuzz = 1; # how far off a match can we be for my $start (0..(length($text) - length($search)) ) { my $chunk = substr($text,$start,length $search); print "checking for $search from position $start: $chunk: "; my $dist = distance($search, substr($text, $start, length $search)); if($dist == 0) { print "Match!\n"; }elsif($dist <= $fuzz) { print "Close enough\n"; }else{ print "nope\n"; } }` [download] checking for TGAT from position 0: TGAT: Match! checking for TGAT from position 1: GATT: nope checking for TGAT from position 2: ATTG: nope checking for TGAT from position 3: TTGA: nope checking for TGAT from position 4: TGAA: Close enough -- I'd like to be able to assign to an luser	[reply] [d/l]
Re: Pattern searching allowing for mis-matches... by bv (Friar) on Dec 13, 2009 at 17:33 UTC
You could probably do better using regular expressions with Global matching. Without completely understanding what you mean by "slight mismatches," (please explain!) I think something like this would accomplish your exact matches: `$matches = ($value =~ /$search/g);` UPDATE: using scalar where I did was wrong. `@_=qw; Just another Perl hacker,; ;$_=q=print "@_"= and eval;` [download]	[reply] [d/l] [select]
Re^2: Pattern searching allowing for mis-matches... by MaroonBalloon (Acolyte) on Dec 13, 2009 at 19:29 UTC
Yes, the direct matches are not a problem. I am glad you could confirm the code. For the mismatch code I mean: I have a hash value which is a DNA sequence, example: TGATTGAA If my % threshold was .75 for example, and if I was searching for TGAT, I would like for the program to tell me that there able for the program to find the identical match at position 1, AND identify the second match at position 5 as an "acceptable mismatch". With respect to your suggestion of regexp...Is it possible to type those in on the command line? Currently I use @ARGV[0] as my $search query, typically something like "ATC". Thanks! ER	[reply]
Re^3: Pattern searching allowing for mis-matches... by bv (Friar) on Dec 14, 2009 at 00:08 UTC
Your second question is much easier: Yes. Regular expressions can be built from any string, including those supplied by users. Generally, you should use quotemeta or the \Q and \E markers to make sure the string is free from regular expression meta characters like *, ., and more evil eval-type expressions. In your case, you could also check that the string is a valid nt sequence: `my $string = quotemeta shift; die "Not a valid nucleotide sequence" if $string =~ /[^AGTC]/;` [download] As for the first question, one way would be to build a regex for each possibility. An example: `my $string = "TGAT"; my @nts = map { my $tmp = $string; substr $tmp, $_, 1, '.'; $tmp; } (0 .. length ($string) -1); my $groupings = join '\|', @nts; my $sample = "TGATTGGAATGTTAGAT"; while ( $sample =~ /($groupings)/go ) { print "Matched $1 ending at position ", pos $sample, "\n"; }` [download] `@_=qw; Just another Perl hacker,; ;$_=q=print "@_"= and eval;` [download]	[reply] [d/l] [select]
Re: Pattern searching allowing for mis-matches... by erix (Prior) on Dec 13, 2009 at 17:24 UTC
It's not entirely clear to me what you are trying to do, but in case you're looking for motifs: MEME may be more efficient. Both binaries and webinterface are available here: MEME website	[reply]


Perl: the Markov chain saw
	PerlMonks