|We don't bite newbies here... much|
Fuzzy String matching with index?by mir192a (Novice)
|on Sep 05, 2006 at 03:25 UTC||Need Help??|
mir192a has asked for the wisdom of the Perl Monks concerning the following question:
Hey - I'm working on a bioinformatics app. I need to do basically a fuzzy string match, and I need to determine the index at which the match was predicted (the length of said "match" would be cool too, but that is secondary).
I have tried String::Approx, String::Compare, agrep, egrep, of course regexs, String::Similarity, Text::Levenshtein, Algorithm::HowSimilar, et al. So far the best results have come by using agrep to do the initial fuzzy match (with the -# parm for # of allowed mismatches) and then using String::Approx aslice with the minimal distance option to determine the index and length. This is far from ideal.
The issues that I'm seeing with this strategy (besides the lack of cool factor):
1. Levenshtein inherently works for real strings (like the ones people use) not strings of RNA. For example, there are lots of "ACGG"s in my matched string, and the first one found probably isn't the one the fuzzy match found.
2. String::Approx plainly admits to have unreliable reports of both index and length - that is true.
3. The allowed mismatches parameters in String::Approx don't seem to be honored in all circumstances.
4. Speed - I will have to repeat this fuzzy match logic hundreds of thousands of times in a trial. Neither file from which I pull these strings will fit in memory.
5. Anchoring - I'd like to be able to anchor the match to the middle...the characters near the square brackets in string 3 are the most important.
Example: I want to know if string 1 kind of matches string 2. I'd really like to say "allow exactly 2 substitutions" or "tell me about stuff that matches 92%", etc. I'm agrepping (or amatching, both using Levenshtein) for string 1 in string 2. If nothing is found, I slice off a char from each end of string 2 and try the agrep/amatch again. I'm currently finding a "match" when string 2 has been reduced to string 3. At this time my reported index is 7, and length is 8. That is a good result - but it would be much better if the length was right every time, it was faster, and I could have finer control over the allowed edit distance.
string 1 - AATCCCTGGATTGAAGCGCaa
string 2 - 252: rs17347089 [Homo sapiens] TAATAAAAACTATATTCCTGGAATT[C/T]^GTTCTAAGTAAAAAAGTTTTCAAATT
string 3 - GGATTGA
If anyone knows of a better way to do this, I would greatly appreciate any advise at all. Thank you - Mandi