|Problems? Is your data what you think it is?|
Fuzzy Searching: Optimizing Algorithm Selectionby Itatsumaki (Friar)
|on Nov 10, 2004 at 23:23 UTC||Need Help??|
Itatsumaki has asked for the
wisdom of the Perl Monks concerning the following question:
I'm well-aware that fuzzy-matching (e.g. Levenstein and String::Approx) have been discussed here several times. I need to do a very large quantity of fuzzy matching, and I'm looking for some theoretical suggestions on how to start developing. This is to be design-time optimization, not premature optimization, I hope.
The basic problem:
Performance matters because I have a library of several hundred thousand 25-letter strings. And because I'll be repeating this analysis for all pairwise combinations of about 10 search and 20 target dictionaries.
Three basic approaches come to mind for the fuzzy-matching:
So you can see the reason I'm looking for suggestions now -- only one of these three approaches works on single-strings. The performance of the other two won't seem to scale with the search and target library sizes in any trivial way. I'm not too sure how to figure out which one will be optimal for any given situation.
Any suggestions on how to determine which might be optimal for a given situation? And are there other (potentially faster) approaches I might have missed?
One note: I *am* aware that BLAST can do option #2 above -- I'm more trying to determine which option will be faster to run than worrying about ease-of-implementation right now.-Tats