A while back someone asked about the best way to do "Fuzzy Matching" on a bulk scale. He wanted to find every possible "fuzzy match" of a set of fixed width strings of a restricted alphabet (ACGT) in a set of DNA data. By "fuzzy match" he meant a match where up to a certain number of chars in the matched string could be different. This node discusses what is involved setting up a reasonably rigourous comparison of the different solutions at hand and shows some comparison results of four solutions currently known to me. In replies we can cover each solution that comes up.
Now first off in order to even begin to compare a set of solutions to this problem we will need to define a common interface. We will use perl OO to make this as painless as possible. We will do this by defining a base class that represents the basic behaviour of a "fuzzy matcher". This class has the following methods:
- my $obj=Fuzzy::Matcher->new($fuzz,$strlen,$words,$wordsize);
- We need a way to create a new fuzzy matcher, and of course we will call it new() as convention demands. We specify that the interface should accept four possibly relevent parameters so that we can cater to as many specific implementations as possible with changing the constructor behavior
- my $obj=Fuzz::Matcher->_init();
- I find its useful when using inheritance to keep the actual constructor (where bless is called) seperate from a follow up init hook as it means you can avoid messing around with SUPER when you need to change the initialization behaviour of a subclass. In the baseclass this does nothing. We use a single '_' to denote that its "private" (or "friend") and not really meant to be called from outside of the class heirarchy. Its called in void context by new() immediately before returning the object to the user with any additional arguments that were provided to new() over those it is defined to accept. It should die on error.
- my $obj=$Fuzz_Matcher->fuzz_store($str);
Tell the matcher it should match $str in a search. Returns nothing. Used to load the object with the strings being searched for.
- my $obj=$Fuzzy_Matcher->prepare($str);
Its quite possible that a fuzzy matcher implementation could need to do some kind of pre-search preparation once it knows no more strings will be added to it. This method provides that hook. Before searching all users of the object should call this first. Returns nothing.
- my $obj=$Fuzzy_Matcher->fuzz_search($str);
Search $str for any fuzzy matches that are present. It should return a list of triplets, $ofs, $diff, $str that represent each match. $ofs is the position of the matched string in $str. The number of characters different is returned in $diff and the string matched is returned in $str. It should return the empty list if there were no matches. The user is expected to call prepare() on the object before calling this method. If they have not done so the behaviour is undefined.
And with that simple definition (and even more useful, the code it represents) any different implementations that people come up with may be easily cross tested against each other. Isnt OO fun :-)
Next we need a tool to create tests. We will want to test with a variety of different combinations of input strings and search strings so we will pull out Getopt::Long and whip together "make_test.pl" that will produce a single file that represents one set of such tests. Naturally once we have a bunch of test files we need something to do with them. For that we put together "test_fuzz.pl". We want to do testing in a pristine memory condition so we use some tricks to allow test_fuzz.pl to spawn copies of itself for individual tests and then use the outer instance to aggregate the results. And naturally we will want some pretty tables of the results. :-)
We will want to control our testing in a few ways. One of the more important being how long we will allow tests to run for. If we are doing a large batch and we know that a given solution is intolerably slow at a given scale, then tests for a larger scale will be likewise. Thus we will want to provide a way to skip tests that take too long. The --limit parameter allows us to do this.
Once we put all of this together we can produce our comparisons. We generate a bunch of test files of different types and then let test_fuzz.pl rip.
Heres a table of the data (slightly munged for html purposes) we get back. The 'N/A' are because the test too longer than 100 seconds the previous try. (It resets when the wordcount changes.) "SC" stands for string count, "SL" for string length, "F" for Fuzz, "WC" for word count and "WL" for word length.
|Test File ||DFA ||Ysth ||Xor2 ||Xor |
All the files involved will be attached via replies to keep the root a reasonable size. Discussion of the various approaches can follow those nodes.