Itatsumaki has asked for the wisdom of the Perl Monks concerning the following question:
Hello,
I'm well aware that fuzzy matching (e.g. Levenshtein distance and String::Approx) has been discussed here several times. I need to do a very large quantity of fuzzy matching, and I'm looking for some theoretical suggestions on how to start developing. This is meant to be design-time optimization, not premature optimization, I hope.
The basic problem:
Matching a 25-letter string against a dictionary of about 30 thousand variable-length sequences. I need to find all:
- exact
- fuzzy (1- and 2-letter mismatches)
Performance matters because I have a library of several hundred thousand 25-letter strings. And because I'll be repeating this analysis for all pairwise combinations of about 10 search and 20 target dictionaries.
Three basic approaches come to mind for the fuzzy-matching:
- Dynamically construct 25 separate regex's, one for each possible mismatch position. For instance:
    my @letters = split //, $string;
    my @regexes;
    for my $i (0 .. $#letters) {
        for my $j (0 .. $#letters) {
            $regexes[$i] .= $i == $j ? '.' : $letters[$j];
        }
    }

I'm sure that can be tuned substantially; I'm just trying to give the general idea of what I had in mind.
- On the other hand, because I know that my search sequence will always be 25 letters long, I know that every fuzzy-by-one match will contain a 12-letter exact match, and any fuzzy-by-two match will contain a 7-letter exact match. So I could index the large library for all possible 12-mers, then break each search sequence into its 14 possible 12-mers and extend on either side looking for fuzzy matches as above. The advantage of this approach is that I would only be searching regions of the library that have a theoretical chance of matching my string.
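The 12-mer indexing idea can be sketched roughly as follows. This is only an illustration of the approach, not code from the thread; the sub names and data layout are invented.

```perl
use strict;
use warnings;

# Index every 12-mer in each library sequence, recording where it occurs.
# %index maps a 12-letter substring to a list of [sequence-id, offset] pairs.
sub build_index {
    my ($sequences) = @_;    # arrayref of target sequences
    my %index;
    for my $id (0 .. $#$sequences) {
        my $seq = $sequences->[$id];
        for my $off (0 .. length($seq) - 12) {
            push @{ $index{ substr $seq, $off, 12 } }, [ $id, $off ];
        }
    }
    return \%index;
}

# A 25-letter query yields 14 overlapping 12-mers (offsets 0..13); every
# index hit gives a candidate region to verify with a full fuzzy compare.
sub candidate_hits {
    my ($query, $index) = @_;
    my @hits;
    for my $off (0 .. length($query) - 12) {
        my $kmer = substr $query, $off, 12;
        push @hits, map { [ @$_, $off ] } @{ $index->{$kmer} || [] };
    }
    return \@hits;
}
```

Each hit still has to be verified by extending to the full 25 characters and counting mismatches, which is where the real cost discussed below comes in.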
- Finally, since I have large libraries of both search strings (25-letters) and target strings (variable length), it might be advantageous to just parse out all the possible 25-length strings from the target library once. Once I have this index, identifying exact or fuzzy matches would be trivial.
So you can see why I'm looking for suggestions now: only the first of these three approaches works on single strings, and the performance of the other two won't scale with the search- and target-library sizes in any trivial way. I'm not sure how to figure out which one will be optimal for any given situation.
Any suggestions on how to determine which might be optimal for a given situation? And are there other (potentially faster) approaches I might have missed?
One note: I *am* aware that BLAST can do option #2 above -- I'm more trying to determine which option will be faster to run than worrying about ease-of-implementation right now.
-Tats
Replies are listed 'Best First'.
Re: Fuzzy Searching: Optimizing Algorithm Selection
by tachyon (Chancellor) on Nov 11, 2004 at 00:29 UTC
The best approach may simply be a straight string-scan, fail-fast approach. You can get this far more optimized than the RE engine could manage, and it should certainly be in C for speed. Breaking the dictionary into all 25-char-long patterns serves no useful purpose - it simply increases the disk I/O by a factor of 25. Ideally you would want the entire dictionary in memory anyway, so you totally avoid any disk I/O bottlenecks.

The type of indexing you may or may not be able to do depends on the data. You can form a useful index based on the fact that, as you note, even an off-by-2 match must have at least 7 consecutive matching chars - in fact I believe it is impossible to have less than 8, as 25 - 2 = 23 and the worst-case split of those 23 matching chars is 7+8+8, so you must have at least two matching substrings of 8+ chars. If you only have 2 possible values in your strings, the probability of randomly finding 8 consecutive chars in a given order is 1/2^8, or 1/256. If you are talking 4 chars, i.e. CGAT, then it is 1/65536. This should dramatically shrink the search space, as you only need to scan 17 chars upstream and downstream of this sequence.

Each search string can be broken into 18 possible 8-char substrings. The problem with this may be that (assuming a CGAT alphabet) with 100 search strings you will have nearly 1800 unique index strings, giving an average spacing in the search string of 65536/1800, or roughly 36 chars. Now we have to search 17+8+17 = 42 chars, so in a nutshell you still probably have to scan the whole search string every time. The advantage is limited to having to perform the scan on a limited subset of the find strings, and there is of course a price for the hashing and lookup. If the alphabet (A) is larger than 4 chars, then you should see very significant benefits due to the relative scarcity of A^8 substrings. It would quite possibly be worth precomputing the hash values for all the 8-char strings in your find and search spaces to speed your many-to-many search.
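The arithmetic behind those index-spacing figures is easy to check. This is just a sanity check of the numbers quoted above (4-letter CGAT alphabet, 100 search strings), not code from the post:

```perl
use strict;
use warnings;

my $alphabet   = 4;
my $index_len  = 8;
my $kmers      = $alphabet ** $index_len;     # 65536 distinct 8-mers
my $needles    = 100;
my $per_needle = 25 - $index_len + 1;         # 18 8-mers per 25-char needle
my $unique     = $needles * $per_needle;      # ~1800, ignoring collisions
my $spacing    = $kmers / $unique;            # ~36 chars between index hits

printf "%d possible 8-mers, ~%d index strings, average spacing ~%.0f chars\n",
    $kmers, $unique, $spacing;
```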
Re: Fuzzy Searching: Optimizing Algorithm Selection
by tall_man (Parson) on Nov 11, 2004 at 00:06 UTC
Re: Fuzzy Searching: Optimizing Algorithm Selection
by BrowserUk (Patriarch) on Nov 11, 2004 at 02:29 UTC
If you XOR two strings together, the resultant string will be the null character (chr(0)) in each position where the two strings match, and some other char where they do not. If you count the non-null bytes, that gives you a score for the fuzziness of the match.
To avoid having to process the large file of sequences multiple times, it is better to process all of the 25-ers against each sequence in turn. By using substr to step down the sequence, you can perform the XOR against each of the 25-ers for each position of the sequence. Putting that all together, I got:
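The code attached to the original post is not reproduced here; the following is a minimal sketch of the XOR technique just described, with invented names, showing the scoring trick and the substr walk down a sequence:

```perl
use strict;
use warnings;

# Count mismatches between two equal-length strings: XOR them, then count
# the bytes that are NOT "\0" (i.e. the positions where the strings differ).
sub fuzz {
    my ($a, $b) = @_;
    return ( $a ^ $b ) =~ tr/\0//c;
}

# Slide a 25-char window down the sequence and score every needle at every
# offset, keeping matches whose fuzziness is at or below $max_fuzz.
sub scan {
    my ($seq, $needles, $max_fuzz) = @_;
    my @matches;
    for my $off (0 .. length($seq) - 25) {
        my $window = substr $seq, $off, 25;
        for my $needle (@$needles) {
            my $f = fuzz($window, $needle);
            push @matches, [ $off, $needle, $f ] if $f <= $max_fuzz;
        }
    }
    return \@matches;
}
```

Note that the fuzziness score falls out of the same operation as the match test, which is why one pass gives an exact score for every needle at every offset.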
This process is currently running 500,000 25-ers against a file of 30,000 sequences averaging 1000 characters (500-1500), all randomly generated, at the approximate rate of 3.5 seconds per offset, whilst recording the lineno/offset/fuzziness for all matches where the fuzziness is 10 or less. The advantage is that this process gives you an accurate fuzziness factor for every 25-er matched against every position in a single pass. I don't have any feel for how this compares to using the regex engine to perform the matching, but I think it may run quicker. If you could indicate how many 100s of 1000s of 25-ers you are playing with, plus the average length of the 30,000 sequences, it would be possible to calculate an approximate time for the process that you could then compare with other approaches?
by tachyon (Chancellor) on Nov 11, 2004 at 03:46 UTC
At 3.5 seconds per offset it will complete in roughly 3 years ;-) Nonetheless, XOR works quickly in Perl. A small optimisation is to take the substr outside the inner keys loop, as you are doing it 499,999 more times than required per offset. Real sample data would be nice. The benefits from indexing depend on the alphabet size and the number of find strings. cheers tachyon
by BrowserUk (Patriarch) on Nov 11, 2004 at 03:57 UTC
Update: Indeed. The two optimisations reduce the 500,000-comparisons time from ~3.5 seconds to ~0.5 of a second. That reduces the projected overall runtime of my test scenario from 3+ years to under half a year. Worth doing :)

Yes. I performed the same calculation based upon my decision to use 500,000 25-ers. Hence my decision to ask for clarification. There are several ways to speed up the processing; using a different construct to do the counting is (probably) another. I'm just waiting for a limited benchmark I have running to complete before trying several things. I had thought that by avoiding actually copying each 25-char string out of the sequence I might save some time/memory, but now you've pointed it out, I realise that I can create an LVALUE ref to the substring and avoid 500,000 calls to substr for each inner loop. Thanks.
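The counting snippets compared in that post were attached as downloads and are not preserved above. As an illustration only (not the pair from the post), here are two common ways of counting the non-NUL bytes left by a string XOR, with a small benchmark:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $xored = ("A" x 25) ^ ("A" x 23 . "CC");   # two mismatched positions

# tr/// in counting mode vs. grep over the split bytes.
my $by_tr   = $xored =~ tr/\0//c;
my $by_grep = grep { $_ ne "\0" } split //, $xored;

print "tr: $by_tr, grep: $by_grep\n";          # both should report 2

cmpthese(50_000, {
    tr_count   => sub { my $n = $xored =~ tr/\0//c },
    grep_count => sub { my $n = grep { $_ ne "\0" } split //, $xored },
});
```

The tr/// form avoids building a 25-element list per comparison, which matters when the comparison runs billions of times.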
by kscaldef (Pilgrim) on Nov 11, 2004 at 03:28 UTC
by demerphq (Chancellor) on Nov 12, 2004 at 18:44 UTC
A DFA implementation is the next step up from a trie. In fact, a trie can be viewed as an alternative representation of a DFA with anchoring at the string start.
--- demerphq
by BrowserUk (Patriarch) on Dec 04, 2004 at 08:56 UTC
In case Itatsumaki should ever come back, here's a somewhat improved implementation of my XOR algorithm. The original projection of 1.2 years runtime to process 100,000 x 1,000 x 30,000 is now reduced to 7.6 days. Update (2004/12/08): Updated the code to correct an error that manifested itself when the sequences were not an exact multiple of the key length. (As noted below by demerphq.)
A couple of runs (on single sequences, for timing purposes) on data comparable to (produced by the same code as) other timings published elsewhere:
Examine what is said, not who speaks.
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" -Abigail
"Time is a poor substitute for thought" -theorbtwo
"Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." -tachyon
by demerphq (Chancellor) on Dec 08, 2004 at 18:14 UTC
As far as I can tell this code fails the "A" x 10/"A" x 11 test that you challenged my code with in a different thread. I was able to fix the bug by changing the if ($fuz <= $FUZZY) test to the following:
I have put together a test harness and framework for developing Fuzzy::Matcher implementations, which I will post in a new thread (when I get sufficient time), as IMO this one has become too polluted with acrimony to be worth continuing. Your code, as massaged to fit into this framework (along with the base class), is in the following readmore. Note that the test harness monitors memory utilization post-prepare(), which is why the default prepare() is the way it is (to reduce memory overhead).
--- demerphq
by BrowserUk (Patriarch) on Dec 08, 2004 at 23:46 UTC
by demerphq (Chancellor) on Dec 09, 2004 at 09:00 UTC
Re: Fuzzy Searching: Optimizing Algorithm Selection
by ikegami (Patriarch) on Nov 11, 2004 at 00:14 UTC
My solution builds a regexp that looks like the following, minus the line breaks:
Here it is:
Don't know if it's faster. Known bugs: 'apples' doesn't count as a fuzzy match for 'apple', but it would count as a fuzzy match for 'apple '. I can fix that if you want. Update: Since we never have any backtracking, we can simplify the above.
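ikegami's actual regex was attached to the post and isn't preserved above. The following is a hedged reconstruction of the general idea (one big alternation of position-wildcard patterns, anchored at both ends); the sub name is invented:

```perl
use strict;
use warnings;

# Build one anchored alternation regex matching $word with up to two
# single-character mismatches. Exact and 1-miss cases fall out of the
# generation because the two wildcard positions may coincide.
sub fuzzy_regex {
    my ($word) = @_;
    my @letters = split //, $word;
    my @alts;
    for my $i (0 .. $#letters) {
        for my $j ($i .. $#letters) {
            my @p = map { quotemeta } @letters;
            $p[$i] = '.';
            $p[$j] = '.';
            push @alts, join '', @p;
        }
    }
    my %seen;
    my $re = join '|', grep { !$seen{$_}++ } @alts;   # de-dup duplicates
    return qr/^(?:$re)$/;
}
```

The anchors reproduce the known bug mentioned above: 'apples' will not fuzzy-match 'apple'.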
by tall_man (Parson) on Nov 11, 2004 at 00:32 UTC
by ikegami (Patriarch) on Nov 11, 2004 at 01:00 UTC
I must have missed that part of the problem description. The first version should work fine without the anchors. There is a problem with removing the anchors however (and not just in my solution, I think). It would be possible to find a match with fuzzies while there's an exact match later on in the string.
Re: Fuzzy Searching: Optimizing Algorithm Selection
by dragonchild (Archbishop) on Nov 11, 2004 at 00:50 UTC
Translating what you're doing into set theory, you're attempting to do something like
Obviously, the issue is that the off_by() function is what we're searching for. But what we're really looking for is the set of strings that are off-by-N, so we create additional columns for the precomputed off-by values. With some intelligent indexing, we're off to the races! Assuming you have enough storage, this solution returns in near-constant time.

Of course, the key is storage. You're talking about 25-character strings. Let's assume we're restricted to A-Z only (which will help a bunch). You need one row for off_by=0. off_by=1 means you need 25 * 25, or 625 rows. off_by=2 means you need an additional 625 * 24 * 25 rows, or 375_000 rows. So, to store off_by=1 and off_by=2, you need a total of 375_626 rows for each string. At 51 bytes per row, that's 18.26MB - per string you want to store. That's not good.

However, there are a number of optimizations one can do. The easiest is that many of the strings in your dictionary are going to be off-bys of other strings, so you are already storing the off-by, if you could only link the two. Some sort of XREF table may be in order: one table linking the string to an integer ID, then a cross-reference containing the IDs of the two strings and their off-by value. Now, this doesn't get you _all_ the different off-by values, but you can do something else - you can store all your words, then pre-compute their off_by=1 and off_by=2. You can then store a third column in the main table saying which dictionary(s) this word belongs to. (This accidentally solves your multiple-dictionary issue.)

Now, we're still trading space for time. In this case, a LOT of space for, potentially, quite a bit of time. But it's still a design decision I would take a serious look at. Doing some sort of pre-calculation like I'm describing would bring you to, at worst, O(logN), and probably very close to O(1).

Being right does not endow the right to be rude; politeness costs nothing.
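The off_by=1 enumeration behind the 625-row figure can be sketched as follows (an illustration only; the sub name is invented):

```perl
use strict;
use warnings;

# Enumerate every string off-by-one from $word over the alphabet A-Z:
# for a 25-char word, 25 positions x 25 alternative letters = 625 variants.
sub off_by_one {
    my ($word) = @_;
    my @variants;
    for my $pos (0 .. length($word) - 1) {
        my $orig = substr $word, $pos, 1;
        for my $c ('A' .. 'Z') {
            next if $c eq $orig;
            my $v = $word;
            substr($v, $pos, 1) = $c;
            push @variants, $v;
        }
    }
    return \@variants;
}
```

Feeding each variant back through the same routine (and de-duplicating) would generate the off_by=2 set, which is where the row count explodes as described above.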
Re: Fuzzy Searching: Optimizing Algorithm Selection
by johnnywang (Priest) on Nov 11, 2004 at 01:34 UTC
How about something simple-minded: for each substring of the target, compute the difference with the search string; if the diff is less than 3, accept it. This will only find matches with at most a two-letter difference (substitutions, not missing letters). Output as follows; diffs are in front, -1 means no fuzzy match.
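johnnywang's code was attached to the post; this is a minimal sketch of the same simple-minded idea with invented names, including the -1 convention for windows that don't match fuzzily:

```perl
use strict;
use warnings;

# Character-by-character mismatch count between two equal-length strings.
sub diff_count {
    my ($a, $b) = @_;
    my $d = 0;
    for my $i (0 .. length($a) - 1) {
        $d++ if substr($a, $i, 1) ne substr($b, $i, 1);
    }
    return $d;
}

# Slide over the target; report the diff for windows within 2 mismatches,
# and -1 for windows that do not match fuzzily.
sub scan_diffs {
    my ($target, $query) = @_;
    my @out;
    for my $off (0 .. length($target) - length($query)) {
        my $d = diff_count(substr($target, $off, length $query), $query);
        push @out, $d <= 2 ? $d : -1;
    }
    return \@out;
}
```

This is equivalent in effect to the XOR approach above, just with an explicit per-character loop instead of a bitwise operation.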
Re: Fuzzy Searching: Optimizing Algorithm Selection
by tachyon (Chancellor) on Nov 11, 2004 at 03:56 UTC
by perlcapt (Pilgrim) on Nov 11, 2004 at 04:06 UTC
by BrowserUk (Patriarch) on Nov 11, 2004 at 04:33 UTC
From a fairly quick perusal of the options, I don't think agrep will help much, except maybe as a pre-filter. Maybe I missed some things in amongst the six help 'screens'?
by halley (Prior) on Nov 12, 2004 at 19:10 UTC
Somewhere in my files I have the follow-up to this paper, which allows for affine weighting for various symbols. For example, you might say that vowels are more interchangeable than consonants, if you're looking for fuzzy matches in the pronunciation problem space. I think the author of that paper went on into bioinformatics in a big way after that.
Re: Fuzzy Searching: Optimizing Algorithm Selection
by BrowserUk (Patriarch) on Nov 12, 2004 at 02:33 UTC
Just in case Itatsumaki should happen back this way :) Let's put this problem into a little perspective.

Using the regex engine

To use regexes to test for exact, 1-mismatch and 2-mismatch matches requires 1 + 25 (25C1) + 300 (25C2) = 326 different regexes. Applying this mechanism to 1,000 25-ers across 1,000 sequences of 1000 chars requires 326,000,000 calls to the regex engine[1]. If Perl could invoke the regex engine 10 times per millisecond[2], that would take just over 9 hours. If you extend this to 100,000 25-ers applied to 30,000 sequences of 1000 chars each, then the runtime will be just over 3 years!

If you double the number of 25-ers, you double the time. If you double the average length of the strings, you slightly less than double the time. However, if you allow for up to 3 mismatched characters, the number of regexes required goes from 326 to 2626, and you have multiplied the time required by a factor of 8, giving a total of just under 25 years. Allow for 4 mismatches and you're in the region of 145 years. Note: in this simplified scenario, no attempt has been made to record where the matches are found, nor which level of fuzziness matched.

Building an index of substrings

Starting with the premise that every 1-miss match must contain at least a 12-character exact match, and using the same 100,000 x 30,000 x 1000 scenario from above: to index all the 12-character substrings would require (1000 - 11 = 989) x 30,000 index entries, each containing the 12-byte substring + 2-byte (min) sequence number + 2-byte (min) offset. That's 450 MB of index, which should easily fit into memory on any halfway decent machine. But in its raw form it doesn't buy you much. Even sorted, using a binary chop to look up your data is going to take up to 19 chops to locate each item. At say 10 microseconds per chop (way optimistic), this doesn't sound too bad: 100,000 25-ers break up into 14 x 12-char substrings each, giving 14 * 100,000 * 19 = 266 seconds spent on lookup. But what have we achieved?
We may have succeeded in rejecting some of the sequences as not containing any of the 25-ers - but it is unlikely. The chance that a 1000-character string made up from A, C, G and T will contain any given 25-char substring is around 67 million times smaller than the chance that it will contain a given 12-char substring. That's difficult to comprehend, but by reducing the length of the substring from 25 to 12, you are 67 million times less likely to be able to reject a given 1000-char sequence on the basis of it not containing the 12-char substring! Reduce the substring to 7 or 8 chars (for the 2-miss match) and your effective rate of elimination drops through the floor. And once we have located the index entry that corresponds to the 12-byte substring, we still need to use the sequence number/offset information to go out and read a 38-byte substring from the sequence in order to verify the match!
If the Xs represent our known 12-char match, the other 13 chars of our match could be at either end; hence the need to read 38 bytes from the sequence. But each of our 100,000 25-ers could fit around those 12 bytes in 14 different ways. That's fourteen different regexes we need to apply to that 38-byte substring (which needed to be read from a variable-length file!) for every one of our 29,000,000+ 12-byte substrings that matched. Sorry, but the math involves probabilities here and the whole lot got beyond my ability to keep straight. The upshot is that for even a 1-miss match, the index has done little or nothing to reduce the work required, and in many ways has increased it substantially. By the time you get to 2-miss matches, the numbers become astronomical. This option goes nowhere fast.

"Use a database!"

Indexing via a database (of any form) simply exacerbates the above problems. The size of the database required to index 30,000 x 1000-char sequences is huge. Building the indexes would take on the order of weeks, and the recall would be on the order of seconds. And with all that done, you still have to do most of the regex work yourself locally. If anyone feels like contesting this conclusion, please show me the code - on realistically sized datasets. I may not be the world's greatest DBA, but my limited attempts to explore the possibilities of using a DB and SQL for this indicate that it would be 2 or 3 orders of magnitude slower than the flat files and regex solution above. That is to say, I estimate that the 100,000 x 30,000 x 1000 x 1-mismatch scenario above would require 30 years of processing time, and that does not include the time required to build the DB.

AGREP and similar utilities

I spent an inordinate amount of time reading the help and playing with 2 different versions of agrep. My conclusion is that (as I indicated here) there is no way to utilise this tool for this purpose. It simply does not have any options to retrieve the sequence number/offset/fuzziness information required. If anyone thinks they know better, I'd be all too pleased to see the working solution.

Stringy XOR

Applying the XOR method I describe elsewhere, the 1000 x 1000 x 1000 scenario I described takes 28.5 minutes.
(Code at end.) That gives a time of 1.723 microseconds per 25x25 fuzzy comparison. Applied to the 100,000 x 30,000 x 1,000 scenario, that would require 59 days, 9 hours. That's just over 5% of the time of (or 1,845% quicker than) the most optimistic estimate for the best of the above. And the real trick is that if you increase the number of mismatched characters to 3 or 10 or 20, the time required remains the same. For the 3-mismatch scenario, that means 0.65% of the runtime, or 15,000% faster. For the 4-mismatch scenario, that means 0.11% of the runtime, or 89,000% faster. After that, the numbers begin to get silly :)

Further, I think the XOR method could be improved. If the sequences and 25-ers are limited to the ACGT alphabet, rather than containing other characters representing unknowns (Xs) and other letters representing proteins(?) and stuff, then you could pack 4 bases into each character and reduce the length of the strings by a factor of 4. The downside is that you have more work to determine the fuzziness factor. So, maybe not.

BLAST and similar tools

I'm not really qualified to comment, as I do not understand the documentation for the BioPerl bundle. What I will say is that, given the depth and complexity of the hierarchy of modules, the known overhead of method lookups, and the less-than-sparkling performance of Perl's sub calling, I fail to see how anything in the bundle would come close to outperforming the regex engine and flat files. Maybe some BioPerl aficionado will happen by and post a 20-line solution that outstrips everything here - but I installed the bundle a while ago, and I couldn't even work out where to start. Then again, I'm not a biochemist.

[1] Yes.
This number of individual calls to the regex engine could be reduced by combining the regexes into one big regex using PreSuf or similar tools, but given the nature of the regexes involved, the complexity of the output from combining 326 regexes into one is likely to slow you down rather than speed things up. To see what I mean, try generating the 326 regexes required for the 2-mismatch case on a 25-character ACGT string (the OP's post has code to do this), combine them into a single alternation regex, run your favorite regex optimiser, and then time a few iterations of the result against a 1000-char ACGT sequence.

[2] This is very optimistic, even for the simple cases:
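One rough way to sanity-check the "10 regex calls per millisecond" figure (a sketch, not the benchmark from the post) is to time a single compiled 25-char pattern with two wildcards against a random 1000-char ACGT sequence:

```perl
use strict;
use warnings;
use Benchmark qw(timethis);

# Random 1000-char ACGT sequence, and a 25-char pattern taken from a real
# window of it with two positions wildcarded, so the pattern can match.
my @bases = qw(A C G T);
my $seq = join '', map { $bases[ rand @bases ] } 1 .. 1000;
my $pat = substr $seq, 500, 25;
substr($pat, 5, 1)  = '.';
substr($pat, 17, 1) = '.';
my $re = qr/$pat/;

die "pattern should match its own window" unless $seq =~ $re;

# Benchmark reports iterations/second; divide by 1000 for calls/ms.
timethis(10_000, sub { $seq =~ $re });
```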
by hv (Prior) on Nov 12, 2004 at 11:48 UTC
"If you double the number of 25-ers, you double the time." Yes. "If you double the average length of the strings, you slightly less than double the time." No: 1 + 50C1 + 50C2 = 1 + 50 + 1225 = 1276, nearly 4 times the 326 different regexps required for a length-25 string. Of course, there are two ways of allowing for fuzziness: replace /x/ with /[^x]/, or replace it with /./; the latter approach means you don't also need the 0- and 1-mismatch patterns, leaving 300 patterns (for length 25, up to 2 errors) or 1225 (for length 50). Hugo
by BrowserUk (Patriarch) on Nov 12, 2004 at 12:13 UTC
"No: 1 + 50C1 + 50C2 = 1 + 50 + 1225 = 1276, nearly 4 times the 326 different regexps required for a length-25 string." Sorry, I was lax in my language. Instead of "...length of the strings...", I should have said sequences. That is, if you double the length of the sequences being searched from 1000 to 2000 chars (average), but search for the same 25-char needle, then the time taken is slightly less than double. I agree that if the length of the needles doubles, the number of regexes required close to quadruples. As I understand the problem, using /./ for the fuzziness is the right thing. However, whilst a regex with 2 wildcards will match anything that the same regex with only one wildcard would match, the problem with discarding the latter is that you will no longer be able to record how fuzzy the match was - other than perhaps capturing the wildcards and then using substr to determine which of the wildcards would have matched had it not been wild. Overall, with the slowdown from capturing plus the additional work determining the fuzz factor, it is probably quicker to retain the 1-miss and 2-miss regexes?
by husker (Chaplain) on Nov 12, 2004 at 15:02 UTC
I think it's interesting that the XOR method, which appears to be the most efficient, and which widens its lead as the problem space expands, corresponds more closely to how nature itself "searches" and "parses" amino-acid strings than any of the other methods. Our sensibilities ought to help us realize that if millions of years of evolution have selected a solution to a problem, there must be something useful there.
by demerphq (Chancellor) on Nov 12, 2004 at 19:57 UTC
Well BrowserUk, I just finished some testing of my approach. I was able to search a dictionary of 30 million words for one of 1245 possible fuzzy matches in 30 minutes on a 555 MHz PC. Admittedly this was anchored at the first char, not a match "in string" as it were. Considering your extremely harsh comments in the CB, I'd be very happy to run whatever code you choose against my code to see which is faster. The code takes as an argument a string of digits and then finds all 0-, 1- or 2-char mismatches of that string in the dictionary with the correct number of digits. If you provide code so that I can do the equivalent of
then I'll run your code against mine with the same test set, etc., and we will see whose solution is faster. Since my solution represents a special case of the general problem, yours should perform even better than you have suggested here, so I expect that you'll win. But until you have, don't you think you should keep your comments at least polite?
--- demerphq
by BrowserUk (Patriarch) on Nov 14, 2004 at 03:57 UTC
Why tries won't work for this application

For 25-char (C) keys from a 4-char alphabet (A) with 0, 1 or 2 mismatches (M), that's 2776 keys in total. No problem building, or with the performance of, a trie with 2776 keys - although the implementations currently on CPAN chew memory at a prodigious rate; a C implementation would be much less memory hungry as well as faster.

The problems

To simplify the description of the problems, I'll start with a 4-character key from an alphabet of ACGT. Further, we'll take the OP's description and look for matches with 0, 1 or 2 mismatched characters. For 4-char (C) keys from a 4-char alphabet (A) with 0, 1 or 2 mismatches (M), that's 67 keys in total. Picking one of the 256 possible keys, I'll use 'ACGT'. The keys we need to put into our trie in order to fuzzy-match 'ACGT' are:
Now it's no problem to use a more intelligent generation routine to avoid generating most of the duplicates. Indeed, noticing that the brute-force 2-miss generation process produces all of the keys required by the 0- and 1-miss cases, it would be a simple matter to skip those generation steps and simply de-dup the 2-miss set.
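The brute-force generation plus de-duping can be sketched with a hash standing in for the trie (an illustration; the sub name is invented). It reproduces the counts quoted above: 67 keys for a 4-char key over ACGT, 2776 for a 25-char key.

```perl
use strict;
use warnings;

# Generate every key within 2 mismatches of $word by substituting every
# alphabet letter at every pair of positions; the hash de-dups, just as
# inserting duplicate keys into a trie would. Exact and 1-miss keys fall
# out because the substituted letters may equal the originals.
sub fuzzy_keys {
    my ($word, @alphabet) = @_;
    my %keys;
    my $n = length $word;
    for my $i (0 .. $n - 2) {
        for my $j ($i + 1 .. $n - 1) {
            for my $a (@alphabet) {
                for my $b (@alphabet) {
                    my $k = $word;
                    substr($k, $i, 1) = $a;
                    substr($k, $j, 1) = $b;
                    $keys{$k} = 1;
                }
            }
        }
    }
    return [ keys %keys ];
}
```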
In fact, provided your trie implementation doesn't die or bellyache when you attempt to add a duplicate key, it will automatically perform the de-duping process for you.

And there's the rub

Once you start storing multiple sets of overlapping data in your trie in order to make best use of its performance, when you get a match you can no longer distinguish whether the match occurred because it was an exact match, a 1-miss match or a 2-miss match. So, even if you generate one trie per key and apply them to each of the 976 25-char substrings in a 1000-char sequence one at a time (which is necessary using a trie, and has the beneficial side-effect of letting you know at what position within the sequence the match occurred - which is a requirement), you still don't know why it matched. Even if you arrange for the value associated with the key to be an array of tags associated with the sources of the key, that will only tell you the possible sources of the match, not the source. So, whilst tries are very good at doing fast lookups, and can be used for doing fuzzy lookups, the problem is that, like a regex, they don't tell you what they matched. With a regex you can use the capture mechanism to discover this; whilst that slows the process somewhat, it is more than compensated for by the fact that the regex engine can be allowed to run through the entire string, locate every match, and record where it found them (@- and @+), so the number of calls to the regex engine is one per sequence per fuzzy match.

MegaTrie

Now, it is possible to combine the keys generated from more than one exact key into a single trie. At least in theory, this would allow the 100,000 exact keys plus all their fuzzy matches to be combined into one MegaTrie. This approximates a DFA, solving all 100,000 * 2700 (2-miss) "Does it match?" questions in one go for each 25-char substring, reducing the problem to 976 * 30,000 = 29,280,000 lookups (here defined as a transition from Perl into C and back).
When compared with the 954,528,000,000,000 comparisons otherwise required by the 2-miss scenario, this sounds very inviting. The problem is that all you then know is that one of the 100,000 exact matches, or one of their 270,000,000 possible fuzzy matches, did or did not match each of the 29,280,000 substrings. So you get to know something (that the 25-char substring at position P of sequence S matched something) very quickly - but not what (which of the 100,000), nor why (exact, 1-miss or 2-miss).

Fast pre-elimination?

At this point, I thought that maybe the MegaTrie idea could be used as a way of pre-eliminating some of the nearly 30 million substrings, prior to using (say) the XOR method to derive the required information. However, because the (Mega)Trie needs to be applied individually to each substring - unlike a regex, for example - and a trie lookup involves either recursive or iterative sub calls, it is slower than just using the XOR alone. That means there is no saving to be had using a trie, even for this limited purpose.

Deterministic Finite-state Automaton

The ultimate expression of the trie idea would be to write a DFA. Essentially this consists of a MegaTrie-like data structure and a loop to apply it successively down the length of a string, substring by substring, reporting the position(s) of the match(es) found. Much like the regex engine does, but failing early and moving on rather than backtracking. (It's worth noting that you can also use the 'cut' operator (?>...) to eliminate backtracking in the regex engine.) This would reduce the number of lookups (again defined as a transition from Perl into C and back) to 30,000. You still wouldn't know what matched, but the speed-up would offset the cost of further elimination. The problems include:
The point of all this?

Firstly, I'd done all the above work in order to verify my position anyway, and I thought someone might benefit from it. Secondly, the next time someone calls you to question over something you have posted,
FWIW, most of these have been applicable to me in the past. I'm no saint (in the other sense of the word), as you well know. Oh! And if you ask for clarifications of the problem, do give the other guy the chance to produce them before:
Thirdly, maybe a view through the "other guy's eyes" of yesterday's conflagration will persuade you to review your final act in our off-post communications?
by demerphq (Chancellor) on Nov 17, 2004 at 21:42 UTC
by BrowserUk (Patriarch) on Nov 18, 2004 at 08:24 UTC
by BrowserUk (Patriarch) on Nov 27, 2004 at 13:24 UTC
by tall_man (Parson) on Nov 14, 2004 at 23:02 UTC
by BrowserUk (Patriarch) on Nov 14, 2004 at 23:29 UTC
by BrowserUk (Patriarch) on Nov 12, 2004 at 20:41 UTC
demerphq, you are wrong. How long were your "words"? Less than 25 characters?

For the two-miss scenario, matching each of 100,000 25-character needles against each of 30,000 x 1000-char strings requires: 326 * 100,000 * 976 * 30,000 comparisons = 954,528,000,000,000 comparisons. Your 30,000,000 * 1275 = 38,250,000,000 comparisons. Your rate of comparisons is 21,250,000/second. Your runtime to run the 100,000 x 30,000 x 1000 is: 1 year, 5 months, 4 days, 21 hours, 29 minutes, 24.7 seconds.

For the 3-miss scenario the numbers are: 2626 * 100,000 * 976 * 30,000 = 7,688,928,000,000,000. Your runtime to run the 100,000 x 30,000 x 1000 is: 11 years, 5 months, 20 days, 2 hours, 51 minutes, 45.88 seconds.

For the 4-miss scenario the numbers are: 28,252 * 100,000 * 976 * 30,000 = 82,721,856,000,000,000. Your runtime to run the 100,000 x 30,000 x 1000 is: 123 years, 4 months, 9 days, 17 hours, 27 minutes, 3.46 seconds.

As for your challenge: I have published my code. Where is yours? Try applying your method to real-world data. If you have somewhere I can post the 1000 x 1000-char randomly generated sequences (994 KB) and the 1000 x 25-char randomly generated search strings (27 KB), then I will send them to you so that you can time the process of producing the required information--sequence no. / offset / fuzziness (number of mismatched characters)--that my published test code produces in 28.5 minutes. Then, and only then, when we are comparing eggs with eggs, will there be any point in continuing this discussion.
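For reference, the XOR method referred to above can be sketched in a few lines of Perl. This is a minimal illustration of the technique (string XOR plus tr/// counting), assuming equal-length strings; it is not BrowserUk's published test code, and the subroutine names are mine:

```perl
use strict;
use warnings;

# Count character mismatches between two equal-length strings.
# Bitwise string XOR yields "\0" wherever the characters agree,
# so counting the non-NUL bytes gives the Hamming distance.
sub mismatches {
    my ( $needle, $candidate ) = @_;
    my $diff = $needle ^ $candidate;    # string XOR, byte by byte
    return $diff =~ tr/\0//c;           # count bytes that are not "\0"
}

# Slide the needle along a longer sequence, reporting every offset
# whose substring is within $max_miss mismatches.
sub fuzzy_offsets {
    my ( $needle, $sequence, $max_miss ) = @_;
    my $len = length $needle;
    my @hits;
    for my $pos ( 0 .. length($sequence) - $len ) {
        my $m = mismatches( $needle, substr( $sequence, $pos, $len ) );
        push @hits, [ $pos, $m ] if $m <= $max_miss;
    }
    return @hits;
}
```

Each call to `mismatches` is a single pass in C, which is why applying it to every substring stays tractable where per-character Perl loops do not.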
by demerphq (Chancellor) on Nov 12, 2004 at 21:19 UTC
by demerphq (Chancellor) on Nov 12, 2004 at 23:48 UTC
by BrowserUk (Patriarch) on Nov 13, 2004 at 00:36 UTC
Re: Fuzzy Searching: Optimizing Algorithm Selection
by erix (Prior) on Nov 11, 2004 at 19:31 UTC
I would be really thankful if those who post bioinformatics questions would say what bioperl has to offer for their problem. Or, if they have not looked for it, say that. This is a (general) request, not criticism. The problem is interesting enough. :)
Re: Fuzzy Searching: Optimizing Algorithm Selection
by NiJo (Friar) on Nov 12, 2004 at 22:47 UTC
Re: Fuzzy Searching: Optimizing Algorithm Selection
by demerphq (Chancellor) on Nov 12, 2004 at 16:00 UTC
Update: Please don't forget to note the sentence at the end, which points out that I got muddled and was thinking only of the case where the string is matched at the start. Sorry.

I might approach the problem by first converting my search string into all of the possible other strings that I will consider to be a fuzzy match, then building a trie to represent this structure, and then using the trie to search the dictionaries. (You may want to code the trie in Inline::C.) This will outperform regexes in my experience (trie lookup is pretty well optimal, O(KeyLen) ~ O(1)).

If I had to do these searches a lot and the size of my alphabet was reasonable, I would look at producing fingerprints of my dictionary (i.e. all words with the same letters in them, regardless of counts, would have the same fingerprint). Assuming you have a fast way to get the list associated with a given fingerprint, you should have cut your search space vastly, as a single XOR against the fingerprint will tell you if a match is possible. Even if the alphabet is small (like ACTG) you could use tuples for your fingerprinting algorithm; for instance, assuming you use an integer for the fingerprint, you could assign one bit to each possible combination.

I could probably put together an implementation that worked on a base-10 alphabet (i.e. numeric digits) and searched a 30-million-record dictionary for matches if you provided a routine to generate the match list. Unfortunately I couldn't show you the code, but the results might tell you whether the techniques are appropriate. Incidentally, afaict the upper time of this approach would be 25 * N, where N is the number of words in your dictionary. Assuming you fingerprint, the time will be much lower than that. Thus the algorithm would be O(N), which I'm pretty sure is faster than anything else posted here.
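The letter-set fingerprint idea might look something like this over the ACGT alphabet. This is a hedged sketch, not demerphq's implementation: the subroutine names and the 2-bits-per-mismatch rejection bound are my own framing of the idea.

```perl
use strict;
use warnings;

# One bit per alphabet letter: a word's fingerprint records which
# letters occur in it at all (counts are deliberately ignored).
my %bit = ( A => 1, C => 2, G => 4, T => 8 );

sub fingerprint {
    my ($word) = @_;
    my $fp = 0;
    $fp |= $bit{$_} for split //, uc $word;
    return $fp;
}

# A single substitution can remove at most one letter from the set and
# add at most one, i.e. flip at most 2 bits. So if the fingerprints of
# two words differ in more than 2*k bits, they cannot be within k
# mismatches of each other, and the expensive comparison can be skipped.
sub may_match {
    my ( $fp_a, $fp_b, $k ) = @_;
    my $binary  = sprintf '%b', $fp_a ^ $fp_b;
    my $flipped = $binary =~ tr/1//;    # count the differing bits
    return $flipped <= 2 * $k;
}
```

A single integer XOR per dictionary word is far cheaper than a character-by-character comparison, which is the whole point of the pre-filter.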
An alternative approach would be to store all of your dictionaries in one great big trie and then reverse the process, looking up each fuzzy word in the trie and recording the words matched. This would only be suitable if the density of the data stored in the dictionaries was high and/or memory was abundant, but it would have the advantage of being almost instantaneous to search. Assuming one word had 300 fuzzy equivalents, the structure should be searchable in 300 * 25 timesteps regardless of dictionary size. Gah, I just realized that the above only applies if you need to match from the front. If you need to match from any position in the dictionary words then you need to go beyond a trie.
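As a rough sketch of the enumerate-the-fuzzy-set idea, assuming the ACGT alphabet: generate every string within one mismatch of the needle and drop the lot into a hash, used here as a flat stand-in for the trie (a real trie or an Inline::C version would be more memory-efficient). For a 25-letter word over 4 letters that is only 3*25 + 1 = 76 keys:

```perl
use strict;
use warnings;

# Enumerate every string within Hamming distance 1 of $word over ACGT
# and return a hash ref for O(1) exact lookup of the whole fuzzy set.
sub fuzzy1_set {
    my ($word) = @_;
    my %set = ( $word => 1 );    # distance 0: the word itself
    for my $i ( 0 .. length($word) - 1 ) {
        for my $c (qw( A C G T )) {
            my $variant = $word;
            substr( $variant, $i, 1 ) = $c;    # substitute one position
            $set{$variant} = 1;
        }
    }
    return \%set;
}
```

Each 25-mer pulled from the target library can then be tested with a single `exists $set->{$kmer}`; extending to 2 mismatches is the same loop nested once more (roughly 2,800 keys per needle), though, as noted above, this only finds matches anchored at the positions you enumerate.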
--- demerphq
Re: Fuzzy Searching: Optimizing Algorithm Selection
by Anonymous Monk on Nov 12, 2004 at 18:19 UTC
Re: Fuzzy Searching: Optimizing Algorithm Selection
by ambrus (Abbot) on Jun 01, 2008 at 18:17 UTC
Get a perl 5.010. Download the source of re::engine::TRE. Download the full source of the TRE library. Replace the reduced source that comes with the module with the full source. Change the configuration so TRE will be built with fuzzy matching support enabled. Build, test, debug, fix, repeat. You get fast fuzzy matching that enables you to do exactly what you want.