more useful options | |
PerlMonks |
comment on |
( [id://3333]=superdoc: print w/replies, xml ) | Need Help?? |
My first thuoght was to use a database (as it seems to always be). Oracle has the Context engine and MySQL has FULLTEXT indices. But, unfortunately, MySQL's FULLTEXT engine is based around searching paragraphs and Oracle's engine is as well, plus being really really hard to work with. But, we can still harness the power of the RDBMS here, just with a little thinking and some pre-processing.
Translating what you're doing into set theory, you're attempting to do something like
Obviously, the issue is the off_by() function is what we're searching for. But, what we're really looking for is the set of strings that are off-by-N. So, we create additional columns. We would need a table something like: Then, we can do something like Some intelligent indexing and we're off to the races! Assuming you have enough storage, this solution returns in near-constant time. Of course, the key is storage. You're talking about 25 character strings. Let's assume we're restricted to A-Z only (which will help a bunch). So, you need one row for off_by=0. off_by=1 means you need 25 * 25, or 625 rows. off_by=2 means you need an additional 625 * 24 * 25 rows, or 375_000 rows. So, to store off_by=1 and off_by=2, you need a total of 375_626 rows for each string. At 51 bytes per row, or 18.26MB. Per string you want to store. That's not good. However, there are a number of optimizations one can do. The easiest one is that many of the strings in your dictionary are going to be off_by's of other strings. So, you are already storing the off_by, if you could only link the two. So, some sort of XREF table may be in order. You have one table linking the string to some integer ID. Then, you have a cross-reference containing the IDs of the two strings and their off_by value. Now, this doesn't get you _all_ the different off_by values, but you can do something else - you can store all your words, then pre-compute their off_by=1 and off_by=2. You can then store a third column in the main table as to which dictionary(s) this word belongs to. (This accidentally solves your multiple dictionary issue.) Now, we're still trading space for time. In this case, a LOT of space for, potentially, quite a bit of time. But, it's still a design decision I would take a serious look at. Doing some sort of pre-calculation like I'm describing would bring you to, at worst, some O(logN), and probably very close to O(1). Being right, does not endow the right to be rude; politeness costs nothing. In reply to Re: Fuzzy Searching: Optimizing Algorithm Selection
by dragonchild
|
|