Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
My first thuoght was to use a database (as it seems to always be). Oracle has the Context engine and MySQL has FULLTEXT indices. But, unfortunately, MySQL's FULLTEXT engine is based around searching paragraphs and Oracle's engine is as well, plus being really really hard to work with. But, we can still harness the power of the RDBMS here, just with a little thinking and some pre-processing.

Translating what you're doing into set theory, you're attempting to do something like

SELECT string FROM some_table WHERE string = ? OR off_by( 1, string, ? ) OR off_by( 2, string, ? )

Obviously, the issue is the off_by() function is what we're searching for. But, what we're really looking for is the set of strings that are off-by-N. So, we create additional columns. We would need a table something like:

string char(25) match char(25) off_by int(1)
Then, we can do something like
SELECT string FROM some_table WHERE string = ? OR ( match = ? and off_by = 1 ) OR ( match = ? and off_by = 2 )
Some intelligent indexing and we're off to the races! Assuming you have enough storage, this solution returns in near-constant time.

Of course, the key is storage. You're talking about 25 character strings. Let's assume we're restricted to A-Z only (which will help a bunch). So, you need one row for off_by=0. off_by=1 means you need 25 * 25, or 625 rows. off_by=2 means you need an additional 625 * 24 * 25 rows, or 375_000 rows. So, to store off_by=1 and off_by=2, you need a total of 375_626 rows for each string. At 51 bytes per row, or 18.26MB. Per string you want to store. That's not good.

However, there are a number of optimizations one can do. The easiest one is that many of the strings in your dictionary are going to be off_by's of other strings. So, you are already storing the off_by, if you could only link the two. So, some sort of XREF table may be in order. You have one table linking the string to some integer ID. Then, you have a cross-reference containing the IDs of the two strings and their off_by value.

Now, this doesn't get you _all_ the different off_by values, but you can do something else - you can store all your words, then pre-compute their off_by=1 and off_by=2. You can then store a third column in the main table as to which dictionary(s) this word belongs to. (This accidentally solves your multiple dictionary issue.)

Now, we're still trading space for time. In this case, a LOT of space for, potentially, quite a bit of time. But, it's still a design decision I would take a serious look at. Doing some sort of pre-calculation like I'm describing would bring you to, at worst, some O(logN), and probably very close to O(1).

Being right, does not endow the right to be rude; politeness costs nothing.
Being unknowing, is not the same as being stupid.
Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.


In reply to Re: Fuzzy Searching: Optimizing Algorithm Selection by dragonchild
in thread Fuzzy Searching: Optimizing Algorithm Selection by Itatsumaki

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (7)
As of 2024-03-28 22:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found