http://www.perlmonks.org?node_id=694409

punch_card_don has asked for the wisdom of the Perl Monks concerning the following question:

Muffled Monks,

Working on a simple SQL db search using the DBI. Client wants the system to provide "suggestions" of "similar" or "related" words for user to try if zero results for the input search term.

What the client thinks is a little side-request looks to me like a major word-clustering project in itself if I roll my own. I'll need code to take every word in the db and go through a dictionary of similar (ex: input="car", did you mean "care? or scar? or cart?.....) or related (vehicle, automobile...) words, compile all that into a nother table for recall.......

Or is there something out there already that has figured all this out?

Thanks.




Forget that fear of gravity,
Get a little savagery in your life.
  • Comment on Module to provide suggested terms for search?

Replies are listed 'Best First'.
Re: Module to provide suggested terms for search?
by samtregar (Abbot) on Jun 27, 2008 at 18:35 UTC
    That does sound tough. One possible algorithm you could try is Text::Soundex. I seem to remember some coverage in Joe Celko's SQL For Smarties book.

    You might point out that Google will sell them a search appliance that no doubt has that feature along with lots of other neat Google features they probably want. Or you might not, if you like your job. ;)

    -sam

      Well, that's the thing. I think users are getting used to seeing Google's "Did you mean...?" and imagine it's a trivial piece of technology to ask for.
        And it's your job to disabuse them of that notion. Start with a thought experiment - "How much do you think Google paid to develop that feature?" Most people realize that Google has more money than God, so that might get them thinking...

        -sam

        I work with this kind of chalenge on my PhD thesis. But, considering this is not a complete perl related subject, I will not post about here now. If you or other wanna talk again, send private message.
        On the semantic aproach, it can be done, and is not so difficult. There are some Perl modules to help too.
Re: Module to provide suggested terms for search?
by creamygoodness (Curate) on Jun 28, 2008 at 01:19 UTC

    There are two popular approaches for "suggesting" at the large scale.

    The first involves analyzing a huge volume of prior search queries.

    • Compare the second-to-last query and the last query a user supplied during any short session.
    • Assume that the last was more "successful" in the user's view than the second-to-last.
    • If the query strings are "close" by some measure, such as off by a single character, then the next time someone types in a query that looks like that second-to-last, perhaps we should suggest the query string that the earlier user was apparently satisfied with.

    Of course, it takes an enormous amount of data-crunching to create reliable suggestions using this technique. Furthermore, the suggestions are likely be specific to the search domain, so analyzing e.g. the pirate bay's search query logs would not necessarily yield good suggestions for e.g. apple.com.

    The second approach to suggesting is to find documents which are "nearby" in term-space to the query's term space. For a small scale illustration of vector-based similarity measures, there's an excellent article on perl.com about building a vector space search engine, but for large scale you need to reduce the dimensional space using a tool such as Latent Semantic Analysis, or LSA. LSA and its relatives are typically used to power those "more like this" queries.

    You could also combine the two approaches by using vector similarity as another measure for determining whether the last and next-to-last queries are related.

    None of this does you any good, though, except for explaining to your client what's truly involved in providing high quality suggestions -- because to the best of my knowledge, there are no open source implementations of either of these algorithms. (The patent on LSA expires this year.)

    However, it's possible to use ASpell to provide spell checks, and if you're feeling ambitious, you can feed ASpell with custom dictionaries. Peter Karman wrote Search::Tools::SpellCheck for precisely such an application.

    --
    Marvin Humphrey
    Rectangular Research ― http://www.rectangular.com
Re: Module to provide suggested terms for search?
by goupilInside (Sexton) on Jun 27, 2008 at 22:32 UTC

    While not exactly what you are looking for, I thought : couln't you get something approaching using the Soundex and other phonetic algorithms ?

    There seems to be a few different on the CPAN.

    And then, paf ! Serendipity and CPAN search brings you : Text::Approx Which looks a lot like a solution to your problem.

    And of course, if you have a function calculating Levensthein distance database side (which is relatively simple to write as stored procedure when it does not already exists), you could write something like:
    SELECT word FROM words WHERE leven_dist(word,myword) < some_limit

    As an exemple, a postgresql PGSQL version can be found there, a MYSQL stored procedure doing that there, and an Oracle PL/SQL there.

    Please note that I did not check the correctness or efficiency of those implementations.

    An optimisation wasting space vs CPU would be to precompute the most distances for the most commonly used words and typos (my guess being that the most common typos for the most commonly used words would heavily be dominating the use cases).

    Thank you again CPAN.

Re: Module to provide suggested terms for search?
by mpeg4codec (Pilgrim) on Jun 27, 2008 at 20:12 UTC
    You may want to check out Plucene, which is the Perl port of Apache's Lucene. I know the Java version's QueryParser supports fuzzy matching (using the ~ modifier on search terms). Unfortunately, the Perl port is incomplete, so it may not have the feature.

    If worse comes to worse, you can check out Solr, which runs as a Java web service and would allow you to query a Lucene index using XML over HTTP.

      Plucene isn't recomended anymore because its "too slow". KinoSearch is faster.
Re: Module to provide suggested terms for search?
by holli (Abbot) on Jun 27, 2008 at 18:30 UTC
    Acme::Telepathy?


    holli, /regexed monk/
Re: Module to provide suggested terms for search?
by jds17 (Pilgrim) on Jun 27, 2008 at 18:37 UTC
    I have not used Thesaurus::DBI yet, but I would give it a try in case I had your problem.
Re: Module to provide suggested terms for search?
by artist (Parson) on Jun 27, 2008 at 20:28 UTC
    Is there a third party API for this? That would be a very helpful service to millions of websites. You can also try using the search-term for proper spelling with spell-check and present that. Spell-check can use the words in your list to provide proper spelling. You can also combine with other algorithms such as Soundex. You can also prepare your list of related words for a given word with various algorithms off-line and present it to the user.
    --Artist
Re: Module to provide suggested terms for search?
by Sagacity (Monk) on Jun 29, 2008 at 05:02 UTC
    Here's what you can tell your client, with the CFO at the meeting. I can only say that it costs twice as much to re-invent the wheel (It has become my policy anyway.), so get that checkbook out and start on the far right of the check, start putting zero's down going to the left. When you reach the seventh zero, place a one next to it and you can have what you are looking for barring any unforeseen circumstances. Good Luck!
Re: Module to provide suggested terms for search?
by vdrab (Initiate) on Jun 29, 2008 at 13:52 UTC
    If you have a reading knowledge of python, this should get you halfway there... http://norvig.com/spell-correct.html v.
Where should the solution be applied?
by Anonymous Monk on Jun 30, 2008 at 13:53 UTC
    This humble, less-than-nothing monk would suggest that perhaps the original poster wishes to search the database directly with wildcarding, rather than search some database output with Perl. As much as I enjoy working with Perl, a database-based solution would likely perform better.

    Oracle and MySQL have soundex functions. Postgres seems to also, as part of their fuzzystrmatch function. If your client uses Oracle and expects to do this a lot, Oracle provides function-based indices which could precompute each row's soundex value for faster response times.

    Of course soundex only helps with similar words, not with related words (e.g., "boys" and "boysenberry" are similar, while "boys" and "girls" are related). The thesaurus reference sounds like a good idea in that regard.

    I've had occasion to use Perl's soundex function and it works quite well, although there's some performance penalty due to the additional function calls.

Re: Module to provide suggested terms for search?
by clwolfe (Scribe) on Jun 30, 2008 at 20:30 UTC

    For the spelling portion, I've had great luck with Text::Aspell. It's simple and effective. There's also a driver for it for Search::Tools, if you're using that to build your engine.

    For semantic similarity, that's a much different (and harder) problem, and I think the other posts have some good leads for you. But I think if you can roll out a Text::Aspell based solution quickly, your client may realize that's what they want anyway.