PerlMonks  

Re^2: Database Search Format Engine

by Polyglot (Monk)
on Jun 17, 2009 at 16:55 UTC (#772470)


in reply to Re: Database Search Format Engine
in thread Database Search Format Engine

I should add that, for my needs, I will be working with the Asian languages, and therefore KinoSearch is out. Having looked again at Plucene just now, it is unclear whether or not it would be helpful.

Also, as this is web-based, the user will only be given select rights, and not the full DB rights as you have suggested.

Having the user write the select query themselves would be an interesting option, but most of my users will be average non-programmers without a clue as to how to do this, so by default the script must accept plain search terms and format them into the DB query itself.

Blessings,

~ Polyglot ~


Re^3: Database Search Format Engine
by moritz (Cardinal) on Jun 17, 2009 at 17:24 UTC
    You should really use an existing search engine for that; they contain all the logic you need. There are various other search engines out there that I haven't mentioned, and I'm sure you'll find one that works for you.
    "I will be working with the Asian languages, and therefore KinoSearch is out"

    I know nothing about indexing Asian languages, so out of idle curiosity I wonder what's the issue that KinoSearch has with them (and mysql doesn't). Is there a simple explanation for that?

      moritz,

      This is not an issue of indexing. In fact, this should be compatible with most any search indexing system. MySQL supports the Asian languages well enough to satisfy me. The difficulty here is more of a Perl problem.

      The issue is that of reformatting the search from a few search words into a MySQL "SELECT * FROM MyTable WHERE ..." type query.
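      For illustration, here is a minimal sketch of that reformatting step, assuming the search terms are separated by whitespace and should all match a single text column. The helper names build_search_query and escape_like and the column name "body" are hypothetical; only the "SELECT * FROM MyTable WHERE ..." shape comes from the post. The values are returned separately as bind parameters, which also suits a user who has only select rights.

```perl
use strict;
use warnings;
use utf8;    # this source contains literal CJK characters

# Escape LIKE wildcards so a literal % or _ in a term matches itself.
# Backslash is MySQL's default LIKE escape character.
sub escape_like {
    my ($term) = @_;
    $term =~ s/([%_\\])/\\$1/g;
    return $term;
}

# Hypothetical helper: turn a raw search string into a parameterized query.
# $column must come from the program, never from user input, because it is
# interpolated into the SQL text.
sub build_search_query {
    my ($input, $column) = @_;

    # Split on whitespace and drop empty fields (e.g. from leading spaces).
    my @terms = grep { length } split /\s+/, $input;
    return unless @terms;

    # One "column LIKE ?" per term, ANDed together.
    my $where = join ' AND ', map { "$column LIKE ?" } @terms;
    my @binds = map { '%' . escape_like($_) . '%' } @terms;

    return ("SELECT * FROM MyTable WHERE $where", @binds);
}

my ($sql, @binds) = build_search_query("北京 大学", 'body');
# With DBI the query could then be run along the lines of:
#   my $rows = $dbh->selectall_arrayref($sql, undef, @binds);
```

      Note that this only works when the user separates the terms with spaces. An unbroken run of Chinese text ends up as a single LIKE term, which is a substring match rather than a word match.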

      The core of the Perl issue seems to revolve around word boundaries. The Asian languages run all words together, so that a sentence appears as if it were one word (i.e. no white space to delimit words). The \w, \b, \d, etc. are supposed to be compatible with any language, but in actual practice they have shortcomings when dealing with word boundaries in double-byte (multi-byte) text. I have had to replace \w in my code with \p{...} type expressions.
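      One plausible reading of this problem, sketched below, is that Unicode properties only behave as documented once the input has been decoded from bytes into characters, and that \w treats every ideograph as a word character, so \b never fires inside a run of Chinese text. The sample input is an assumption; Encode::decode and the \p{Han} and \p{Latin} script properties are standard Perl.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes, as they might arrive from a web form (assumed input):
# two Chinese characters, a space, then "abc".
my $bytes = "\xE4\xB8\xAD\xE6\x96\x87 abc";

# Undecoded: the regex engine sees six separate bytes, none of them Han.
my $byte_hits = () = $bytes =~ /\p{Han}/g;

# Decoded: each ideograph is one character carrying Unicode properties.
my $text      = decode('UTF-8', $bytes);
my $char_hits = () = $text =~ /\p{Han}/g;

# \w matches Han characters as well as Latin letters, so it cannot tell
# scripts apart; script properties can.
my @han_runs   = $text =~ /(\p{Han}+)/g;
my @latin_runs = $text =~ /(\p{Latin}+)/g;
```

      Here $byte_hits counts matches against the raw bytes and $char_hits counts matches against the decoded string; only the decoded string yields the two ideographs.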

      KinoSearch lists its language compatibilities under "Features" as:

      * Full support for 12 Indo-European languages.

      My first efforts at making this program work on Chinese also failed miserably. I was disappointed that the Perl regex would not work as it was supposed to according to the documentation I had found. (I had used \w in the beginning.)

      So, for KinoSearch to have the same flaw would not surprise me at all. Most programmers do not purposely avoid the common regex tokens just so that they can be certain their code will be compatible with any language.

      Who knows...maybe I'm not reinventing this wheel after all?

      Blessings,

      ~ Polyglot ~

      The stable branch of KinoSearch (0.165) doesn't handle UTF-8 properly. You need the dev branch for that (0.20_01 and above). For Asian languages, you absolutely need UTF-8, or support for native encodings like Shift-JIS.

      Tokenizing is also quite a challenge for Asian languages, particularly Japanese, and KinoSearch doesn't have a dedicated CJK tokenizer class or anything like that. It's on the todo list, but not very high -- I'm more concerned with making sure that the framework will allow others to write high-performance KSx extensions than with writing everything myself.
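      As a rough illustration of what such tokenizing can involve, the sketch below uses overlapping bigrams, a common workaround for text without spaces. It is not part of KinoSearch and does not use its API; the helper name bigram_tokens is hypothetical, and it assumes already-decoded text and handles only Han characters (kana and hangul would need their own properties).

```perl
use strict;
use warnings;

# Hypothetical helper: split decoded text into overlapping two-character
# tokens for each run of Han characters. Runs of other word characters
# are passed through whole.
sub bigram_tokens {
    my ($text) = @_;
    my @tokens;

    # First alternative: a run of Han characters.
    # Second alternative: a run of word characters that are not Han.
    while ($text =~ /(\p{Han}+)|([^\W\p{Han}]+)/g) {
        if (defined $1) {
            my $run = $1;
            if (length($run) == 1) {
                push @tokens, $run;    # a lone ideograph stays as it is
            }
            else {
                push @tokens, substr($run, $_, 2) for 0 .. length($run) - 2;
            }
        }
        else {
            push @tokens, $2;
        }
    }
    return @tokens;
}
```

      Indexing and querying with the same bigrams lets a search for two adjacent characters match anywhere inside a longer run, at the cost of a larger index and some false positives.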
