Re^3: Database Search Format Engine

You should really use an existing search engine for that; they contain all the logic you need. There are various other search engines out there that I haven't mentioned, and I'm sure you'll find one that works for you.

I will be working with the Asian languages, and therefore KinoSearch is out.

I know nothing about indexing Asian languages, so out of idle curiosity I wonder what issue KinoSearch has with them (and MySQL doesn't). Is there a simple explanation for that?


Replies are listed 'Best First'.
Re^4: Database Search Format Engine by creamygoodness (Curate) on Jun 18, 2009 at 04:21 UTC
The stable branch of KinoSearch (0.165) doesn't handle UTF-8 properly. You need the dev branch for that (0.20_01 and above). For Asian languages, you absolutely need UTF-8, or support for native encodings like Shift-JIS. Tokenizing is also quite a challenge for Asian languages, particularly Japanese, and KinoSearch doesn't have a dedicated CJK tokenizer class or anything like that. It's on the todo list, but not very high -- I'm more concerned with making sure that the framework will allow others to write high-performance KSx extensions than with writing everything myself.
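
To make the tokenizing problem concrete: when no dictionary-based segmenter is available, one common fallback is to index overlapping two-character tokens (bigrams) for each run of CJK text. The following is only a minimal sketch in plain Perl and is not part of the KinoSearch or KSx API; the sub name cjk_bigrams is hypothetical, and it assumes the input has already been decoded to characters.

```perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 literals

# Hypothetical helper (not a KinoSearch class): split a decoded string
# into overlapping two-character tokens for each run of Han, Hiragana,
# or Katakana characters. A run of length one is kept as a single token.
sub cjk_bigrams {
    my ($text) = @_;
    my @tokens;
    while ($text =~ /([\p{Han}\p{Hiragana}\p{Katakana}]+)/g) {
        my $run = $1;
        if (length($run) == 1) {
            push @tokens, $run;
            next;
        }
        push @tokens, substr($run, $_, 2) for 0 .. length($run) - 2;
    }
    return @tokens;
}

binmode STDOUT, ':encoding(UTF-8)';
print join(' ', cjk_bigrams('東京都に住む')), "\n";
# prints: 東京 京都 都に に住 住む
```

Bigrams over-generate tokens compared with a real segmenter, but they need no dictionary and they only work if the text is handled as characters rather than bytes, which is why the UTF-8 support matters first.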
Re^4: Database Search Format Engine by Polyglot (Chaplain) on Jun 17, 2009 at 23:46 UTC
moritz,

This is not an issue of indexing. In fact, this should be compatible with almost any search indexing system. MySQL supports the Asian languages well enough to satisfy me. The difficulty here is more of a Perl problem: reformatting a few search words into a MySQL "SELECT * FROM MyTable WHERE ..." type query.

The core of the Perl issue seems to revolve around word boundaries. The Asian languages run all words together, so that a sentence appears as if it were one word (i.e., there is no white space to delimit words). \w, \b, \d, etc. are supposed to be compatible with any language, but in actual practice they have shortcomings when dealing with word boundaries in double-byte characters. I have had to replace \w in my code with \p{...} type expressions.

KinoSearch lists its language compatibilities under "Features" as:

* Full support for 12 Indo-European languages.

My first efforts at making this program work on Chinese also failed miserably. I was disappointed that the Perl regex would not work as it was supposed to according to the documentation I had found. (I had used \w in the beginning.) So, for KinoSearch to have the same flaw would not surprise me at all. Most programmers do not purposely avoid the common regex tokens just so that they can be certain their code will be compatible with any language.

Who knows...maybe I'm not reinventing this wheel after all?

Blessings,

~ Polyglot ~
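
The reformatting step described above can be sketched roughly as follows. This is not Polyglot's actual code: the sub name build_query and the column name "body" are hypothetical, and it assumes the search words arrive as raw UTF-8 bytes. One frequent reason \w appears to fail on Chinese is that the string has not been decoded yet, so the sketch decodes first and then uses \p{...} properties, as the post describes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw UTF-8 bytes typed by a user into a SQL
# string with placeholders, plus the list of values to bind to them.
sub build_query {
    my ($raw_bytes) = @_;
    my $text  = decode('UTF-8', $raw_bytes);    # bytes -> characters
    my @terms = $text =~ /([\p{L}\p{N}]+)/g;    # runs of letters/digits, any script
    return unless @terms;

    # "body" is an assumed column name; MyTable comes from the post above.
    my $where = join ' AND ', ('body LIKE ?') x @terms;
    return ("SELECT * FROM MyTable WHERE $where", map { "%$_%" } @terms);
}

# "北京 大学" written as UTF-8 bytes: two terms separated by a space.
my ($sql, @bind) =
    build_query("\xE5\x8C\x97\xE4\xBA\xAC \xE5\xA4\xA7\xE5\xAD\xA6");
# $sql is "SELECT * FROM MyTable WHERE body LIKE ? AND body LIKE ?"
```

The values in @bind would be passed to the database driver alongside $sql rather than interpolated into it. Note that a whole sentence typed without spaces comes back as a single term, which is exactly the word-boundary limitation discussed in this thread; splitting it further needs a segmenter or an n-gram scheme.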

