Re^3: Database Search Format Engine

by moritz (Cardinal)
on Jun 17, 2009 at 17:24 UTC ( [id://772484]=note )


in reply to Re^2: Database Search Format Engine
in thread Database Search Format Engine

You should really use an existing search engine for that; they contain all the logic you need. There are various other search engines out there that I haven't mentioned, and I'm sure you'll find one that works for you.
I will be working with the Asian languages, and therefore KinoSearch is out

I know nothing about indexing Asian languages, so out of idle curiosity I wonder what issue KinoSearch has with them (and MySQL doesn't). Is there a simple explanation for that?

Replies are listed 'Best First'.
Re^4: Database Search Format Engine
by creamygoodness (Curate) on Jun 18, 2009 at 04:21 UTC

    The stable branch of KinoSearch (0.165) doesn't handle UTF-8 properly. You need the dev branch for that (0.20_01 and above). For Asian languages, you absolutely need UTF-8, or support for native encodings like Shift-JIS.
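
    As a rough illustration of the encoding point (this is plain Perl using the core Encode module, not anything KinoSearch-specific), text that arrives in a native encoding such as Shift-JIS can be decoded to Perl characters and re-encoded as UTF-8 before it is handed to an indexer. The sample string is made up for the example.

```perl
use strict;
use warnings;
use utf8;                          # the string literal below is UTF-8 source text
use Encode qw(decode encode);

# Hypothetical sample text; in practice this would come from a file or a form.
my $original = "日本語";

# Simulate input arriving as Shift-JIS bytes.
my $sjis_bytes = encode('shiftjis', $original);

# Decode the bytes into a Perl character string ...
my $text = decode('shiftjis', $sjis_bytes);

# ... and re-encode as UTF-8 for storage or indexing.
my $utf8_bytes = encode('UTF-8', $text);

printf "%d characters, %d Shift-JIS bytes, %d UTF-8 bytes\n",
    length($text), length($sjis_bytes), length($utf8_bytes);
```

    The character count stays the same across encodings while the byte count changes, which is why code that treats the data as bytes tends to go wrong.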

    Tokenizing is also quite a challenge for Asian languages, particularly Japanese, and KinoSearch doesn't have a dedicated CJK tokenizer class or anything like that. It's on the todo list, but not very high -- I'm more concerned with making sure that the framework will allow others to write high-performance KSx extensions than with writing everything myself.
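
    To give an idea of what a very simple CJK tokenizer might look like, here is a sketch that uses overlapping character bigrams, a common fallback when no dictionary-based segmenter is available. It is not a KinoSearch class; the function name cjk_bigrams and the choice of scripts are assumptions made for the example.

```perl
use strict;
use warnings;
use utf8;                                   # string literals below are UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: split CJK runs into overlapping two-character tokens
# and keep Latin words and digit runs whole (lowercased).
sub cjk_bigrams {
    my ($text) = @_;
    my @tokens;
    while ($text =~ /([\p{Han}\p{Hiragana}\p{Katakana}]+)|([\p{Latin}\p{Nd}]+)/g) {
        my ($cjk, $other) = ($1, $2);
        if (defined $cjk) {
            if (length($cjk) == 1) {
                push @tokens, $cjk;         # a lone character is its own token
            }
            else {
                push @tokens, substr($cjk, $_, 2) for 0 .. length($cjk) - 2;
            }
        }
        else {
            push @tokens, lc $other;
        }
    }
    return @tokens;
}

print join('|', cjk_bigrams("東京都 Perl")), "\n";
```

    Bigrams over-generate tokens, but they avoid the need to know where the real word boundaries are, which is the hard part for Japanese in particular.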

Re^4: Database Search Format Engine
by Polyglot (Chaplain) on Jun 17, 2009 at 23:46 UTC
    moritz,

    This is not an issue of indexing. In fact, this should be compatible with most any search indexing system. MySQL supports the Asian languages well enough to satisfy me. The difficulty here is more of a Perl problem.

    The issue is that of reformatting the search from a few search words into a MySQL "SELECT * FROM MyTable WHERE ..." type query.
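
    A minimal sketch of that reformatting step, assuming whitespace-separated search words that must all match and a single text column. The function name build_search_query and the column name "body" are made up for illustration; the words go into bind values rather than into the SQL string itself.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "word1 word2" into a parameterized query.
# $table and $column must come from trusted code, never from user input.
sub build_search_query {
    my ($input, $table, $column) = @_;

    my @words = grep { length } split ' ', $input;
    return unless @words;

    # One "column LIKE ?" condition per word, all required.
    my $where = join ' AND ', map { "$column LIKE ?" } @words;

    # Escape LIKE wildcards (backslash is MySQL's default escape character),
    # then wrap each word in % for a substring match.
    my @binds = map {
        my $w = $_;
        $w =~ s/([\\%_])/\\$1/g;
        "%$w%";
    } @words;

    return ("SELECT * FROM $table WHERE $where", @binds);
}

my ($sql, @binds) = build_search_query("foo 50%", "MyTable", "body");
print "$sql\n";
print "  bind: $_\n" for @binds;
```

    With DBI, the result could be passed along as $dbh->selectall_arrayref($sql, undef, @binds). Note that splitting on whitespace only helps for input that has whitespace, which is exactly where the Asian languages cause trouble.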

    The core of the Perl issue seems to revolve around word boundaries. The Asian languages run all words together, so that a sentence appears as if it were one word (i.e. there is no white space to delimit words). The \w, \b, \d, etc. are supposed to be compatible with any language, but in actual practice they have shortcomings when dealing with double-byte character word boundaries. I have had to replace \w in my code with \p{...} type expressions.
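
    A small demonstration of the word-boundary problem, assuming the text has already been decoded into Perl characters. On decoded text \w does match ideographs, so an unspaced sentence comes back as one long "word", while \p{Han} can at least pick out individual characters. The sample sentence is made up for the example.

```perl
use strict;
use warnings;
use utf8;                                   # string literals below are UTF-8
binmode STDOUT, ':encoding(UTF-8)';

my $text = "我喜欢编程";                     # "I like programming", no spaces

# \w+ treats the whole unspaced sentence as a single token.
my @by_w = $text =~ /\w+/g;

# Matching one Han character at a time yields per-character tokens.
my @by_han = $text =~ /\p{Han}/g;

printf "\\w+ found %d token(s); \\p{Han} found %d\n",
    scalar @by_w, scalar @by_han;
```

    If the same regexes are run against undecoded UTF-8 bytes instead, \w will generally not match the high bytes at all, which may be one reason \w appears not to work as documented.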

    KinoSearch lists its language compatibilities under "Features" as:

    * Full support for 12 Indo-European languages.

    My first efforts at making this program work on Chinese also failed miserably. I was disappointed that the Perl regex would not work as it was supposed to according to the documentation I had found. (I had used \w in the beginning.)

    So, for KinoSearch to have the same flaw would not surprise me at all. Most programmers do not purposely avoid the common regex tokens just so that they can be certain their code will be compatible with any language.

    Who knows...maybe I'm not reinventing this wheel after all?

    Blessings,

    ~ Polyglot ~
