Re: Database Search Format Engine

by moritz (Cardinal)
on Jun 17, 2009 at 16:27 UTC

in reply to Database Search Format Engine

Maybe you want a search engine like KinoSearch or Plucene?

Or if you want to give the user full access to your database, just create a database user with limited permissions (like only SELECT, no UPDATE, INSERT etc.), and let them write the SQL themselves.

Or what is it that you want to achieve in the end?

Replies are listed 'Best First'.
Re^2: Database Search Format Engine
by Polyglot (Pilgrim) on Jun 17, 2009 at 16:43 UTC
    I'm trying to achieve a web-based Google-like search engine for users wishing to search a particular library of books in the database. For example, if the book were the King James Bible (no fears of copyright infractions here), the user might enter the following to search for every verse in the Bible which matched:

    (savior OR saviour OR christ OR jesus OR messiah OR "prince of peace" OR "son of god" OR "son of man") AND (michael OR archangel OR prince OR king)

    ...and the code must convert that search query into the appropriate results list from the MySQL database to give back to the user. But the tricky part is reshaping those search terms into a MySQL-compatible select query, which is what my code here can do.
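
    As an illustration of the conversion described above, here is a minimal sketch in Python (not the Perl code this thread refers to). It parses a query with AND, OR, parentheses and quoted phrases into a parameterized WHERE clause. The column name verse_text, the function names, the use of LIKE, and the DBI-style ? placeholders are all assumptions made for the example; adjacent terms with no operator are treated as an implicit AND.

```python
import re

COLUMN = "verse_text"  # hypothetical column name, not taken from the thread

# One token: a parenthesis, a "quoted phrase", or a bare word.
TOKEN_RE = re.compile(r'\s*(?:([()])|"([^"]*)"|([^\s()"]+))')


def tokenize(query):
    """Turn the query into a list of (kind, text) pairs."""
    query = query.strip()
    tokens, pos = [], 0
    while pos < len(query):
        match = TOKEN_RE.match(query, pos)
        if match is None:
            raise ValueError(f"cannot parse query at position {pos}")
        paren, phrase, word = match.groups()
        if paren:
            tokens.append((paren, paren))
        elif phrase is not None:
            tokens.append(("TERM", phrase))  # quoted text is always a term
        elif word.upper() in ("AND", "OR"):
            tokens.append((word.upper(), word))
        else:
            tokens.append(("TERM", word))
        pos = match.end()
    return tokens


def to_where(query):
    """Return (where_clause, params) for a boolean search query."""
    tokens = tokenize(query)
    params = []
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        parts = [parse_and()]
        while peek() == "OR":
            advance()
            parts.append(parse_and())
        return parts[0] if len(parts) == 1 else "(" + " OR ".join(parts) + ")"

    def parse_and():
        parts = [parse_factor()]
        # An explicit AND is optional: adjacent terms are ANDed together.
        while peek() in ("AND", "TERM", "("):
            if peek() == "AND":
                advance()
            parts.append(parse_factor())
        return parts[0] if len(parts) == 1 else "(" + " AND ".join(parts) + ")"

    def parse_factor():
        if peek() == "(":
            advance()
            inner = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return inner
        if peek() == "TERM":
            text = advance()[1]
            # Escape LIKE wildcards so user input is matched literally.
            escaped = re.sub(r"([\\%_])", r"\\\1", text)
            params.append(f"%{escaped}%")
            return f"{COLUMN} LIKE ?"
        raise ValueError("expected a search term")

    clause = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token after end of query")
    return clause, params
```

    Calling to_where('(savior OR "son of god") AND (michael OR king)') yields a clause with four LIKE ? tests and the four search strings as a separate parameter list, so the user's text is never spliced into the SQL itself. Whether LIKE is fast enough for a whole library of books is a separate question that this sketch does not address.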


    ~ Polyglot ~

Re^2: Database Search Format Engine
by Polyglot (Pilgrim) on Jun 17, 2009 at 16:55 UTC
    I should add that, for my needs, I will be working with the Asian languages, and therefore KinoSearch is out. Having looked again at Plucene just now, it is unclear whether or not it would be helpful.

    Also, as this is web-based, the user will only be given select rights, and not the full DB rights as you have suggested.

    Having the user write the SELECT query themselves would be an interesting option, but most of my users will be average non-programmers without a clue as to how to do this, so the default behavior must be for the script to accept plain search terms and format them into the DB query.


    ~ Polyglot ~

      You should really use an existing search engine for that; they contain all the logic you need. There are various other search engines that I haven't mentioned, and I'm sure you'll find one that works for you.

      "I will be working with the Asian languages, and therefore KinoSearch is out"

      I know nothing about indexing Asian languages, so out of idle curiosity: what issue does KinoSearch have with them that MySQL doesn't? Is there a simple explanation?

        The stable branch of KinoSearch (0.165) doesn't handle UTF-8 properly. You need the dev branch for that (0.20_01 and above). For Asian languages, you absolutely need UTF-8, or support for native encodings like Shift-JIS.

        Tokenizing is also quite a challenge for Asian languages, particularly Japanese, and KinoSearch doesn't have a dedicated CJK tokenizer class or anything like that. It's on the todo list, but not very high -- I'm more concerned with making sure that the framework will allow others to write high-performance KSx extensions than with writing everything myself.


        This is not an issue of indexing. In fact, this should be compatible with most any search indexing system. MySQL supports the Asian languages well enough to satisfy me. The difficulty here is more of a Perl problem.

        The issue is that of reformatting the search from a few search words into a MySQL "SELECT * FROM MyTable WHERE ..." type query.

        The core of the Perl issue seems to revolve around word boundaries. The Asian languages run all words together, so that a sentence appears as if it were one word (i.e. there is no white space to delimit words). \w, \b, \d, etc. are supposed to be compatible with any language, but in actual practice they have shortcomings when dealing with word boundaries between double-byte characters. I have had to replace \w in my code with \p{...} type expressions.
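
        The word-boundary problem can be reproduced outside Perl. The following is a small Python sketch, not the Perl code under discussion, showing one plausible cause: every ideograph counts as a word character, so \w+ swallows an unspaced run whole and \b never fires inside it. The function names and the per-character workaround are assumptions for illustration, and the range used covers only the main CJK Unified Ideographs block.

```python
import re

CJK = r"\u4e00-\u9fff"  # CJK Unified Ideographs only; other blocks are ignored here


def naive_words(text):
    """Split on \\w+ as one would for a space-delimited language."""
    return re.findall(r"\w+", text)


def cjk_aware_tokens(text):
    """Hypothetical workaround: one token per ideograph, runs for everything else."""
    return re.findall(rf"[{CJK}]|[^\W{CJK}]+", text)


def find_as_word(needle, text):
    """Look for needle bounded by \\b on both sides."""
    return re.search(rf"\b{re.escape(needle)}\b", text)


sample = "我喜欢读书 and I like reading"
print(naive_words(sample))        # the Chinese run comes back as a single "word"
print(cjk_aware_tokens(sample))   # each ideograph is its own token
print(find_as_word("读书", "我喜欢读书"))  # None: no boundary between two ideographs
```

        Splitting into single ideographs is crude (real segmentation needs a dictionary or n-grams), but it shows why a search built on \w and \b finds nothing inside an unspaced sentence.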

        KinoSearch lists its language compatibilities under "Features" as:

        * Full support for 12 Indo-European languages.

        My first efforts at making this program work on Chinese also failed miserably. I was disappointed that the Perl regex would not work as it was supposed to according to the documentation I had found. (I had used \w in the beginning.)

        So, for KinoSearch to have the same flaw would not surprise me at all. Most programmers do not purposely avoid the common regex tokens just so that they can be certain their code will be compatible with any language.

        Who knows...maybe I'm not reinventing this wheel after all?


        ~ Polyglot ~
