Using the unicode61 tokenizer in DBD::SQLite
by elef (Friar) on Jan 05, 2014 at 19:04 UTC
elef has asked for the wisdom of the Perl Monks concerning the following question:
Hi all, I'm trying to get the unicode61 tokenizer working in DBD::SQLite. The goal is correct Unicode case folding: SQLite should treat Á as the upper-case form of á, so that MATCH queries against an FTS4 table return case-insensitive hits.
According to http://www.sqlite.org/fts3.html#tokenizer:
"The "unicode61" tokenizer is available beginning with SQLite version 3.7.13. Unicode61 works very much like "simple" except that it does full unicode case folding according to rules in Unicode Version 6.1 and it recognizes unicode space and punctuation characters and uses those to separate tokens. The simple tokenizer only does case folding of ASCII characters and only recognizes ASCII space and punctuation characters as token separators."
I just updated DBD::SQLite to 1.40 and made sure I have SQLite version 3.7.17.
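For reference, here is a minimal sketch of how both versions can be confirmed from Perl itself. It reports the SQLite library that DBD::SQLite is actually linked against, which may differ from a system-wide sqlite3 binary. The in-memory database is only an assumption for the sake of a self-contained example.

```perl
use strict;
use warnings;
use DBI;

# Connect to a throwaway in-memory database (assumption: any SQLite
# handle works equally well for querying the version).
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1 });

# Version of the Perl driver module.
print "DBD::SQLite version:    $DBD::SQLite::VERSION\n";

# Version of the SQLite library the driver is using.
print "SQLite library version: $dbh->{sqlite_version}\n";
```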
Yet when I run CREATE VIRTUAL TABLE mytable USING fts4 (tokenize=unicode61), I get: "DBD::SQLite::db do failed: unknown tokenizer: unicode61".
Was SQLite compiled without enabling the unicode61 tokenizer? (Some sources mention compiling SQLite with SQLITE_ENABLE_FTS4_UNICODE61 in order to get this functionality.) Do I have any options here?
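To narrow the problem down, something like the following probe could show which tokenizers the linked SQLite build accepts. This is only a sketch: `tokenizer_available` and the `probe_*` table names are hypothetical helpers invented for this example, not part of DBD::SQLite. It simply attempts the CREATE VIRTUAL TABLE in an in-memory database and traps the error.

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, PrintError => 0 });

# Hypothetical helper: returns 1 if an FTS4 table can be created with
# the named tokenizer, 0 if SQLite rejects the statement.
sub tokenizer_available {
    my ($dbh, $tokenizer) = @_;
    my $ok = eval {
        $dbh->do("CREATE VIRTUAL TABLE probe_$tokenizer "
               . "USING fts4(content, tokenize=$tokenizer)");
        $dbh->do("DROP TABLE probe_$tokenizer");
        1;
    };
    return $ok ? 1 : 0;
}

# 'simple' serves as a control: if it fails too, FTS itself is unavailable.
for my $name (qw(simple unicode61)) {
    printf "%-10s %s\n", $name,
        tokenizer_available($dbh, $name) ? 'available' : 'not available';
}
```

If "simple" is reported as available but "unicode61" is not, the tokenizer was most likely left out when the bundled SQLite was compiled.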