Answer: How do I normalize (e.g. strip) diacritical märks from a Unicode string?
by brycen (Monk) on Apr 17, 2010 at 07:36 UTC
Unicode defines a variety of normalization forms (see http://unicode.org/reports/tr15/).
I prefer normalization form NFKD (compatibility decomposition), because it also breaks up many ligatures, though not all: the ligature Œ, for example, has no decomposition and is left unchanged.
First decompose composite characters into their component parts (e.g. letters and diacritical marks), then strip out the marks.
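The two steps can be sketched as follows. This is a minimal illustration in Python using the standard `unicodedata` module; the function name `strip_marks` is made up here, and it assumes that "marks" means nonspacing marks (Unicode general category Mn).

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose with NFKD, then drop nonspacing marks (category Mn)."""
    # Step 1: split composite characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Step 2: keep every character that is not a nonspacing mark.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# "\u00e4" is a-with-diaeresis; it decomposes to "a" + U+0308, and U+0308 is dropped.
print(strip_marks("diacritical m\u00e4rks"))  # diacritical marks
```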
Or with a full example:
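What follows is a self-contained sketch in Python rather than a definitive script. The sample strings and the helper name `strip_marks` are assumptions chosen for illustration; the samples cover an accented letter, a ligature that NFKD expands, and the Œ ligature that it leaves alone.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose with NFKD, then drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Hypothetical sample inputs, written with escapes so the file stays pure ASCII.
SAMPLES = [
    "m\u00e4rks",                   # a with diaeresis
    "cr\u00e8me br\u00fbl\u00e9e",  # grave, circumflex and acute accents
    "\ufb01sh",                     # "fi" ligature: NFKD expands it to "f" + "i"
    "\u0152uvre",                   # OE ligature: no decomposition, stays as is
]

def main() -> None:
    for sample in SAMPLES:
        # !a prints escaped code points, so the output is safe on any terminal.
        print(f"{sample!a} -> {strip_marks(sample)!a}")

if __name__ == "__main__":
    main()
```

Note that the Œ in the last sample survives untouched, so ligatures like it would need a separate mapping step if they matter for matching.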
Update: I really do mean normalization. ASCIIfying (e.g. encoding to ASCII) would destroy non-Latin text. Normalization preserves Greek, Hebrew, etc.
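To make the difference concrete, here is a small Python sketch. Both function names are hypothetical, and `asciify` is only one assumed way of "ASCIIfying" (encode to ASCII and ignore whatever does not fit).

```python
import unicodedata

def asciify(text: str) -> str:
    """Lossy: anything that cannot be encoded as ASCII is thrown away."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def strip_marks(text: str) -> str:
    """Decompose with NFKD, then drop only nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# The Greek word for "Greek", ending in alpha with tonos (U+03AC).
greek = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"

ascii_result = asciify(greek)         # empty string: every letter is lost
stripped_result = strip_marks(greek)  # same word, final alpha without the accent
```

Both behave the same on Latin text such as "märks", which is why the distinction is easy to overlook until non-Latin input shows up.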
I am supporting clients in various languages who want the fuzzy matching that stripping diacriticals provides. It might cause the occasional confusion between German bears and bars (Bär vs. Bar), but that is much better than missing all the potential correct matches. For example, in Hebrew the vowel points are not normally written except in text for children. Stripping the vowel and pronunciation diacriticals lets you compare the text as an adult searcher will likely enter it.
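A rough Python sketch of that kind of comparison is below. The name `fuzzy_equal` and the test words are assumptions for illustration, and a real search would probably also want case folding.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose with NFKD, then drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def fuzzy_equal(a: str, b: str) -> bool:
    """Compare two strings after stripping diacritical marks from both."""
    return strip_marks(a) == strip_marks(b)

# Hebrew "shalom" with vowel points and shin dot, as in a children's text...
pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
# ...and unpointed, as an adult searcher would likely type it.
unpointed = "\u05e9\u05dc\u05d5\u05dd"

hebrew_match = fuzzy_equal(pointed, unpointed)  # True: the points are all Mn
bear_vs_bar = fuzzy_equal("B\u00e4r", "Bar")    # True: the accepted false positive
```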