Re: Convert international characters to plain ASCII

by graff (Chancellor)
on Apr 04, 2012 at 02:04 UTC

in reply to Convert international characters to plain ASCII

You want Unicode::Normalize, and in particular its NFD() function, which converts a string to its "canonical decomposition": every precomposed code point that combines a letter with a diacritic is replaced by the bare letter followed by the separate "combining" form of the diacritic mark.

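Once you have the string in that form, you get rid of the diacritic marks (leaving the letters in place) by deleting every character that has the Unicode "Mark" property, which a regex matches with \p{M} (or the short form \pM).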
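A minimal sketch of that step, assuming the input is already a decoded Perl character string (not raw bytes). The helper name strip_diacritics is made up for illustration and is not part of Unicode::Normalize:

```perl
use strict;
use warnings;
use utf8;                           # string literals in this file are UTF-8
use Unicode::Normalize qw(NFD);

# strip_diacritics is a hypothetical helper name, not a library function.
sub strip_diacritics {
    my ($text) = @_;
    my $decomposed = NFD($text);    # e.g. "é" becomes "e" followed by U+0301
    $decomposed =~ s/\p{M}//g;      # delete every combining mark
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_diacritics("Crème brûlée, señor Ångström"), "\n";
```

With that input the sketch should print "Creme brulee, senor Angstrom". If your data arrives as UTF-8 bytes, decode it first (for example with Encode::decode), otherwise NFD() will not see the accented characters as single code points.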

(See the description of the "\p" regex options in perlunicode, perluniprops and perlre.)

Update: I forgot to mention that even after taking care of the diacritic marks, you are likely to still have some non-ASCII characters left behind, i.e. letters or punctuation outside the ASCII range that are not simply an ASCII letter plus a diacritic mark (for example "ß", "ø", "æ", or typographic quotes). You might need to tailor some ad-hoc replacements for those if you really need the data to be coherent in an ASCII-only environment.
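One way such ad-hoc replacements could look, as a rough sketch: the table %ascii_for and the function name to_ascii are invented for this example, the table is deliberately incomplete, and replacing unknown characters with "?" is just one possible policy.

```perl
use strict;
use warnings;
use utf8;                           # string literals in this file are UTF-8
use Unicode::Normalize qw(NFD);

# Hypothetical, incomplete table; extend it to suit your own data.
my %ascii_for = (
    "ß" => "ss", "æ" => "ae", "Æ" => "AE",
    "ø" => "o",  "Ø" => "O",  "đ" => "d", "ł" => "l",
    "\x{2018}" => "'", "\x{2019}" => "'",   # single curly quotes
    "\x{201C}" => '"', "\x{201D}" => '"',   # double curly quotes
    "\x{2013}" => "-", "\x{2014}" => "-",   # en and em dashes
);

# to_ascii is a made-up name for this sketch.
sub to_ascii {
    my ($text) = @_;
    my $out = NFD($text);
    $out =~ s/\p{M}//g;             # drop combining marks first
    # Map whatever non-ASCII is left; flag anything unknown with "?".
    $out =~ s{([^\x00-\x7F])}{ exists $ascii_for{$1} ? $ascii_for{$1} : '?' }ge;
    return $out;
}

binmode STDOUT, ':encoding(UTF-8)';
print to_ascii("Straße in Århus, “Søren”"), "\n";
```

Flagging unmapped characters rather than silently dropping them makes it easier to spot what the table still lacks. For broader coverage, a CPAN module such as Text::Unidecode may be worth a look instead of maintaining the table by hand.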

