
Re: Convert international characters to plain ASCII

by graff (Chancellor)
on Apr 04, 2012 at 02:04 UTC (#963348)

in reply to Convert international characters to plain ASCII

You want Unicode::Normalize, and specifically its NFD() function, which converts a string to its "canonical decomposition": every single-character code point that combines a letter with a diacritic is converted to the bare letter followed by the separate "combining" form of the diacritic mark.

Once you have the string in that form, you get rid of the diacritic marks (leaving the letters in place) with a regex substitution that deletes every character having the Unicode "Mark" property, \p{M}.
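The snippet that belonged here did not survive, so what follows is only a minimal sketch of the approach described above, not necessarily the code as originally posted. The helper name strip_marks() and the sample string are assumptions made for illustration; NFD() and the \p{M} property are the real pieces.

```perl
use strict;
use warnings;
use utf8;                               # the sample string below is literal UTF-8
use Unicode::Normalize qw(NFD);

# strip_marks() is a hypothetical helper name used only for this sketch.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);        # e.g. "é" becomes "e" + U+0301 (combining acute)
    $decomposed =~ s/\p{M}//g;          # delete every character with the Mark property
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_marks("Ångström café naïve"), "\n";   # prints "Angstrom cafe naive"
```

Note that the input must already be a decoded character string (not raw UTF-8 bytes) for NFD() to do anything useful.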

(See the description of the "\p" regex options in perlunicode, perluniprops and perlre.)

Update: I forgot to mention -- even after taking care of the diacritic marks, you are likely to still have some non-ASCII characters left behind (i.e. letters or punctuation that fall outside the ASCII range and do not decompose into an ASCII letter plus a diacritic mark). You might need to tailor some ad-hoc replacements for those if you really need the data to be coherent in an ASCII-only environment.
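As a rough illustration of what such ad-hoc replacements might look like, here is a sketch that applies a small lookup table after the marks have been stripped, and reports whatever non-ASCII characters are still left. The %fallback table, to_ascii() and leftovers() are all hypothetical names, and the mapping is deliberately incomplete; which substitutions are acceptable depends entirely on your data.

```perl
use strict;
use warnings;
use utf8;                               # literal UTF-8 in the table below
use Unicode::Normalize qw(NFD);

# Hypothetical, incomplete table: these characters have no canonical
# decomposition, so NFD() plus s/\p{M}//g leaves them untouched.
my %fallback = (
    "ß" => "ss",
    "æ" => "ae",  "Æ" => "AE",
    "ø" => "o",   "Ø" => "O",
    "đ" => "d",   "Đ" => "D",
    "ł" => "l",   "Ł" => "L",
    "\x{201C}" => '"', "\x{201D}" => '"',   # curly double quotes
    "\x{2018}" => "'", "\x{2019}" => "'",   # curly single quotes
);

sub to_ascii {
    my ($text) = @_;
    my $out = NFD($text);
    $out =~ s/\p{M}//g;                 # drop the combining marks
    # Replace remaining non-ASCII characters that have a table entry;
    # anything without an entry is left in place.
    $out =~ s/([^\x00-\x7F])/exists $fallback{$1} ? $fallback{$1} : $1/ge;
    return $out;
}

# List the distinct non-ASCII characters that to_ascii() could not handle.
sub leftovers {
    my ($text) = @_;
    my %seen;
    $seen{$_}++ for to_ascii($text) =~ /([^\x00-\x7F])/g;
    return sort keys %seen;
}

binmode STDOUT, ':encoding(UTF-8)';
print to_ascii("Straße in Łódź"), "\n";   # prints "Strasse in Lodz"
print join(" ", leftovers("Þór")), "\n";  # prints "Þ" (no table entry for it)
```

Running leftovers() over a sample of your real data is a cheap way to find out which entries the table still needs.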
