Re: Convert international characters to plain ASCII

by graff (Chancellor)
on Apr 04, 2012 at 02:04 UTC (#963348)


in reply to Convert international characters to plain ASCII

You want Unicode::Normalize, and in particular its NFD() function, which converts a string to its "canonical decomposition": every precomposed code point that combines a letter with a diacritic is converted to the bare letter followed by the separate "combining form" of the diacritic mark.

Once you have the string in that form, you get rid of the diacritic marks (leaving the letters in place) as follows:

s/\pM+//g;
(See the description of the "\p" regex options in perlunicode, perluniprops and perlre.)
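To put the two steps together, here is a minimal sketch, assuming the input is already a decoded Perl character string (not raw UTF-8 bytes). The helper name strip_marks is just made up for illustration; NFD() and the \pM property are the only pieces taken from above.

```perl
use strict;
use warnings;
use utf8;                          # this source file contains literal UTF-8 text
use Unicode::Normalize qw(NFD);

# strip_marks is a hypothetical helper name, not part of any module.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);   # e.g. "é" becomes "e" followed by U+0301
    $decomposed =~ s/\pM+//g;      # delete the combining marks, keep the letters
    return $decomposed;
}

binmode(STDOUT, ':encoding(UTF-8)');
print strip_marks("Café Ångström naïve"), "\n";   # prints "Cafe Angstrom naive"
```

If your data comes from a file or socket, you would decode it first (for example with an :encoding(UTF-8) layer on the filehandle); otherwise NFD() sees bytes rather than characters.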

Update: I forgot to mention that even after taking care of the diacritic marks, you are likely to still have some non-ASCII characters left behind (i.e. things that are not an ASCII letter plus a diacritic mark, but letters or punctuation that fall outside the ASCII range and have no canonical decomposition). You might need to tailor some ad-hoc replacements for those if you really need the data to be coherent in an ASCII-only environment.
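One way to handle the leftovers is a small lookup table applied after the mark-stripping step. This is only a sketch: the %fallback table, the to_ascii name and the choice of "?" for unmapped characters are all assumptions for illustration, and the table would need tailoring to whatever actually shows up in your data.

```perl
use strict;
use warnings;
use utf8;                          # this source file contains literal UTF-8 text
use Unicode::Normalize qw(NFD);

# Hypothetical ad-hoc table: characters that NFD() cannot reduce to ASCII.
my %fallback = (
    "ß" => "ss", "æ" => "ae", "Æ" => "AE",
    "ø" => "o",  "Ø" => "O",  "ł" => "l", "Ł" => "L",
    "\x{2018}" => "'", "\x{2019}" => "'",    # curly single quotes
    "\x{201C}" => '"', "\x{201D}" => '"',    # curly double quotes
    "\x{2013}" => "-", "\x{2014}" => "-",    # en and em dashes
);

# to_ascii is a hypothetical helper name.
sub to_ascii {
    my ($text) = @_;
    my $out = NFD($text);
    $out =~ s/\pM+//g;             # drop combining marks first
    # Map whatever non-ASCII remains; unknown characters become "?".
    $out =~ s/([^\x00-\x7F])/exists $fallback{$1} ? $fallback{$1} : '?'/ge;
    return $out;
}

print to_ascii("Straße in Łódź"), "\n";   # prints "Strasse in Lodz"
```

Replacing unknown characters with "?" makes gaps in the table easy to spot; you could just as well delete them or die, depending on how strict the downstream consumer is.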

