
Convert international characters to plain ASCII

by Oberon (Monk)
on Apr 04, 2012 at 01:12 UTC

Oberon has asked for the wisdom of the Perl Monks concerning the following question:

O masters of international encoding:

Perl veteran, but Unicode n00b. Let's say I have a title in a foreign language, say: "Vals På Vinkelgränd", but I need to convert it to "ASCII": that is, I want to get "Vals Pa Vinkelgrand" out of it. What Perl module can do what I want? (Note: This will be a personal script, and it will always run on 5.14.2 or higher, so I'm not worried about Unicode bugs/weird implementations in older Perls.)

For the curious, here's the application: I have a bunch of MP3 files, and many of them have these international characters (the example title is from the Movits! album Äppelknyckarjazz). I want to script a way to turn the song titles into filenames. But, unfortunately, I'm using Dropbox, and apparently its filesystem doesn't deal with Unicode characters in filenames (or filenames that differ only by case, FTM--perhaps they have a giant DOS server farm :-/ ). So, every time I store a filename with a Unicode character in it in my Dropbox folder, it comes out all garbled on the other side. So I figured the simplest thing would be to just give up, and leave the full Unicode strings in the MP3 tags, but turn the filenames into "plain ASCII." And I'd like to script this solution.

I humbly await enlightenment.

Re: Convert international characters to plain ASCII
by graff (Chancellor) on Apr 04, 2012 at 02:04 UTC
    You want Unicode::Normalize, and you want to use the NFD() function to convert a string to its "canonical decomposition", which means that all the single-character code points that involve a letter plus a diacritic will be converted to the bare letter followed by the separate "combining form" version of the diacritic mark.

    Once you have the string in that form, you get rid of the diacritic marks (leaving the letters in place) as follows:

    s/\pM+//g;
    (See the description of the "\p" regex options in perlunicode, perluniprops and perlre.)
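    Putting the two steps together, a minimal sketch might look like the following. The helper name strip_marks() is made up for illustration; NFD() is the function exported by the core Unicode::Normalize module.

```perl
use strict;
use warnings;
use utf8;                          # string literals in this file are UTF-8
use Unicode::Normalize qw(NFD);

# strip_marks() is a hypothetical helper name, not part of any module.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);   # e.g. "ä" becomes "a" + U+0308 COMBINING DIAERESIS
    $decomposed =~ s/\pM+//g;      # drop every combining mark, keep the base letters
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_marks('Vals På Vinkelgränd'), "\n";   # prints "Vals Pa Vinkelgrand"
```

    This assumes the input is already a decoded character string (not raw UTF-8 bytes); data read from files or tags would need decoding first.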

    Update: I forgot to mention -- even after taking care of the diacritic marks, be aware that you are likely to still have some non-ASCII characters left behind (i.e. things that don't involve an ASCII letter plus a diacritic mark, but are letters or punctuation that fall outside the ASCII range). You might need to tailor some ad-hoc replacements for those if you really need the data to be coherent in an ASCII-only environment.
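    One way to handle those leftovers is a small hand-made lookup table applied after the mark-stripping step. The sketch below is only an illustration: the %fallback table, its contents, the '?' placeholder and the name to_ascii() are all assumptions, and the table is deliberately incomplete.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

# Hypothetical, incomplete table: these characters have no canonical
# decomposition, so NFD plus s/\pM+//g leaves them untouched.
my %fallback = (
    'ß' => 'ss',
    'ø' => 'o',  'Ø' => 'O',
    'æ' => 'ae', 'Æ' => 'AE',
    'ł' => 'l',  'Ł' => 'L',
);

# to_ascii() is a made-up helper name for this example.
sub to_ascii {
    my ($text) = @_;
    my $out = NFD($text);
    $out =~ s/\pM+//g;    # letter-plus-diacritic cases
    # Anything still outside ASCII: use the table, else a '?' placeholder.
    $out =~ s{([^\x00-\x7F])}{ exists $fallback{$1} ? $fallback{$1} : '?' }ge;
    return $out;
}

binmode STDOUT, ':encoding(UTF-8)';
print to_ascii('Smørrebrød på Straße'), "\n";   # prints "Smorrebrod pa Strasse"
```

    Replacing unknown characters with '?' is just one choice; dropping them or dying loudly may suit a filename script better.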

Re: Convert international characters to plain ASCII
by moritz (Cardinal) on Apr 04, 2012 at 04:58 UTC
    Text::Unidecode
Re: Convert international characters to plain ASCII
by remiah (Hermit) on Apr 04, 2012 at 02:40 UTC
    I am not sure this is what you want, but I tried like this.
    use strict;
    use warnings;
    use utf8;
    use charnames ();

    #create replace table with charname
    my ($str, %table);
    $str='Vals På Vinkelgränd';
    for (129 .. 255 ){ #iso-8859-1 128 to 255
        my $name=charnames::viacode($_);
        next if (! defined ($name));
        #in case "LATIN CAPITAL LETTER A WITH..." => maybe "A" is replace letter?
        if ( $name =~ m/\w+\s+\w+\s+\w+\s(\w+)\s+/ ){
            $table{chr($_)} = $1;
        }
    }
    #replace
    print "$str\n";
    $str =~ s/(.)/exists($table{$1}) ? $table{$1}: $1/eg;
    print "$str\n";
    __DATA__
    this prints
    Vals På Vinkelgränd
    Vals PA VinkelgrAnd
    Or how about escaping it with the URI module? Regards.

    update: I also tried graff's way with Unicode::Normalize. It worked like a charm.

Re: Convert international characters to plain ASCII
by Khen1950fx (Canon) on Apr 04, 2012 at 06:51 UTC
    Following moritz's suggestion, I used Text::Unidecode and utf8:
    #!/usr/bin/perl -l
    use strict;
    use warnings;
    use utf8;
    use Text::Unidecode;
    binmode STDOUT, ":encoding(utf8)";
    my $guess1 = unidecode("Vals P\403 Vinkelgr\403nd");
    my $guess2 = print unidecode($guess1);
    Note: It works on my terminal, but not on perlmonks.org. Sorry.
Re: Convert international characters to plain ASCII
by DrHyde (Prior) on Apr 04, 2012 at 10:50 UTC

    Dropbox most certainly does handle "international characters" in filenames - I've had things there with Russian filenames, for example. Not sure what encoding it was though. This was on OS X, so probably UTF-8.

    Your filenames getting garbled is probably because you're expecting different encodings in different places. In my case because I *only* used OS X and iOS, it was the same encoding everywhere.

    As for the case-insensitivity - this is probably deliberate so as not to terminally confuse clients using Windows or using case-insensitive versions of HFS+.
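    To illustrate the kind of mismatch described above (this is a guess at the mechanism, not a diagnosis of what Dropbox actually did), here is a small sketch using the core Encode module. It writes a name as UTF-8 bytes and then reads those bytes back as if they were ISO-8859-1, which is one common way names end up garbled.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

my $title = 'Vals På Vinkelgränd';

# One side stores the name as UTF-8 bytes...
my $bytes = encode('UTF-8', $title);

# ...and the other side wrongly interprets those bytes as ISO-8859-1.
my $misread = decode('ISO-8859-1', $bytes);

# Reading the bytes with the encoding they were written in is lossless.
my $roundtrip = decode('UTF-8', $bytes);

binmode STDOUT, ':encoding(UTF-8)';
print "$misread\n";     # garbled: each non-ASCII letter turns into two characters
print "$roundtrip\n";   # prints "Vals På Vinkelgränd"
```

    If both ends agree on the encoding, as in the OS X and iOS only case, the name survives intact.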

Re: Convert international characters to plain ASCII
by tobyink (Canon) on Apr 04, 2012 at 15:41 UTC

    An alternative might be to use Punycode. This is an ASCII-compatible encoding for Unicode. It essentially strips out non-ASCII characters, and adds a bunch of base64-like crud to the end of the string.

    perl -Mutf8::all -MEncode -MEncode::Punycode -E'say encode(Punycode => "It cost me €200!")'

    It's not as pretty as the technique you suggest, but has the advantage of being completely reversible. (i.e. you can decode it and get back the original string.)

    Punycode happens to be the encoding used to store non-ASCII names in the DNS, and so it is best known for just encoding domain names. However it can in fact be applied to arbitrary Unicode strings.

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
Re: Convert international characters to plain ASCII
by jeffa (Bishop) on Apr 04, 2012 at 21:24 UTC

    Believe it or not ... this approach works:

    use utf8;
    $string =~ tr/åä/aa/s;
    You can fill in the rest of the "decorated" Latin characters as needed, as well as play with capitalization (lower casing etc.) to achieve your desired results.

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
Re: Convert international characters to plain ASCII
by Oberon (Monk) on Apr 14, 2012 at 06:11 UTC

    Thanks everyone for all the great replies! Sorry for the long delay in responding; got a fresh newborn over here. :-)

    I tried most of the methods suggested; graff's idea of using Unicode::Normalize and a s///g works well, but I eventually went with moritz's Text::Unidecode for simplicity: it does exactly what I want in one step. ++ to both you guys!

    @DrHyde: You may very well be right about the encoding issues. I do have to transfer (occasionally) between Linux and Windows machines (although less and less Windows these days), so perhaps that was the source of the problem. I just decided it was easier to strip the accented characters for the filenames.

    @jeffa: Sure, I knew I could use tr, but I was looking for something that wouldn't require me to anticipate every international character I might run across. So far, Text::Unidecode is working great for me.
