How to reverse a (Unicode) string
by moritz (Cardinal) on Jan 09, 2008 at 21:24 UTC
You have written some Perl scripts already, and when somebody asks you how to reverse a string, you'll answer: "That's easy, just call reverse in scalar context".
And of course, you're right - if you're only considering ASCII chars.
But suppose you have a UTF-8 environment:
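A minimal sketch of such a case, assuming the string is "äu". To keep the example independent of the file encoding, the string is spelled out as the raw UTF-8 bytes that Perl sees when the script has no use utf8; the hex_dump helper is a hypothetical name added only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: the string is "äu", written as the raw UTF-8 bytes Perl sees
# without "use utf8": 0xC3 0xA4 for "ä", then 0x75 for "u".
my $bytes = "\xC3\xA4u";

# reverse in scalar context reverses the string byte by byte.
my $reversed = reverse $bytes;

# Hypothetical helper: render a string as hex bytes so the result is
# readable regardless of the terminal encoding.
sub hex_dump {
    my ($string) = @_;
    return join ' ', map { sprintf('0x%02X', ord($_)) } split //, $string;
}

print "$reversed\n";               # "u" followed by two invalid bytes
print hex_dump($reversed), "\n";   # 0x75 0xA4 0xC3
```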
The output consists of a "u", two garbage characters, and a newline.
The reason is that "ä", like many other chars, is represented by several bytes in UTF-8, here as 0xC3 0xA4. reverse works on bytes, so it will produce 0xA4 0xC3. That is not legal UTF-8, so the output contains two bytes of garbage.
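One way to get the expected result is to make Perl treat the literal as text rather than bytes. The following is a sketch, assuming the script file itself is saved as UTF-8 and the terminal expects UTF-8; reverse_text is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                             # string literals in this file are text

binmode STDOUT, ':encoding(UTF-8)';   # encode text as UTF-8 again on output

# Hypothetical helper: reverse a text string codepoint by codepoint.
sub reverse_text {
    my ($text) = @_;
    return scalar reverse $text;
}

my $text = "äu";                      # two codepoints: U+00E4, U+0075

print reverse_text($text), "\n";      # uä
printf "%d codepoints\n", length $text;
```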
Putting use utf8; at the top of the script takes care that every string literal in the script is treated as a text string, so reverse (and other functions like uc) will work at the codepoint level.
While this example worked, it could just as well fail.
The reason is that there are multiple ways to encode some characters.
Consider the letter "Ä", which has the Unicode name LATIN CAPITAL LETTER A WITH DIAERESIS. You could also write that as two codepoints: LATIN CAPITAL LETTER A, COMBINING DIAERESIS. That is a base character, in this case "A", and a combining character, here the COMBINING DIAERESIS.
Converting one representation into the other is called "Unicode normalization".
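The two representations of "Ä" can be converted into each other with the core module Unicode::Normalize. This is only an illustrative sketch; the codepoints helper is a hypothetical name, and the strings are written with escapes so the example does not depend on the file encoding.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{C4}";     # LATIN CAPITAL LETTER A WITH DIAERESIS
my $decomposed = "A\x{308}";   # LATIN CAPITAL LETTER A, COMBINING DIAERESIS

# Hypothetical helper: list the codepoints of a string as U+XXXX.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $string;
}

print codepoints(NFD($composed)), "\n";     # U+0041 U+0308
print codepoints(NFC($decomposed)), "\n";   # U+00C4
```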
Unfortunately, in our case reverse doesn't work for the decomposed (normalized) form:
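A sketch that reproduces the problem, assuming the input is "ÄO" decomposed with NFKD from Unicode::Normalize. The codepoints helper is a hypothetical name used to make the order of the codepoints visible.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD NFC);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: list the codepoints of a string as U+XXXX.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $string;
}

my $str      = NFKD("\x{C4}O");   # "ÄO" decomposed: A, COMBINING DIAERESIS, O
my $reversed = reverse $str;      # O, COMBINING DIAERESIS, A

print codepoints($str), "\n";        # U+0041 U+0308 U+004F
print codepoints($reversed), "\n";   # U+004F U+0308 U+0041
print "$reversed\n";                 # displays as "ÖA"
```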
You can see that reversing a string moves the combining character(s) to the front, thus they are applied to the wrong base character; "ÄO" reversed becomes "ÖA".
(You wouldn't normalize with NFKD here under normal circumstances; in this example it is done to demonstrate the problem that can arise from such strings.)
It seems the problem could easily be solved by not doing the normalization in the first place, and indeed that works in this example. But there are Unicode graphemes that can't be expressed with a single codepoint, and if one of your users enters such a grapheme, your application won't work correctly.
So we need a "real" solution. Since Perl's built-in reverse doesn't work with graphemes, we need a CPAN module that does:
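A minimal sketch, assuming String::Multibyte is installed from CPAN and using its strrev method with the Grapheme charset. The reverse_graphemes wrapper is a hypothetical helper name, and the input is the same decomposed "ÄO" as before.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD NFC);
use String::Multibyte;              # from CPAN, not a core module

binmode STDOUT, ':encoding(UTF-8)';

# An object that treats each grapheme as one character.
my $mb = String::Multibyte->new('Grapheme');

# Hypothetical helper: reverse a string grapheme by grapheme.
sub reverse_graphemes {
    my ($string) = @_;
    return $mb->strrev($string);
}

my $str = NFKD("\x{C4}O");          # A, COMBINING DIAERESIS, O

# The combining character stays attached to its base character "A".
print reverse_graphemes($str), "\n";   # displays as "OÄ"
```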
The String::Multibyte::Grapheme module helps you with reversing the string without tearing the graphemes apart.
(It can also count the number of graphemes, generate substrings with grapheme semantics and more; see String::Multibyte.)
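Counting and substring extraction follow the same pattern. The sketch below assumes the length and substr methods of String::Multibyte behave as its documentation describes; the variable names are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD NFC);
use String::Multibyte;              # from CPAN, not a core module

my $mb  = String::Multibyte->new('Grapheme');
my $str = NFKD("\x{C4}O");          # A, COMBINING DIAERESIS, O

# The built-in length counts codepoints, the module counts graphemes.
printf "%d codepoints\n", length $str;        # 3 codepoints
printf "%d graphemes\n",  $mb->length($str);  # 2 graphemes

# Take the first grapheme: the base character plus its combining mark.
my $first = $mb->substr($str, 0, 1);
printf "first grapheme has %d codepoints\n", length $first;
```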