
You have written some Perl scripts already, and when somebody asks you how to reverse a string, you'll answer: "That's easy, just call reverse in scalar context".

And of course, you're right - if you're only considering ASCII chars.

But suppose you have a UTF-8 environment:

#!/usr/bin/perl
use strict;
use warnings;

print scalar reverse "\noäu";

The output consists of a "u", two garbage characters, an "o", and a newline.

The reason is that "ä", like many other chars, is represented by several bytes in UTF-8, here as 0xC3 0xA4. reverse works on bytes, so it will produce 0xA4 0xC3. That is not legal UTF-8, so the output contains two bytes of garbage.
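To make the byte swap visible, here is a minimal sketch that builds the UTF-8 bytes for "ä" with the core Encode module and prints them before and after reverse. The helper name hex_bytes is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show each byte of a string as 0xNN.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '0x%02X', ord($_) } split //, $bytes;
}

my $bytes = encode('UTF-8', "\x{E4}");       # "ä" as UTF-8 bytes
my $hex   = hex_bytes($bytes);                # byte order as stored
my $rev   = hex_bytes(scalar reverse $bytes); # byte order after reverse

print "$hex\n$rev\n";
```

The first line printed is the valid two-byte sequence, the second is the same bytes in swapped order, which no UTF-8 decoder will accept as a character.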

You can solve this problem by decoding the text strings (read perluniintro and perlunicode for more information):

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':utf8';
print scalar reverse "\noäu";
__END__
uäo

The use utf8; pragma tells perl that the source file is UTF-8 encoded, so every string literal in the script is treated as a text string, and reverse (and other functions like uc) will work at the code point level.
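Note that use utf8; only covers literals in the source file. Strings that come from outside (files, sockets, command line) still have to be decoded explicitly. A minimal sketch, assuming the input arrives as UTF-8 bytes; the hard-coded $octets merely stands in for such input:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for external input: "oäu" as raw UTF-8 bytes.
my $octets = "o\xC3\xA4u";

my $text     = decode('UTF-8', $octets);  # bytes -> text string
my $reversed = reverse $text;             # scalar context: reverses code points

# Encode again before printing to a byte-oriented handle.
print encode('UTF-8', $reversed), "\n";
```

Decoding at the input boundary and encoding at the output boundary keeps everything in between working on code points instead of bytes.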

While this example worked, it could just as well fail.

The reason is that there are multiple ways to encode some characters.

Consider the letter "Ä", which has the Unicode name LATIN CAPITAL LETTER A WITH DIAERESIS. You could also write it as two code points: LATIN CAPITAL LETTER A, COMBINING DIAERESIS. That is a base character, in this case "A", followed by a combining character, here the COMBINING DIAERESIS.

Converting one representation into the other is called "Unicode normalization".
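As a small illustration of the two representations, this sketch uses NFD and NFC from Unicode::Normalize to convert "Ä" back and forth and compares the lengths in code points. The variable names are chosen for this example only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\N{U+C4}";        # LATIN CAPITAL LETTER A WITH DIAERESIS
my $decomposed = NFD($composed);    # "A" followed by COMBINING DIAERESIS

printf "composed: %d code point(s), decomposed: %d code point(s)\n",
    length $composed, length $decomposed;

# Recomposing yields the single code point again.
my $round_trip = NFC($decomposed);
print $round_trip eq $composed ? "round trip ok\n" : "round trip differs\n";
```

Both strings render as the same letter, but eq, length and reverse see them as different strings.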

Bad luck: in our case, reverse doesn't work on the normalized (decomposed) form:

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize;
use charnames ':full';
my $str = 'Ä';

sub mydump {
    print map { '\N['. charnames::viacode(ord $_) . ']' } 
                split m//, $_[0];
    print "\n";
}

mydump $str;
mydump NFKD($str);
mydump scalar reverse NFKD($str);

binmode STDOUT, ':utf8';
my $tmp = "\nÄO";
print scalar reverse NFKD($tmp);
__END__
\N[LATIN CAPITAL LETTER A WITH DIAERESIS]
\N[LATIN CAPITAL LETTER A]\N[COMBINING DIAERESIS]
\N[COMBINING DIAERESIS]\N[LATIN CAPITAL LETTER A]
ÖA

You can see that reversing a string moves the combining character(s) to the front, thus they are applied to the wrong base character; "ÄO" reversed becomes "ÖA".

(You wouldn't normalize with NFKD here under normal circumstances; in this example it is done to demonstrate the problem that can arise from such strings.)

It seems the problem could easily be solved by not doing the normalization in the first place, and indeed that works in this example. But there are Unicode graphemes that can't be expressed with a single code point, and if one of your users enters such a grapheme, your application won't work correctly.

So we need a "real" solution. Since perl's reverse doesn't work with graphemes, we use a CPAN module that does:

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize;
use charnames ':full';
use String::Multibyte;
my $str = NFKD "ÄO";
sub mydump {
    print map { '\N['. charnames::viacode(ord $_) . ']' } 
            split m//, $_[0];
    print "\n";
}

my $u = String::Multibyte->new('Grapheme');

mydump $str;
mydump $u->strrev($str);
binmode STDOUT, ':utf8';
print $u->strrev($str), "\n";
__END__
\N[LATIN CAPITAL LETTER A]\N[COMBINING DIAERESIS]\N[LATIN CAPITAL LETTER O]
\N[LATIN CAPITAL LETTER O]\N[LATIN CAPITAL LETTER A]\N[COMBINING DIAERESIS]
OÄ

The String::Multibyte::Grapheme module helps you with reversing the string without tearing the graphemes apart.

(It can also count the number of graphemes, generate substrings with grapheme semantics and more; see String::Multibyte.)
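If you would rather avoid the CPAN dependency, a different technique (not the String::Multibyte approach shown above) is to split the string with the regex escape \X, which matches one grapheme cluster, and reverse the resulting list. This is only a sketch: the helper name reverse_graphemes is made up here, and the exact definition of \X has changed between perl versions, so check perlrebackslash for yours.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: reverse a decoded text string grapheme by grapheme.
sub reverse_graphemes {
    my ($text) = @_;
    # \X matches a base character plus its combining marks as one unit.
    return join '', reverse $text =~ /\X/g;
}

my $str = NFD("\N{U+C4}O");          # "A", COMBINING DIAERESIS, "O"
my $rev = reverse_graphemes($str);   # "O", "A", COMBINING DIAERESIS

binmode STDOUT, ':encoding(UTF-8)';
print "$rev\n";
```

The combining diaeresis stays attached to its "A", so the result displays as "OÄ" just like the strrev output.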

In reply to How to reverse a (Unicode) string by moritz
