http://www.perlmonks.org?node_id=661510

You have written some Perl scripts already, and when somebody asks you how to reverse a string, you'll answer: "That's easy, just call reverse in scalar context".

And of course, you're right - if you're only considering ASCII chars.

But suppose you have a UTF-8 environment:

#!/usr/bin/perl
use strict;
use warnings;
print scalar reverse "\noäu";

The output consists of a "u", two garbage characters, and a newline.

The reason is that "ä", like many other chars, is represented by several bytes in UTF-8, here as 0xC3 0xA4. reverse works on bytes, so it will produce 0xA4 0xC3. That is not legal UTF-8, so the output contains two bytes of garbage.
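
To make the byte sequence visible, here is a minimal sketch that builds the UTF-8 bytes explicitly with the core Encode module and reverses the undecoded byte string. The variable names are mine and only serve the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# "\x{E4}" is the code point of "ä"; encode() returns its UTF-8 byte string.
my $bytes    = encode('UTF-8', "o\x{E4}u");
my $reversed = reverse $bytes;    # scalar context: reverses the bytes

# Render every byte as two hex digits, before and after reversing.
my $before = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
my $after  = join ' ', map { sprintf '%02X', ord($_) } split //, $reversed;

print "$before\n";    # 6F C3 A4 75
print "$after\n";     # 75 A4 C3 6F
```

The pair C3 A4 comes out as A4 C3, which no UTF-8 decoder accepts as a character.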

You can solve this problem by decoding the text strings (read perluniintro and perlunicode for more information):

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':utf8';
print scalar reverse "\noäu";
__END__
uäo

The use utf8; takes care that every string literal in the script is treated as a text string, so reverse (and other functions like uc) will work on codepoint level.
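
The same applies to text that does not come from a string literal. As a rough sketch, assuming the input arrives as UTF-8 bytes (for example from a file or a terminal), decoding it with the core Encode module gives a string on which length, reverse and uc count code points. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "o\xC3\xA4u";                # undecoded UTF-8 bytes
my $text   = decode('UTF-8', $octets);    # decoded text string

my $byte_len = length $octets;    # 4
my $char_len = length $text;      # 3

my $reversed = reverse $text;
my $ok       = $reversed eq "u\x{E4}o" ? 1 : 0;

print "bytes: $byte_len, code points: $char_len, reversed ok: $ok\n";
print uc($text) eq "O\x{C4}U" ? "uc works\n" : "uc failed\n";
```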

While this example worked, it could just as well fail.

The reason is that there are multiple ways to encode some characters.

Consider the letter "Ä", which has the Unicode name LATIN CAPITAL LETTER A WITH DIAERESIS. You could also write it as two code points: LATIN CAPITAL LETTER A, COMBINING DIAERESIS. That is a base character, in this case "A", followed by a combining character, here the COMBINING DIAERESIS.

Converting one representation into the other is called "Unicode normalization".
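
As a small sketch of what normalization does, the core Unicode::Normalize module converts between the two representations: NFD decomposes, NFC composes. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{C4}";      # LATIN CAPITAL LETTER A WITH DIAERESIS
my $decomposed = "A\x{308}";    # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

printf "composed:   %d code point(s)\n", length $composed;      # 1
printf "decomposed: %d code point(s)\n", length $decomposed;    # 2

# The raw strings differ, but each normalizes to the other form.
print "raw strings differ\n"             if $composed ne $decomposed;
print "NFD(composed) eq decomposed\n"    if NFD($composed) eq $decomposed;
print "NFC(decomposed) eq composed\n"    if NFC($decomposed) eq $composed;
```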

Unfortunately, in our case reverse doesn't work for the decomposed normalization form:

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize;
use charnames ':full';

my $str = 'Ä';

sub mydump {
    print map { '\N['. charnames::viacode(ord $_) . ']' } split m//, $_[0];
    print "\n";
}

mydump $str;
mydump NFKD($str);
mydump scalar reverse NFKD($str);

binmode STDOUT, ':utf8';
my $tmp = "\nÄO";
print scalar reverse NFKD($tmp);
__END__
\N[LATIN CAPITAL LETTER A WITH DIAERESIS]
\N[LATIN CAPITAL LETTER A]\N[COMBINING DIAERESIS]
\N[COMBINING DIAERESIS]\N[LATIN CAPITAL LETTER A]
ÖA

You can see that reversing a string moves the combining character(s) to the front, thus they are applied to the wrong base character; "ÄO" reversed becomes "ÖA".

(You wouldn't normalize with NFKD here under normal circumstances; in this example it is done only to demonstrate the problem that can arise from such strings.)

It seems the problem could easily be solved by not doing the normalization in the first place, and indeed that works in this example. But there are Unicode graphemes that can't be expressed with a single code point, and if one of your users enters such a grapheme, your application won't work correctly.
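
A hedged example of such a grapheme: to my knowledge Unicode has no precomposed "q with tilde", so it stays two code points even after composing with NFC, and a plain reverse still tears it apart. The choice of "q" plus COMBINING TILDE is my own assumption for the sake of the sketch.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $a_diaeresis = NFC("A\x{308}");    # composes to the single code point U+00C4
my $q_tilde     = NFC("q\x{303}");    # assumed to have no precomposed form

printf "A + diaeresis after NFC: %d code point(s)\n", length $a_diaeresis;
printf "q + tilde after NFC:     %d code point(s)\n", length $q_tilde;

# Plain reverse works on code points, so the tilde ends up in front of its base.
my $reversed = reverse "a${q_tilde}";
print "combining mark now leads the string\n" if $reversed eq "\x{303}qa";
```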

So we need a "real" solution. Since perl doesn't work with graphemes, we need a CPAN module that does:

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize;
use charnames ':full';
use String::Multibyte;

my $str = NFKD "ÄO";

sub mydump {
    print map { '\N['. charnames::viacode(ord $_) . ']' } split m//, $_[0];
    print "\n";
}

my $u = String::Multibyte->new('Grapheme');

mydump $str;
mydump $u->strrev($str);

binmode STDOUT, ':utf8';
print $u->strrev($str), "\n";
__END__
\N[LATIN CAPITAL LETTER A]\N[COMBINING DIAERESIS]\N[LATIN CAPITAL LETTER O]
\N[LATIN CAPITAL LETTER O]\N[LATIN CAPITAL LETTER A]\N[COMBINING DIAERESIS]
OÄ

The String::Multibyte::Grapheme module helps you with reversing the string without tearing the graphemes apart.

(It can also count the number of graphemes, generate substrings with grapheme semantics and more; see String::Multibyte.)
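
If installing a CPAN module is not an option, counting and slicing by grapheme can be approximated in core Perl with the regex escape \X, which matches one extended grapheme cluster. This is only a sketch and not a replacement for String::Multibyte; the helper name graphemes is hypothetical.

```perl
use strict;
use warnings;

# Return the extended grapheme clusters of a decoded text string as a list.
sub graphemes {
    my ($text) = @_;
    return $text =~ /\X/g;
}

my $str   = "A\x{308}O";    # decomposed "ÄO": three code points
my @parts = graphemes($str);

my $count = scalar @parts;  # number of graphemes
my $first = $parts[0];      # grapheme-level "substring" of length 1

printf "%d code points, %d graphemes\n", length($str), $count;
printf "first grapheme has %d code points\n", length $first;
```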

Re: How to reverse a (Unicode) string
by Juerd (Abbot) on Jan 09, 2008 at 22:03 UTC

    print scalar reverse "\noäu";

    If you entered this using a UTF-8 editor, you forgot to "use utf8;" to notify Perl of this fact.

    You may be dealing with the string "\no\x{C3}\x{A4}u" instead of the intended "\no\x{e4}u"!

    reverse works on bytes

    reverse works on characters. If you have a bytestring, every character represents the equivalent byte. If you have a Unicode text string, reverse properly reverses based on unicode codepoints.

    You can solve this problem by decoding the text strings

    This suggests that decoding is a workaround. It is not, it is something you should always do when dealing with text data!

    The use utf8; takes care that every string literal in the script is treated as a text string

    Perl has no idea, and cannot be told, what kind your strings are: binary or text. Without "use utf8" you don't necessarily have byte strings, but if you have text strings, they're interpreted as iso-8859-1 rather than utf-8. Note that iso-8859-1 is a unicode encoding -- it just doesn't support all of the characters.

    The rest of your post is accurate, but I wanted to respond so that newbies don't get a negative impression of Perl's unicode support from your post. Perl's unicode support is great, but the programmer MUST learn the difference between unicode and utf-8, and the difference between text data and binary data.

      Note that iso-8859-1 is a unicode encoding -- it just doesn't support all of the characters.

      I don't know what you mean by "unicode encoding" (are there encodings that map to non-unicode chars?), but in the Perl context it's worth mentioning that iso-8859-1 strings don't follow unicode semantics by default; they need to be decoded like any other string:

      # this file is stored as latin1
      print "ä" =~ m/\w/ ? "Unicode\n" : "Bytes\n";
      __END__
      Bytes

      Perl's unicode support is great, but the programmer MUST learn the difference between unicode and utf-8, and the difference between text data and binary data.

      Yes, and they have to learn that for any kind of tool that supports Unicode and different encodings.

      And I really like the Perl 6 spec which allows string operations on byte, codepoint and grapheme level ;-)

        I don't know what you mean by "unicode encoding" (are there encodings that map to non-unicode chars?), but in the Perl context it's worth mentioning that iso-8859-1 strings don't follow unicode semantics by default; they need to be decoded like any other string

        It is a unicode encoding, in that after you've decoded the character number, the number maps one-to-one to the Unicode space. Don't forget that UTF-8 is just a way of encoding a sequence of *numbers*.

        That non-SvUTF8-flagged strings get ASCII semantics in some places, is indeed by design, but that wasn't sufficiently thought through IMO. Note that these strings may get unicode semantics in some circumstances, and ascii semantics in others. The ascii semantics are for charclass and upper-/lower case stuff.

        I consider this a bug in Perl. See also Unicode::Semantics, and expect the bug to be fixed in 5.12.
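
        To illustrate the inconsistency, here is a minimal sketch: the same one-character string matches \w or not depending only on its internal representation. utf8::upgrade is a core function; the /u regex modifier is an assumption about your Perl version, since it needs 5.14 or later.

        ```perl
        use strict;
        use warnings;

        my $native   = "\xE4";       # "ä" stored as one byte, no UTF8 flag
        my $upgraded = $native;
        utf8::upgrade($upgraded);    # same character, UTF-8 internal representation

        print "strings compare equal\n" if $native eq $upgraded;

        # Without unicode_strings or /u, the native string gets ASCII semantics.
        printf "native   matches \\w:  %s\n", ($native   =~ /\w/  ? 'yes' : 'no');
        printf "upgraded matches \\w:  %s\n", ($upgraded =~ /\w/  ? 'yes' : 'no');

        # The /u modifier (Perl 5.14+) forces Unicode semantics for the match.
        printf "native   matches \\w/u: %s\n", ($native  =~ /\w/u ? 'yes' : 'no');
        ```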

        And I really like the Perl 6 spec which allows string operations on byte, codepoint and grapheme level ;-)

        Just realise that Unicode strings don't have a byte level :)

Re: How to reverse a (Unicode) string
by Jim (Curate) on Jan 09, 2011 at 22:50 UTC

    Here's a way to reverse a Unicode string using the regular expression character class \X to match Unicode extended grapheme clusters:

    my $edocinU = join '', reverse $Unicode =~ m/\X/g;

    Here's a demonstration using Vietnamese (tiếng Việt) words:

    #!perl
    
    use strict;
    use warnings;
    use utf8;
    
    binmode STDOUT, ':encoding(UTF-8)';
    
    my $Moonshine = "Rượu đế";
    my $enihsnooM = join '', reverse $Moonshine =~ m/\X/g;
    
    print "$Moonshine\n";
    print "$enihsnooM\n";
    
    __END__
    Rượu đế
    ếđ uợưR
    
    LATIN CAPITAL LETTER R
    LATIN SMALL LETTER U
    COMBINING HORN
    LATIN SMALL LETTER O
    COMBINING HORN
    COMBINING DOT BELOW
    LATIN SMALL LETTER U
    SPACE
    LATIN SMALL LETTER D WITH STROKE
    LATIN SMALL LETTER E
    COMBINING CIRCUMFLEX ACCENT
    COMBINING ACUTE ACCENT
    


Re: How to reverse a (Unicode) string
by Jim (Curate) on Jan 11, 2011 at 04:36 UTC

    The documentation of Perl's reverse function states: "In scalar context, [the reverse function] ... returns a string value with all characters in the opposite order." But it doesn't, at least not for a sufficiently modern, multilingual, Unicode-conformant definition of "character." It reverses Unicode code points, not characters in the usual, well-understood sense of the word.

    One or the other is wrong: the behavior of the reverse function or the reverse function's documentation.

    If I understand the design principles of Perl correctly, the reverse function should properly reverse extended grapheme clusters when the thing being reversed is Unicode text (and Perl understands it is Unicode text), and it should reverse bytes otherwise.

      It reverses Unicode code points

      No. It doesn't know anything about Unicode and there's no requirement for the string to be Unicode text.

      not characters in the usual, well-understood sense of the word.

      In my experience, "character" is the constituent element of a string, and never a grapheme except by happenstance. Let's just say there is no such consensus. Regardless, that's the definition used here.

      If I understand the design principles of Perl correctly, the reverse function should properly reverse extended grapheme clusters

      No. reverse provides a vital string operation. It should not assume the string is Unicode text. Reversing text is also a useful function, but it is not provided by reverse.

      Same goes with substr and length.

      when the thing being reversed is Unicode text (and Perl understands it is Unicode text), and it should reverse bytes otherwise.

      Perl doesn't have a means of "understanding a string is Unicode text".

        The documentation of Perl's reverse function states: "In scalar context, [the reverse function] ... returns a string value with all characters in the opposite order."
        No. It doesn't know anything about Unicode and there's no requirement for the string to be Unicode text.

        The documentation is quite clear: In scalar context, the reverse function operates on strings. If the string is Unicode, it reverses Unicode code points. If the string is in some single-byte character encoding such as ISO 8859-1 (Latin 1), then it reverses those characters. It's really very straightforward.

        You're once again trying to make some esoteric point about the distinction between strings and bytes, and what is Unicode and what is not Unicode. But your peculiar, persistent point isn't relevant here.

        Nothing in my post is incorrect or inaccurate, yet the supercilious tone of your response wrongly implies that something is incorrect. The topic of this discussion is Unicode text, so of course I'm talking about Unicode text in it.

        In my experience, "character" is the constituent element of a string, and never a grapheme except by happenstance. Let's just say there is no such consensus.

        What is a character in a language is well-understood and rarely, if ever, subject to debate. In the case of Unicode, "character" is well-defined, too: It's a "grapheme." It's as simple as that. Read the Unicode Standard.

        There are four characters in the word "Café", not five. When you arrange the four characters in the opposite order, you get "éfaC". This fact is the very basis of this tutorial and discussion.
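
        A minimal sketch of that count, assuming the word is stored in decomposed form ("e" followed by COMBINING ACUTE ACCENT); it uses only core modules, and the variable names are illustrative.

        ```perl
        use strict;
        use warnings;
        use Unicode::Normalize qw(NFC NFD);

        my $word = NFD("Caf\x{E9}");    # C, a, f, e, COMBINING ACUTE ACCENT

        my $code_points = length $word;            # 5
        my $graphemes   = () = $word =~ /\X/g;     # 4

        my $by_code_point = reverse $word;                     # accent leaves its base
        my $by_grapheme   = join '', reverse $word =~ /\X/g;   # accent stays with "e"

        print "code points: $code_points, graphemes: $graphemes\n";
        print "grapheme reverse is correct\n" if NFC($by_grapheme)   eq "\x{E9}faC";
        print "code point reverse is not\n"   if NFC($by_code_point) ne "\x{E9}faC";
        ```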

        No. reverse provides a vital string operation. It should not assume the string is Unicode text.

        Your insistence that there's a difference between "string" and "text" may have some strange basis in the arcane details of the internals of Perl, but it has no relevance to this discussion. The whole premise of this discussion is that we're trying to reverse Unicode strings (Unicode text). And the salient point about Perl's reverse function is that it fails to reverse properly an infinite number of possible Unicode strings.

        Reversing text is also a useful function, but it is not provided by reverse.

        This is a bizarre and incomprehensible statement.

        What's the difference between a "string" and "text" to someone writing a Perl program?

        Explain why you think this…

        use utf8;
        
        binmode STDOUT, ':encoding(UTF-8)';
        
        my $Moonshine = "Rượu đế";
        my $enihsnooM = reverse $Moonshine;
        
        print "$Moonshine\n";
        print "$enihsnooM\n";
        

        …should produce different output than this…

        use utf8;
        
        binmode STDOUT, ':encoding(UTF-8)';
        
        my $Moonshine = "Rượu đế";
        my $enihsnooM = join '', reverse $Moonshine =~ m/\X/g;
        
        print "$Moonshine\n";
        print "$enihsnooM\n";
        