http://www.perlmonks.org?node_id=927755

pierswalter has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to print Unicode text in tabulated columns (assuming a fixed width font). To do so, I calculate the length of various strings using Perl's length() function.

This does not work for me for two reasons (the first one being irrelevant in my specific case, I just include it for completeness' sake for the archives):

1.
Some characters, when printed, occupy two character positions (i.e. they are twice as wide as other characters). An example is the Unicode character 'CJK UNIFIED IDEOGRAPH-341F' ("\x{341f}").

length("\x{341f}") returns 1, which is semantically correct.
So the fact that this would break my formatting is due to my simplifying assumption that each character can be printed in a single output character position. Since I'm currently not dealing with such characters, this is fine for me.

2.
For decomposed characters (e.g. the LATIN SMALL LETTER U followed by the COMBINING DIAERESIS, "\x{0075}\x{0308}"), length() returns 2, not 1.
This is not what I expected, because these two Unicode characters are combined to form a single output character (ü).
On the other hand, there are two separate entities in the string, so I understand the logic of length() returning a length of 2.

Am I using the wrong approach? Is there a different function than length() that would return the number of output characters (i.e. 1 in the case of "\x{0075}\x{0308}")?

So far, the only solution I came up with is to convert each string into Unicode normalization form C before calculating its output length, but that seems more complicated than I feel this should be.

Here is an example that demonstrates the problem:

perl -e 'use strict; use warnings; use Unicode::Normalize; binmode(STDOUT, ":utf8"); my $v1="\x{00fc}"; my $v2="\x{0075}\x{0308}"; my $v3=NFC("\x{0075}\x{0308}"); print "$v1, length=" . length($v1) . "\n", "$v2, length=" . length($v2) . "\n", "$v3, length=" . length($v3) . "\n"'
ü, length=1
ü, length=2
ü, length=1

Thanks for your thoughts.

Re: Trying to determine the output length of a Unicode string
by halley (Prior) on Sep 25, 2011 at 16:17 UTC

    I think it's kind of annoying that it took this long, but Perl 5.14 seems to be the answer here.

    From 'perldoc perlunicode':

    Starting in Perl 5.14, Perl-level operations work with characters rather than bytes within the scope of a use feature 'unicode_strings' (or equivalently use 5.012 or higher). (This is not true if bytes have been explicitly requested by use bytes, nor necessarily true for interactions with the platform's operating system.) For earlier Perls, and when unicode_strings is not in effect, Perl provides a fairly safe environment that can handle both types of semantics in programs. For operations where Perl can unambiguously decide that the input data are characters, Perl switches to character semantics. For operations where this determination cannot be made without additional information from the user, Perl decides in favor of compatibility and chooses to use byte semantics.

    {Example cut, because perlmonks replaces japanese characters with entities.}

    --
    [ e d @ h a l l e y . c c ]

      unicode_strings only ensures that Perl uses character semantics instead of byte semantics for all string operations, which is helpful in the face of ambiguity. (See The "Unicode Bug" in perlunicode.) It doesn't alter the behavior of the length function, which measures the length of a Unicode string in code points, not in grapheme clusters (that is, in real characters).
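A minimal sketch of that point, assuming Perl 5.12 or later so that the unicode_strings feature is available. The sub names are made up for illustration. The feature changes how uc() treats a native 8-bit string, but length() still counts code points either way:

```perl
use strict;
use warnings;

# Hypothetical helpers; the names are not from any module.
sub length_plain { my ($s) = @_; return length $s; }
sub uc_plain     { my ($s) = @_; return uc $s; }

{
    use feature 'unicode_strings';    # lexically scoped to this block
    sub length_unicode { my ($s) = @_; return length $s; }
    sub uc_unicode     { my ($s) = @_; return uc $s; }
}

my $decomposed = "\x{0075}\x{0308}";  # u + COMBINING DIAERESIS

# length() counts code points with or without the feature.
printf "plain: %d, unicode_strings: %d\n",
    length_plain($decomposed), length_unicode($decomposed);

# The feature does matter for case mapping of a native 8-bit string.
printf "uc maps \\xE9 to \\xC9 without feature: %s, with feature: %s\n",
    (uc_plain("\xE9")   eq "\xC9" ? 'yes' : 'no'),
    (uc_unicode("\xE9") eq "\xC9" ? 'yes' : 'no');
```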

      There's no built-in function in Perl to measure the length of a Unicode string in grapheme clusters rather than in code points.

      Read chromatic's article titled New Features of Perl 5.14: unicode_strings for a helpful overview of unicode_strings.

Re: Trying to determine the output length of a Unicode string
by ikegami (Patriarch) on Sep 26, 2011 at 00:04 UTC

    To get the "visual size" (my term, don't know if there's an official one) of a string, you need two pieces of information:

    • The number of graphemes.
    • The visual size of each of those graphemes.

    (And that's assuming your input has no control characters such as a newline.)

    The first is actually pretty easy:

    my @graphemes = $text =~ /\X/g;
    my $count = () = $text =~ /\X/g;

    NFC is definitely not the way to go as it doesn't work for every character-mark combination.

    The catch is knowing the width of characters. Some characters are zero-width, and others are double-width. For that, you really need the help of a module. Unicode::GCString is such a module.

    my $size = Unicode::GCString->new($text)->columns();
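If installing a CPAN module is not an option, the two ingredients above can be roughly approximated with core Perl alone. This is only a sketch and not a substitute for Unicode::GCString: approx_columns is a hypothetical name, it assumes a decoded string with no control characters, and it judges each grapheme cluster purely by the East_Asian_Width and General_Category of its first code point.

```perl
use strict;
use warnings;

# Hypothetical helper: approximate the number of terminal columns a
# decoded string occupies in a fixed-width font.
sub approx_columns {
    my ($text) = @_;
    my $columns = 0;
    for my $cluster ($text =~ /\X/g) {       # one grapheme cluster at a time
        my $base = substr $cluster, 0, 1;    # first code point of the cluster
        if ($base =~ /[\p{East_Asian_Width=Wide}\p{East_Asian_Width=Fullwidth}]/) {
            $columns += 2;                   # double-width character
        }
        elsif ($base =~ /[\p{Mn}\p{Me}\p{Cf}]/) {
            $columns += 0;                   # stray mark or format character
        }
        else {
            $columns += 1;                   # everything else: one column
        }
    }
    return $columns;
}

print approx_columns("\x{0075}\x{0308}"), "\n";   # decomposed u-diaeresis
print approx_columns("\x{341f}"), "\n";           # CJK ideograph
```

Ambiguous-width characters and control characters are not handled here, which is part of why a dedicated module is the safer choice.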
Re: Trying to determine the output length of a Unicode string
by Jim (Curate) on Sep 26, 2011 at 05:15 UTC
    Is there a different function than length() that would return the number of output characters (i.e. 1 in the case of "\x{0075}\x{0308}")?

    No, there isn't a built-in function. You must roll your own.

    So far, the only solution I came up with is to convert each string into Unicode normalization form C before calculating its output length, but that seems more complicated than I feel this should be.

    Normalizing to NFC isn't helpful in the general case. It doesn't ensure that every character measures one code point in length, so it can't be used in general to measure length in grapheme clusters. Consider, for example, a lowercase M with both an umlaut and a cedilla…

    #!perl
    use strict;
    use warnings;
    use open qw( :encoding(UTF-8) :std );
    use charnames qw( :full );
    use Unicode::Normalize;

    sub length_in_grapheme_clusters {
        my $length;
        $length++ while $_[0] =~ m/\X/g;
        return $length;
    };

    my $invented_character = "\N{LATIN SMALL LETTER M}"
                           . "\N{COMBINING DIAERESIS}"
                           . "\N{COMBINING CEDILLA}";

    my $invented_character_NFC = NFC($invented_character);

    my $length_of_invented_character_in_code_points
        = length $invented_character;
    my $length_of_invented_character_NFC_in_code_points
        = length $invented_character_NFC;
    my $length_of_invented_character_in_grapheme_clusters
        = length_in_grapheme_clusters($invented_character);
    my $length_of_invented_character_NFC_in_grapheme_clusters
        = length_in_grapheme_clusters($invented_character_NFC);

    print "$invented_character\n";
    print "$length_of_invented_character_in_code_points\n";
    print "$length_of_invented_character_NFC_in_code_points\n";
    print "$length_of_invented_character_in_grapheme_clusters\n";
    print "$length_of_invented_character_NFC_in_grapheme_clusters\n";

    exit 0;

    This prints…

    m̧̈
    3
    3
    1
    1
    

      When I try this script I get

      m
      3
      3
      1
      1

      How can I get this little special m̧̈?

        You need a shell capable of displaying Unicode (UTF-8).
        We need a terminal capable of handling combining marks.
        $ perl -CS -MUnicode::Normalize -E'say NFC("\xE9")'
        é
        $ perl -CS -MUnicode::Normalize -E'say NFD("\xE9")'
        e
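When the terminal cannot be trusted to render combining marks, one way to see what a string really contains is to print its code points instead of the characters. A small sketch using the core Unicode::Normalize module; code_points is a hypothetical helper name:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: list the code points of a string in U+XXXX
# notation, so the result does not depend on terminal rendering.
sub code_points {
    my ($text) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $text;
}

print code_points(NFC("\xE9")), "\n";   # precomposed form
print code_points(NFD("\xE9")), "\n";   # base letter plus combining mark
```

Running the same helper on the output of the m-with-marks script shows whether the combining marks are present in the string even if the terminal draws only a bare m.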
      sub length_in_grapheme_clusters { my $length; $length++ while $_[0] =~ m/\X/g; return $length; }

      As previously mentioned, this can be written as:

      sub length_in_grapheme_clusters { my $length = () = $_[0] =~ /\X/g; return $length; }
      or
      sub length_in_grapheme_clusters { return 0+( () = $_[0] =~ /\X/g ); }
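For readers unfamiliar with the `() =` idiom, here is a commented sketch of how it works. The sub name count_clusters is made up for illustration. It relies on two documented behaviors: a /g match in list context returns all matches, and a list assignment evaluated in scalar context yields the number of elements on its right-hand side.

```perl
use strict;
use warnings;

# Hypothetical helper spelling out the idiom step by step.
sub count_clusters {
    my ($text) = @_;

    # 1. $text =~ /\X/g in list context returns every grapheme cluster.
    # 2. Assigning that list to () discards the clusters themselves.
    # 3. The list assignment, evaluated in scalar context, yields the
    #    number of elements on its right-hand side, i.e. the count.
    my $count = () = $text =~ /\X/g;

    return $count;
}

print count_clusters("m\x{0308}\x{0327}"), "\n";  # m + diaeresis + cedilla
print count_clusters(""), "\n";                   # empty string
```

One practical difference from the while-loop version: for an empty string this returns 0, whereas a counter that is never incremented stays undef.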

      You must roll your own.

      As previously mentioned, he does not need to roll his own as there's already an existing solution. (It was also mentioned that length_in_grapheme_clusters is not sufficient.)

      Update: Fixed length_in_grapheme_clusters so it can be called in list context.

        What "existing solution"? And why isn't length_in_grapheme_clusters() sufficient?

        Once again, it's obvious you're making some point, the thrust of which is undoubtedly that I'm wrong about something I wrote, but you're making it too laconically for me to get it. I have no idea what you're saying.

        By the way, the version of length_in_grapheme_clusters() I used in my Perl script is attributable to Tom Christiansen. I borrowed it from a PerlMonks post of his. To me, it's better because it makes the operation plainly clear. Your version is tricky and obfuscated, and seemingly weirdly dependent on context. To be honest, I don't understand how it works. To learn how it works, I'd have to read the Perl documentation.

Re: Trying to determine the output length of a Unicode string
by ikegami (Patriarch) on Sep 25, 2011 at 22:38 UTC