pierswalter has asked for the wisdom of the Perl Monks concerning the following question:
I'm trying to print Unicode text in tabulated columns (assuming a fixed-width font). To do so, I calculate the length of various strings using Perl's length() function.
This does not work for me, for two reasons. The first is irrelevant in my specific case; I include it only for completeness' sake, for the archives:
1.
Some characters, when printed, occupy two character positions (i.e. they are twice as wide as other characters). An example is the Unicode character 'CJK UNIFIED IDEOGRAPH-341F' ("\x{341f}").
length("\x{341f}") returns 1, which is semantically correct.
So the fact that this would break my formatting is due to my simplifying assumption that each character can be printed in a single output character position. Since I'm currently not dealing with such characters, this is fine for me.
2.
For decomposed characters, e.g. LATIN SMALL LETTER U followed by COMBINING DIAERESIS ("\x{0075}\x{0308}"), length() returns 2, not 1.
This is not what I expected, because these two Unicode characters are combined to form a single output character (ü).
On the other hand, there are two separate entities in the string, so I understand the logic of length() returning a length of 2.
Am I using the wrong approach? Is there a different function than length() that would return the number of output characters (i.e. 1 in the case of "\x{0075}\x{0308}")?
So far, the only solution I have come up with is to convert each string into Unicode Normalization Form C (NFC) before calculating its output length, but that seems more complicated than I feel it should be.
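As a minimal sketch of that workaround: the helper name nfc_length is made up for illustration and is not a built-in. NFC comes from the core Unicode::Normalize module. Note that this only helps where a precomposed character exists; a sequence such as "q\x{0308}" has no precomposed form and still counts as 2.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper (not a built-in): the length of a string
# after composing it into Normalization Form C.
sub nfc_length {
    my ($str) = @_;
    return length(NFC($str));
}

binmode(STDOUT, ':utf8');

# Precomposed ü, decomposed ü, and q + diaeresis (no precomposed form).
for my $s ("\x{00fc}", "\x{0075}\x{0308}", "q\x{0308}") {
    printf "%s, raw length=%d, NFC length=%d\n",
        $s, length($s), nfc_length($s);
}
```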
Here is an example that demonstrates the problem:
perl -e 'use strict; use warnings; use Unicode::Normalize;
  binmode(STDOUT, ":utf8");
  my $v1="\x{00fc}";
  my $v2="\x{0075}\x{0308}";
  my $v3=NFC("\x{0075}\x{0308}");
  print "$v1, length=" . length($v1) . "\n",
        "$v2, length=" . length($v2) . "\n",
        "$v3, length=" . length($v3) . "\n"'
ü, length=1
ü, length=2
ü, length=1
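For comparison, one candidate that avoids normalization is to count matches of \X, which in Perl regular expressions matches one extended grapheme cluster (a base character plus any combining marks). The following is only a sketch under that assumption; cluster_count is a hypothetical name, and it does nothing about the double-width characters from point 1.

```perl
use strict;
use warnings;

# Hypothetical helper: count extended grapheme clusters by
# counting how many times \X matches across the string.
sub cluster_count {
    my ($str) = @_;
    # The empty-list assignment forces list context, so the
    # scalar receives the number of matches.
    my $count = () = $str =~ /\X/g;
    return $count;
}

binmode(STDOUT, ':utf8');

for my $s ("\x{00fc}", "\x{0075}\x{0308}", "q\x{0308}") {
    printf "%s, length=%d, clusters=%d\n",
        $s, length($s), cluster_count($s);
}
```

Unlike the NFC approach, this treats "q\x{0308}" as one unit even though no precomposed form exists for it.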
Thanks for your thoughts.