Re: Trying to determine the output length of a Unicode string

Is there a different function than length() that would return the number of output characters (i.e. 1 in the case of "\x{0075}\x{0308}")?

No, there isn't a built-in function. You must roll your own.

So far, the only solution I came up with is to convert each string into Unicode normalization form C before calculating its output length, but that seems more complicated than I feel this should be.

Normalizing to NFC isn't helpful in the general case. It doesn't ensure that every character measures one code point in length, so it can't be used generally to count grapheme clusters. Consider, for example, a lowercase M with both an umlaut and a cedilla…

#!perl

use strict;
use warnings;

use open qw( :encoding(UTF-8) :std );
use charnames qw( :full );
use Unicode::Normalize;

sub length_in_grapheme_clusters {
    my $length = 0;    # Start at 0 so an empty string yields 0, not undef
    $length++ while $_[0] =~ m/\X/g;
    return $length;
}

my $invented_character
    = "\N{LATIN SMALL LETTER M}"
    . "\N{COMBINING DIAERESIS}"
    . "\N{COMBINING CEDILLA}";

my $invented_character_NFC
    = NFC($invented_character);

my $length_of_invented_character_in_code_points
    = length $invented_character;

my $length_of_invented_character_NFC_in_code_points
    = length $invented_character_NFC;

my $length_of_invented_character_in_grapheme_clusters
    = length_in_grapheme_clusters($invented_character);

my $length_of_invented_character_NFC_in_grapheme_clusters
    = length_in_grapheme_clusters($invented_character_NFC);

print "$invented_character\n";
print "$length_of_invented_character_in_code_points\n";
print "$length_of_invented_character_NFC_in_code_points\n";
print "$length_of_invented_character_in_grapheme_clusters\n";
print "$length_of_invented_character_NFC_in_grapheme_clusters\n";

exit 0;

This prints…

m̧̈
3
3
1
1
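
To make the contrast concrete, here is a minimal sketch (not part of the original script) that compares the question's own example, "\x{0075}\x{0308}", with the invented character above. NFC shortens the first string only because Unicode happens to define a precomposed U+00FC; it defines no precomposed form for the second. The helper name `count_grapheme_clusters` is made up for this illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC );

# Hypothetical helper: count extended grapheme clusters, one \X match each.
sub count_grapheme_clusters {
    my ($string) = @_;
    my $count = 0;
    $count++ while $string =~ m/\X/g;
    return $count;
}

# u + COMBINING DIAERESIS: a precomposed form (U+00FC) exists.
my $u_diaeresis = "\x{0075}\x{0308}";

# m + COMBINING DIAERESIS + COMBINING CEDILLA: no precomposed form exists.
my $m_marks = "\x{006D}\x{0308}\x{0327}";

for my $string ($u_diaeresis, $m_marks) {
    printf "%d code points, %d after NFC, %d grapheme cluster(s)\n",
        length($string),
        length( NFC($string) ),
        count_grapheme_clusters($string);
}

# prints:
# 2 code points, 1 after NFC, 1 grapheme cluster(s)
# 3 code points, 3 after NFC, 1 grapheme cluster(s)
```

Only the `\X` count gives 1 for both strings, which is why the NFC approach works for the question's example but not in general.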

Re^2: Trying to determine the output length of a Unicode string by Anonymous Monk on Sep 26, 2011 at 07:13 UTC
When I try this script, I get `m 3 3 1 1`. How can I get this little special m̧̈?
Re^3: Trying to determine the output length of a Unicode string by Anonymous Monk on Sep 26, 2011 at 07:35 UTC
You need a shell capable of displaying Unicode (UTF-8).
Re^4: Trying to determine the output length of a Unicode string by Anonymous Monk on Sep 26, 2011 at 08:12 UTC
I suppose my shell is capable of displaying Unicode (UTF-8). When I run `perl -CO -E 'say "\x{263a}"'`, I get this output: ☺
Re^5: Trying to determine the output length of a Unicode string by Jim (Curate) on Sep 26, 2011 at 20:07 UTC
Re^6: Trying to determine the output length of a Unicode string by Anonymous Monk on Sep 27, 2011 at 15:38 UTC
Re^3: Trying to determine the output length of a Unicode string by ikegami (Patriarch) on Sep 26, 2011 at 20:05 UTC
We need a terminal capable of handling combining marks.
`$ perl -CS -MUnicode::Normalize -E'say NFC("\xE9")'`
`é`
`$ perl -CS -MUnicode::Normalize -E'say NFD("\xE9")'`
`e`
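
When a terminal's rendering is in doubt, one way to see what the script actually produced is to print the code points instead of the characters. The following is a small sketch along those lines; the helper name `code_points` is invented for this illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC NFD );

# Hypothetical helper: render a string as a list of U+XXXX code points.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $string;
}

print code_points( NFC("\xE9") ), "\n";    # prints U+00E9
print code_points( NFD("\xE9") ), "\n";    # prints U+0065 U+0301
```

If the NFD line shows U+0301 but the terminal displays a bare "e", the combining mark was emitted and the terminal did not draw it.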
Re^2: Trying to determine the output length of a Unicode string by ikegami (Patriarch) on Sep 26, 2011 at 08:08 UTC
`sub length_in_grapheme_clusters { my $length; $length++ while $_[0] =~ m/\X/g; return $length; }`

As previously mentioned, this can be written as:

`sub length_in_grapheme_clusters { my $length = () = $_[0] =~ /\X/g; return $length; }`

or

`sub length_in_grapheme_clusters { return 0+( () = $_[0] =~ /\X/g ); }`

"You must roll your own."

As previously mentioned, he does not need to roll his own as there's already an existing solution. (It was also mentioned that `length_in_grapheme_clusters` is not sufficient.)

Update: Fixed `length_in_grapheme_clusters` so it can be called in list context.
Re^3: Trying to determine the output length of a Unicode string by Jim (Curate) on Sep 26, 2011 at 19:19 UTC
What "existing solution"? And why isn't `length_in_grapheme_clusters()` sufficient? Once again, it's obvious you're making some point, the thrust of which is undoubtedly that I'm wrong about something I wrote, but you're making it too laconically for me to get it. I have no idea what you're saying.

By the way, the version of `length_in_grapheme_clusters()` I used in my Perl script is attributable to Tom Christiansen. I borrowed it from a PerlMonks post of his. To me, it's better because it makes the operation plainly clear. Your version is tricky and obfuscated, and seemingly weirdly dependent on context. To be honest, I don't understand how it works. To learn how it works, I'd have to read the Perl documentation.
Re^4: Trying to determine the output length of a Unicode string by ikegami (Patriarch) on Sep 26, 2011 at 19:37 UTC
"What 'existing solution'?"

See Re: Trying to determine the output length of a Unicode string

"And why isn't length_in_grapheme_clusters() sufficient?"

See Re: Trying to determine the output length of a Unicode string

"I used in my Perl script is attributable to Tom Christiansen."

Then you should find his comments about Text::Wrap, as they are pertinent here. Maybe it was on the Perl5 Porters mailing list (which is archived).

"To be honest, I don't understand how it works"

Most people will say the same about Perl, `map`, etc., but that's a stupid reason not to use Perl, `map`, etc. Especially where performance matters, which is likely for this function.

What I used: `()=` returns the length of the list returned by the expression that follows (when used in scalar context).

How it works: List assignment in scalar context returns the number of elements to which the RHS evaluated.

"Your version is tricky and obfuscated"

It's actually very straightforward. There's nothing hidden, it uses well-known idioms, and it requires only the lowest mental load (you only need to remember one value at a time).

"I'd have to read the Perl documentation."

Really? I use list assignment in scalar context countless times a day. More often than the match operator, I dare say. Your implication that someone needs to read the docs for that, but not for `\X` and capture-less `m/.../g`, is unconvincing.
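
For readers unfamiliar with the idiom under discussion, here is a minimal sketch of list assignment in scalar context applied to `\X`. The helper name `count_grapheme_clusters` and the sample string are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count grapheme clusters with the "() =" idiom.
sub count_grapheme_clusters {
    my ($string) = @_;

    # The match runs in list context and returns every \X match; assigning
    # that list to () in scalar context yields the number of elements.
    my $count = () = $string =~ /\X/g;
    return $count;
}

# "a" + COMBINING DIAERESIS, then "b" and "c": 4 code points, 3 clusters.
my $string = "a\x{0308}bc";

print length($string), "\n";                    # prints 4
print count_grapheme_clusters($string), "\n";   # prints 3
print count_grapheme_clusters(''), "\n";        # prints 0
```

Because the count comes from the list assignment itself, an empty string naturally yields 0 with no separate initialization.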
Re^5: Trying to determine the output length of a Unicode string by Jim (Curate) on Sep 26, 2011 at 20:18 UTC
Re^6: Trying to determine the output length of a Unicode string by ikegami (Patriarch) on Sep 26, 2011 at 20:29 UTC