PerlMonks  

Re: Trying to determine the output length of a Unicode string

by Jim (Curate)
on Sep 26, 2011 at 05:15 UTC ( #927793=note )


in reply to Trying to determine the output length of a Unicode string

Is there a different function than length() that would return the number of output characters (i.e. 1 in the case of "\x{0075}\x{0308}")?

No, there isn't a built-in function. You must roll your own.

So far, the only solution I came up with is to convert each string into Unicode normalization form C before calculating its output length, but that seems more complicated than I feel this should be.

Normalizing to NFC isn't helpful in the general case. It doesn't ensure that every character measures one code point in length, so it can't be used generally to measure grapheme cluster length. Consider, for example, a lowercase M with both an umlaut and a cedilla…

#!perl

use strict;
use warnings;
use open qw( :encoding(UTF-8) :std );
use charnames qw( :full );
use Unicode::Normalize;

sub length_in_grapheme_clusters {
    my $length;
    $length++ while $_[0] =~ m/\X/g;
    return $length;
}

my $invented_character = "\N{LATIN SMALL LETTER M}"
                       . "\N{COMBINING DIAERESIS}"
                       . "\N{COMBINING CEDILLA}";

my $invented_character_NFC = NFC($invented_character);

my $length_of_invented_character_in_code_points
    = length $invented_character;
my $length_of_invented_character_NFC_in_code_points
    = length $invented_character_NFC;
my $length_of_invented_character_in_grapheme_clusters
    = length_in_grapheme_clusters($invented_character);
my $length_of_invented_character_NFC_in_grapheme_clusters
    = length_in_grapheme_clusters($invented_character_NFC);

print "$invented_character\n";
print "$length_of_invented_character_in_code_points\n";
print "$length_of_invented_character_NFC_in_code_points\n";
print "$length_of_invented_character_in_grapheme_clusters\n";
print "$length_of_invented_character_NFC_in_grapheme_clusters\n";

exit 0;

This prints…

m̧̈
3
3
1
1
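For contrast, here is a minimal sketch of why NFC appears to work for the "\x{0075}\x{0308}" case from the question but not for the invented character. It assumes only the core Unicode::Normalize module; the helper name code_points_after_nfc is made up for this illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC );

# U + COMBINING DIAERESIS has a precomposed form (U+00FC),
# so NFC collapses it into a single code point.
my $u_umlaut = "\x{0075}\x{0308}";

# M + COMBINING DIAERESIS + COMBINING CEDILLA has no precomposed form,
# so NFC can reorder the marks but cannot reduce their number.
my $invented = "\x{006D}\x{0308}\x{0327}";

# Hypothetical helper: length in code points after NFC normalization.
sub code_points_after_nfc {
    return length( NFC( $_[0] ) );
}

printf "u + diaeresis: %d -> %d code point(s)\n",
    length($u_umlaut), code_points_after_nfc($u_umlaut);
printf "m + diaeresis + cedilla: %d -> %d code point(s)\n",
    length($invented), code_points_after_nfc($invented);
```

The first string shrinks from 2 code points to 1, while the second stays at 3, which matches the output above.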


Re^2: Trying to determine the output length of a Unicode string
by Anonymous Monk on Sep 26, 2011 at 07:13 UTC

    When I try this script I get

    m
    3
    3
    1
    1

    How can I get this little special m̧̈?

      You need a shell capable of displaying unicode (utf8)

        I suppose my shell is capable of displaying unicode (utf8).

        When I run this

        perl -CO -E 'say "\x{263a}"'

        I get this output

      We need a terminal capable of handling combining marks.
      $ perl -CS -MUnicode::Normalize -E'say NFC("\xE9")'
      é
      $ perl -CS -MUnicode::Normalize -E'say NFD("\xE9")'
      e
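To see what the terminal is actually being asked to render in those two commands, a small sketch like the following lists the code points each normalization form produces. The helper name code_point_labels is made up for this illustration; only the core Unicode::Normalize module is assumed.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC NFD );

# Hypothetical helper: the code points of a string as U+XXXX labels.
sub code_point_labels {
    return map { sprintf 'U+%04X', ord } split //, $_[0];
}

my $e_acute = "\xE9";    # LATIN SMALL LETTER E WITH ACUTE

# NFC keeps the single precomposed code point.
print 'NFC: ', join( ' ', code_point_labels( NFC($e_acute) ) ), "\n";

# NFD splits it into a base letter followed by a combining mark.
print 'NFD: ', join( ' ', code_point_labels( NFD($e_acute) ) ), "\n";
```

The NFD form is a plain "e" followed by U+0301, so a terminal that ignores combining marks shows only the base letter.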
Reaped: Re^2: Trying to determine the output length of a Unicode string
by NodeReaper (Curate) on Sep 26, 2011 at 08:06 UTC
Re^2: Trying to determine the output length of a Unicode string
by ikegami (Pope) on Sep 26, 2011 at 08:08 UTC
    sub length_in_grapheme_clusters { my $length; $length++ while $_[0] =~ m/\X/g; return $length; }

    As previously mentioned, this can be written as:

    sub length_in_grapheme_clusters { my $length = () = $_[0] =~ /\X/g; return $length; }
    or
    sub length_in_grapheme_clusters { return 0+( () = $_[0] =~ /\X/g ); }

    You must roll your own.

    As previously mentioned, he does not need to roll his own as there's already an existing solution. (It was also mentioned that length_in_grapheme_clusters is not sufficient.)

    Update: Fixed length_in_grapheme_clusters so it can be called in list context.

      What "existing solution"? And why isn't length_in_grapheme_clusters() sufficient?

      Once again, it's obvious you're making some point, the thrust of which is undoubtedly that I'm wrong about something I wrote, but you're making it too laconically for me to get it. I have no idea what you're saying.

      By the way, the version of length_in_grapheme_clusters() I used in my Perl script is attributable to Tom Christiansen. I borrowed it from a PerlMonks post of his. To me, it's better because it makes the operation plainly clear. Your version is tricky and obfuscated, and seemingly weirdly dependent on context. To be honest, I don't understand how it works. To learn how it works, I'd have to read the Perl documentation.

        What "existing solution"?

        See Re: Trying to determine the output length of a Unicode string

        And why isn't length_in_grapheme_clusters() sufficient?

        See Re: Trying to determine the output length of a Unicode string

        I used in my Perl script is attributable to Tom Christiansen.

        Then you should find his comments about Text::Wrap as they are pertinent here. Maybe it was on the Perl5 Porters mailing list (which is archived).

        To be honest, I don't understand how it works

        Most people will say the same about Perl, map, etc, but that's a stupid reason not to use Perl, map, etc. Especially where performance matters, which is likely for this function.

        What I used: ()= returns the length of the list returned by the expression that follows (when used in scalar context).

        How it works: List assignment in scalar context returns the number of elements to which the RHS evaluated.
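A minimal sketch of that idiom next to the counting loop, using a made-up sample string and \d+ instead of \X so the result is easy to check by eye:

```perl
use strict;
use warnings;

my $text = 'a1b22c333';    # sample data, invented for this illustration

# In scalar context, a list assignment yields the number of elements
# on its right-hand side, even when the left-hand list is empty.
my $matches = () = $text =~ /\d+/g;

# The step-by-step loop arrives at the same number.
my $counted = 0;
$counted++ while $text =~ /\d+/g;

print "list assignment: $matches, loop: $counted\n";
```

Both variables end up holding 3. The empty parentheses force the match into list context, and the outer scalar assignment then counts what the match returned.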

        Your version is tricky and obfuscated

        It's actually very straightforward. There's nothing hidden, it uses well-known idioms, and it requires only the lowest mental load (you only need to remember one value at a time).

        I'd have to read the Perl documentation.

        Really? I use list assignment in scalar context countless times a day. More often than the match operator, I dare say.

        Your implication that someone needs to read the docs for that, but not for \X and capture-less m/.../g is unconvincing.
