PerlMonks  

The "real length" of UTF8 strings

by Anonymous Monk
on Sep 23, 2008 at 20:04 UTC ( [id://713297]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

When printing a UTF8 string with printf("%s"), it can actually be wider than expected because, for example, Chinese characters are "2 printed chars" wide on my terminal. My problem occurs when mixing letters with Chinese characters in a string: I'm unable to guess the actual width of the string to be printed.

The following code gives us information about how the string is encoded inside Perl:

$ perl -MDevel::Peek -e 'use utf8; my $s="\x{5fcd}\x{65e0}\x{53ef}\x{5fcd}"; Dump($s); print length($s), "\n";'
SV = PV(0x8154b00) at 0x8154720
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x816ff80 "\345\277\215\346\227\240\345\217\257\345\277\215"\0 [UTF8 "\x{5fcd}\x{65e0}\x{53ef}\x{5fcd}"]
  CUR = 12
  LEN = 16
4

Since I know each of my characters is 2 columns wide, I can work out from the length() function that the "real width" is 8 (4 * 2 = 8).

But it doesn't work anymore when I enclose my string in brackets:

$ perl -MDevel::Peek -e 'use utf8; my $s="(\x{5fcd}\x{65e0}\x{53ef}\x{5fcd})"; Dump($s); print length($s), "\n";'
SV = PV(0x8154b00) at 0x8154720
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x816ff80 "(\345\277\215\346\227\240\345\217\257\345\277\215)"\0 [UTF8 "(\x{5fcd}\x{65e0}\x{53ef}\x{5fcd})"]
  CUR = 14
  LEN = 16
6
Now I have 6 characters, but I can't work out that the "real length" is 10 (6 * 2 != 10), and the byte length won't help...

Does anyone have an idea of how to measure these strings?

Replies are listed 'Best First'.
Re: The "real length" of UTF8 strings
by massa (Hermit) on Sep 24, 2008 at 00:16 UTC
    For any UTF8 string, we have four "lengths":
    1. the length in codepoints:
      perl -C63 -MDevel::Peek -Mutf8 -le '$_="(\x{5fcd} Guimarães)"; Dump($_); print length($_); print'
      SV = PV(0x8154b00) at 0x8153bd4
        REFCNT = 1
        FLAGS = (POK,pPOK,UTF8)
        PV = 0x8170460 "(\345\277\215 Guimar\303\243es)"\0 [UTF8 "(\x{5fcd} Guimar\x{e3}es)"]
        CUR = 16
        LEN = 20
      13
      (忍 Guimarães)
    2. the length in codepoints after canonical decomposition (NFD), where the "a" is one codepoint and the combining "~" is another (the grapheme count itself stays at 13):
      perl -C63 -MDevel::Peek -Mutf8 -MUnicode::Normalize -le '$_="(\x{5fcd} Guimarães)"; $_ = NFD $_; Dump($_); print length($_); print'
      SV = PV(0x8154b00) at 0x8153bd4
        REFCNT = 1
        FLAGS = (POK,pPOK,UTF8)
        PV = 0x816feb8 "(\345\277\215 Guimara\314\203es)"\0 [UTF8 "(\x{5fcd} Guimara\x{303}es)"]
        CUR = 17
        LEN = 20
      14
      (忍 Guimarães)
    3. the length in columns of text used (the string has one wide character):
      perl -C63 -MDevel::Peek -Mutf8 -mText::CharWidth=mbswidth -le '$_="(\x{5fcd} Guimarães)"; Dump($_); print mbswidth($_); print'
      SV = PV(0x8154b00) at 0x8153bd4
        REFCNT = 1
        FLAGS = (POK,pPOK,UTF8)
        PV = 0x8170460 "(\345\277\215 Guimar\303\243es)"\0 [UTF8 "(\x{5fcd} Guimar\x{e3}es)"]
        CUR = 16
        LEN = 20
      14
      (忍 Guimarães)
    4. the length in bytes of the string (notice I didn't print the string after the encode):
      perl -C63 -MDevel::Peek -Mutf8 -mEncode=encode_utf8 -le '$_="(\x{5fcd} Guimarães)"; Dump($_); print length(encode_utf8 $_)'
      SV = PV(0x8154b00) at 0x8153bd4
        REFCNT = 1
        FLAGS = (POK,pPOK,UTF8)
        PV = 0x8170460 "(\345\277\215 Guimar\303\243es)"\0 [UTF8 "(\x{5fcd} Guimar\x{e3}es)"]
        CUR = 16
        LEN = 20
      16
    []s, HTH, Massa (κς,πμ,πλ)
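
For anyone who wants the numbers above side by side, here is a minimal sketch using only core modules. The name string_lengths is made up for illustration. It also counts graphemes with \X, which was not in the one-liners above, and leaves the column width to Text::CharWidth as in item 3.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper (not from the thread): several "lengths" of one
# decoded character string.
sub string_lengths {
    my ($s) = @_;
    my $graphemes = () = $s =~ /\X/g;   # count extended grapheme clusters
    return {
        codepoints_nfc => length(NFC($s)),          # composed form
        codepoints_nfd => length(NFD($s)),          # decomposed form
        graphemes      => $graphemes,
        utf8_bytes     => length(encode_utf8($s)),  # bytes once encoded
    };
}

my $len = string_lengths("(\x{5fcd} Guimar\x{e3}es)");
printf "%-15s %d\n", $_, $len->{$_} for sort keys %$len;
```

For the sample string this reports 13 codepoints composed, 14 decomposed, 13 graphemes and 16 bytes, matching the dumps above.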

      Thank you all for your help solving this problem. The Text::CharWidth module is definitely what I need and probably the quickest solution to implement.

      Thx ^_^*
Re: The "real length" of UTF8 strings
by moritz (Cardinal) on Sep 23, 2008 at 20:30 UTC
    As a first approximation you can loop over the characters, and add 0 for combining characters, 1 for "normal" ones and 2 for characters in the Han script.
    sub visual_length {
        my $s = shift;
        my $l = 0;
        while ($s =~ m/(.)/g) {
            my $c = $1;
            if ($c =~ m/\p{M}/) {
                # combining mark: do nothing
            } elsif ($c =~ m/\p{Han}/) {
                $l += 2;    # Han characters take two columns
            } else {
                $l++;
            }
        }
        return $l;
    }

    That could use much more tweaking, but maybe it's a start for you.
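
One possible tweak, offered as a sketch rather than a tested solution: ask for the Unicode East_Asian_Width property instead of the Han script, so that kana, Hangul and fullwidth forms are also counted as two columns. The names visual_width and pad_right are made up for illustration, and the code assumes a reasonably recent Perl and a terminal that follows East Asian Width.

```perl
use strict;
use warnings;

# visual_width is a hypothetical name, not a function from any module.
# Assumption: the terminal draws East Asian Wide/Fullwidth characters in two
# columns and combining marks in zero, as most UTF-8 terminals do.
sub visual_width {
    my ($s) = @_;
    my $width = 0;
    for my $c (split //, $s) {
        if ($c =~ /[\p{Mn}\p{Me}]/) {
            next;                       # combining mark: no column of its own
        }
        elsif ($c =~ /[\p{East_Asian_Width=Wide}\p{East_Asian_Width=Fullwidth}]/) {
            $width += 2;
        }
        else {
            $width += 1;                # everything else, control characters included
        }
    }
    return $width;
}

# Pad with spaces up to $columns terminal columns (also a made-up helper).
sub pad_right {
    my ($s, $columns) = @_;
    my $missing = $columns - visual_width($s);
    return $missing > 0 ? $s . (' ' x $missing) : $s;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $s ("abc", "(\x{5fcd}\x{65e0}\x{53ef}\x{5fcd})") {
    print '|', pad_right($s, 12), "|\n";
}
```

With this, visual_width gives 10 for the bracketed string from the question, and pad_right does the padding that printf("%-12s") gets wrong because printf counts characters rather than columns.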

      Sure, but the Han script is probably about 40000 characters big: no way to write a list by hand. I'll try to get more info about its UTF8 code range and if "one char visual length" and "two chars visual length" are not mixed together, that should be good :)

        Sure, but the Han script is probably about 40000 characters big: no way to write a list by hand.

        That's why my example queries each character for the Unicode property \p{Han}, i.e. whether the character belongs to that script.

        For a better description of Unicode properties and script blocks in Regexes I recommend "Mastering Regular Expressions" by Jeffrey Friedl, pages 121ff.

Re: The "real length" of UTF8 strings
by gone2015 (Deacon) on Sep 23, 2008 at 20:30 UTC

    Can you use a regex to identify the characters which are double width? Something like:

    print xlen("(\x{5fcd}\x{65e0}\x{53ef}\x{5fcd})"), "\n";

    sub xlen {
        my ($s) = @_;
        my $l = length($s);
        # one extra column for every character in the double-width range
        while ($s =~ m/[\x{5000}-\x{6FFF}]/g) { $l++; }
        return $l;
    }
    perhaps ?

    Or:

    print ylen("(\x{5fcd}\x{65e0}\x{53ef}\x{5fcd})"), "\n";

    sub ylen {
        my ($s) = @_;
        # tr/// takes no [...] class syntax, so the range is written bare
        return length($s) + ($s =~ tr/\x{5000}-\x{6FFF}//);
    }
    which avoids running a while loop and may or may not be faster.
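
A side note on tr///, as a small sketch: unlike a regex character class, its search list treats square brackets as ordinary characters, so a range written with brackets would also count any literal "[" or "]" in the string. The variable names below are only for illustration.

```perl
use strict;
use warnings;

my $s = "[\x{5fcd}]";    # one Han character between literal square brackets

# In a regex, [...] delimits a character class: only the Han character matches.
my $regex_hits = () = $s =~ /[\x{5000}-\x{6FFF}]/g;

# In tr///, '[' and ']' are part of the search list and get counted too.
my $tr_with_brackets = ($s =~ tr/[\x{5000}-\x{6FFF}]//);

# Without the brackets, tr/// counts just the range.
my $tr_range_only = ($s =~ tr/\x{5000}-\x{6FFF}//);

print "$regex_hits $tr_with_brackets $tr_range_only\n";    # prints "1 3 1"
```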

Re: The "real length" of UTF8 strings
by JavaFan (Canon) on Sep 23, 2008 at 23:21 UTC
    You might want to look at your OS C library, and see what it provides - it may have a function you can call with XS or Inline::C.

    The following code ought to work on my (Linux) system, except that it's thinking the characters given in the example aren't printable - and are hence given a length of -1. My manual says "The behaviour of wcwidth depends on the LC_CTYPE category of the current locale", but gives no hint on what to set it to.

    $ cat ./uu
    #!/usr/bin/perl

    use 5.010;
    use strict;
    use warnings;

    use Inline 'C';

    my $s0 = "Hello, world";
    my $s1 = "\x{5fcd}\x{65e0}\x{53ef}\x{5fcd}";
    my $s2 = "($s1)";

    my $l0 = w_length ($s0);
    my $l1 = w_length ($s1);
    my $l2 = w_length ($s2);

    say "$l0: $s0";
    say "$l1: $s1";
    say "$l2: $s2";

    __END__
    __C__

    #include <wchar.h>

    int w_length(char* str) {
        int i;
        int length;
        char c;

        i = 0;
        length = 0;
        /* note: each c is a single byte of the string, not a decoded wide character */
        while(c = str[i++]) {
            int l;
            l = wcwidth(c);
            length += l > 0 ? l : 0;
        }
        return length;
    }
    $ LC_CTYPE=en_US.UTF-8 perl -CO ./uu
    12: Hello, world
    0: 忍无可忍
    2: (忍无可忍)

    So there's something missing in my solution. I'm far from a Unicode expert, let alone an expert on the library provided on my system, but it may be something you can use as a start.

Re: The "real length" of UTF8 strings
by betterworld (Curate) on Sep 23, 2008 at 20:21 UTC

    length returns the number of characters. To get the length in bytes, you have to convert the string into a given encoding:

    my $s="\x{5fcd}\x{65e0}\x{53ef}\x{5fcd}";
    use Encode;
    print length encode("utf8", $s), "\n"; # 12

    Since Unicode strings are stored in utf8 internally, you can use a number of hacks to avoid the explicit re-encoding:

    print do {use bytes; length($s)}, "\n"; # 12 (see perldoc -f length)
    # or
    utf8::encode($s); # resets the utf8 flag
    print length($s), "\n"; # 12
      In general the correlation between byte length in UTF-8 and visual character width is only a weak one.

      For example, many European non-ASCII characters are printed with a visual width of only one character, but encoded as two bytes. The Euro sign is even encoded as three bytes, and still printed with a width of only one.
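
The byte counts are easy to check with a short sketch. The helper name utf8_byte_length is made up, and the column widths in the comments are what a typical UTF-8 terminal shows, not something the code measures.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: number of bytes one codepoint takes in UTF-8.
sub utf8_byte_length {
    my ($codepoint) = @_;
    return length(encode_utf8(chr($codepoint)));
}

# A (1 column), e-acute (1 column), Euro sign (1 column), Han character (2 columns)
for my $cp (0x41, 0xE9, 0x20AC, 0x5FCD) {
    printf "U+%04X: %d byte(s) in UTF-8\n", $cp, utf8_byte_length($cp);
}
```

This prints 1, 2, 3 and 3 bytes respectively, so the Euro sign and the Han character have the same byte length but different widths.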

        You're right. I misunderstood the question and did not realize that we are looking for the "visual length".

        Well, I think it depends on the font then, doesn't it?

        You're right, the correlation between visual length and the actual number of characters is weak, and maybe only font dependent...

        But when I print these strings with both Chinese and ASCII characters using the mysql command (SELECT * FROM...), it prints a table on stdout and is absolutely not confused by the visual and character lengths.

        That's why I think the solution must exist ^_^*

Re: The "real length" of UTF8 strings
by ikegami (Patriarch) on Sep 24, 2008 at 08:17 UTC

    There's still confusion as to what you mean by visual length. Are you talking about the width of the characters in pixels? That will vary by font and device.

    Update: Oops, you seem to have found what you wanted in a post I somehow missed.

Re: The "real length" of UTF8 strings
by redgreen (Priest) on Sep 23, 2008 at 21:02 UTC
    use bytes; (but it was already mentioned)...

Node Type: perlquestion [id://713297]
Approved by moritz