PerlMonks  

Re: The "real length" of UTF8 strings

by betterworld (Deacon)
on Sep 23, 2008 at 20:21 UTC (#713298)


in reply to The "real length" of UTF8 strings

length returns the number of characters. To get the length in bytes, you have to convert the string into a given encoding:

    my $s = "\x{5fcd}\x{65e0}\x{53ef}\x{5fcd}";
    use Encode;
    print length(encode("utf8", $s)), "\n"; # 12

Since Perl stores Unicode strings (those with the utf8 flag set) as UTF-8 internally, you can use a number of hacks to avoid the explicit re-encoding. Note that utf8::encode modifies $s in place:

    print do {use bytes; length($s)}, "\n"; # 12 (see perldoc -f length)
    # or
    utf8::encode($s); # resets the utf8 flag
    print length($s), "\n"; # 12
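The difference between the character count and the byte count can be checked side by side. The following is only a minimal sketch: the helper name `byte_length` is made up for illustration and is not from the post. It works on a copy, so the original string keeps its character semantics, and it also shows a caveat of the `use bytes` hack: it reports the size of Perl's internal representation, which is UTF-8 only when the string carries the utf8 flag.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not from the post): length in bytes of the
# UTF-8 encoding of a string. It encodes a copy, so the caller's
# string is left untouched.
sub byte_length {
    my ($str) = @_;
    return length(encode("UTF-8", $str));
}

my $s = "\x{5fcd}\x{65e0}\x{53ef}\x{5fcd}";
print length($s), "\n";       # 4  (characters)
print byte_length($s), "\n";  # 12 (bytes in UTF-8)
print length($s), "\n";       # 4  ($s itself was not modified)

# Caveat: a string with no character above 0xFF is normally stored
# as one byte per character, so "use bytes" does not give the UTF-8 size.
my $t = "\x{e9}";                          # e with acute accent
print byte_length($t), "\n";               # 2
print do { use bytes; length($t) }, "\n";  # 1
```

For that reason the explicit `encode` call is probably the safer choice whenever the string may contain only Latin-1 characters.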


Re^2: The "real length" of UTF8 strings
by moritz (Cardinal) on Sep 23, 2008 at 20:35 UTC
    In general the correlation between byte length in UTF-8 and visual character width is only a weak one.

    For example, many European non-ASCII characters are printed with a visual width of only one character, but encoded as two bytes. The Euro sign is even encoded as three bytes, and still printed with a width of only one.
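The byte counts mentioned above are easy to verify. This is a small sketch, not part of the original reply; the helper `utf8_bytes` and the sample list are assumptions chosen for illustration. It prints code points rather than the characters themselves so that the output does not depend on the terminal encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of bytes a string occupies in UTF-8.
sub utf8_bytes {
    my ($str) = @_;
    return length(encode("UTF-8", $str));
}

# Sample characters picked for illustration.
my @samples = (
    [ "a",        "LATIN SMALL LETTER A" ],             # 1 byte
    [ "\x{e9}",   "LATIN SMALL LETTER E WITH ACUTE" ],  # 2 bytes
    [ "\x{20ac}", "EURO SIGN" ],                        # 3 bytes
    [ "\x{5fcd}", "CJK ideograph from the question" ],  # 3 bytes
);

for my $sample (@samples) {
    my ($char, $name) = @$sample;
    printf "U+%04X %-34s %d byte(s)\n", ord($char), $name, utf8_bytes($char);
}
```

The Euro sign and the CJK ideograph both take three bytes, although only the ideograph is usually drawn two columns wide, which is why the byte count is a poor proxy for visual width.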

      You're right. I misunderstood the question and did not realize that we are looking for the "visual length".

      Well, I think it depends on the font then, doesn't it?

        Yes, betterworld, I should have chosen a better title like "the visual length of UTF8 strings" instead of the "real length", which leads to confusion.

        Well, I think it depends on the font then, doesn't it?

        Maybe, but there are double-width characters that even fixed-width fonts display with the width of two normal characters (like the ones in the OP).

      You're right, the correlation between visual length and the actual number of characters is weak, and maybe only font dependent...

      But when I print these strings, which contain both Chinese and ASCII characters, using the mysql command (SELECT * FROM...), it prints a table on stdout and does not confuse the visual and character lengths at all.

      That's why I think the solution must exist ^_^*
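One possible approach, offered only as a sketch, is to use the Unicode East_Asian_Width property that Perl's regex engine exposes: characters classified as Wide or Fullwidth are counted as two columns, everything else as one. The helper names `visual_width` and `pad_to` are hypothetical, and nothing here is claimed to be what the mysql client actually does. Characters with ambiguous width are counted as one column, so the result can still differ from what a given terminal or font shows.

```perl
use strict;
use warnings;

# Hypothetical helper: approximate number of terminal columns.
# Assumption: Wide/Fullwidth characters take 2 columns, nonspacing
# combining marks take 0, everything else takes 1.
sub visual_width {
    my ($str) = @_;
    my $width = 0;
    for my $ch (split //, $str) {
        next if $ch =~ /\p{Mn}/;    # nonspacing combining mark
        my $is_wide = $ch =~ /[\p{East_Asian_Width=Wide}\p{East_Asian_Width=Fullwidth}]/;
        $width += $is_wide ? 2 : 1;
    }
    return $width;
}

# Hypothetical helper: right-pad with spaces up to a column count.
sub pad_to {
    my ($str, $columns) = @_;
    my $missing = $columns - visual_width($str);
    $missing = 0 if $missing < 0;
    return $str . (" " x $missing);
}

binmode(STDOUT, ":encoding(UTF-8)");

my @rows = ("abc", "\x{5fcd}\x{65e0}\x{53ef}\x{5fcd}", "ab\x{5fcd}");
print "| ", pad_to($_, 10), " |\n" for @rows;
```

With a fixed-width font that draws CJK characters two columns wide, the right-hand borders of the three rows should line up.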
