didess has asked for the wisdom of the Perl Monks concerning the following question:
I really need your help with a hint:
I need the length of strings which are UTF-8 encoded.
I'm using Perl from ActiveState (5.8.8.822 on Windows XP, or 5.10.0.1004 on SUSE Linux 11.0), with the same "false" result on both.
(I think I'm wrong, but I can't find where.)
For example:
length('a') returns 1: that's right.
length('à') returns 2: that's wrong (I think it should return 1, because it's one CHARACTER long, although 2 BYTES because of UTF-8).
When looking at the documentation (perlfunc / length), I read that:
" Note the characters: if the EXPR is in Unicode, you will get the number of characters, not the number of bytes.
To get the length of the internal string in bytes, use bytes::length(EXPR), see the bytes manpage.
Note that the internal encoding is variable, and the number of bytes usually meaningless.
To get the number of bytes that the string would have when encoded as UTF-8, use length(Encoding::encode_utf8(EXPR))"
It seems to do the opposite!
Any idea?
Thanks in advance
Didier
Re: function length() in UTF-8 context
by ikegami (Patriarch) on Nov 19, 2008 at 00:01 UTC
Strings are treated as iso-latin-1 by default. Decode them if they're not.
Way of decoding:
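A minimal sketch of one common way to decode, assuming the core Encode module and input bytes that are known to be UTF-8. The variable names are illustrative and are not taken from the post:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The two UTF-8 bytes that encode the single character U+00E0 (à).
my $bytes = "\xC3\xA0";

# Undecoded, Perl treats each byte as one Latin-1 character.
my $len_before = length($bytes);

# Decoded, the same data is one character.
my $chars     = decode('UTF-8', $bytes);
my $len_after = length($chars);

# Encoding again gives the number of bytes the text occupies as UTF-8.
my $len_utf8 = length(encode('UTF-8', $chars));

print "before decode: $len_before, after decode: $len_after, as UTF-8: $len_utf8\n";
```

In this sketch length() reports 2 before decoding and 1 afterwards, which is the behaviour the original question was looking for.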
by didess (Sexton) on Nov 19, 2008 at 07:59 UTC
I implemented it successfully and got the right length. I also used it to solve the same problem, which I hit afterwards with the uc() function (decode($s), then uc($s), then encode($s)). So it seems that we have to "come back" to ISO 8859-1 before we can safely use the "string functions", and afterwards go back to UTF-8? The documentation doesn't seem to say that, but it works fine. Do you know whether this is "normal and stable" or an "intermediate state"? Or should I just patch some of my scripts while waiting, or implement something harder on my own? Anyhow, thanks a lot! Didier
by ikegami (Patriarch) on Nov 19, 2008 at 08:11 UTC
I don't know what you mean by this. You decode on input, and encode on output. Leave it decoded for the duration of your program. Unless you're working with binary file formats, you shouldn't have to call encode or decode. Just using an appropriate :encoding when opening files should take care of text files.
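The "decode on input, encode on output" advice can be sketched with :encoding layers on the filehandles. This is only an illustration: the scratch file from File::Temp and the variable names are assumptions made for the demo, not part of the reply.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, used only for this demonstration.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Encode on output: the layer turns characters into UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} "\x{E0}";               # one character, U+00E0 (à)
close($out) or die "Cannot close $path: $!";

my $bytes_on_disk = -s $path;        # size of the file in bytes

# Decode on input: the layer turns UTF-8 bytes back into characters.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
my $text = <$in>;
close($in);

my $char_count = length($text);      # counts characters, not bytes
my $upper      = uc($text);          # case mapping sees the decoded character

printf "bytes on disk: %d, characters: %d, uc() gives U+%04X\n",
    $bytes_on_disk, $char_count, ord($upper);
```

With the layers in place there is no explicit call to encode or decode anywhere, and both length() and uc() operate on characters for the whole time the text is inside the program.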
by didess (Sexton) on Nov 19, 2008 at 10:47 UTC
by ikegami (Patriarch) on Nov 19, 2008 at 10:57 UTC
by ikegami (Patriarch) on Nov 19, 2008 at 11:10 UTC
by gone2015 (Deacon) on Nov 20, 2008 at 01:10 UTC
It's tricky, isn't it. Two things make this tricky: first, it can be difficult to see where bytes and characters are being encoded/decoded, and second, Perl handles "old-fashioned" strings of bytes as well as "new-fangled" wide characters.

NB: the following applies to Perl v5.8.8 or later. There is, apparently, an EBCDIC Perl, which I know nothing about.

The actors in this drama are:

The following attempts to show, one step at a time, how the input, output and handling of wide character strings can be achieved, starting with byte strings and working up. I confess this turned out to be a lot longer than I had expected/intended. I hope somebody will find it useful.

Starting at the top, Perl handles two forms of string. The first form is, fundamentally, an array of unsigned char (in C terms) -- byte form. Where you ask Perl to interpret these as characters, it will assume at least ASCII (eg uc($s)) -- possibly more if you use deeper magic. The second form is, in effect, an array of wide character ordinals in the range 0..(2^32)-1 (roughly speaking). When you ask Perl to interpret these as characters, it will assume Unicode. Attached to every string is a "mode-bit", telling Perl whether it contains bytes or wide characters.
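The "mode-bit" can be inspected with utf8::is_utf8(). A minimal sketch, assuming no "use utf8" in effect and literals written with escapes only; the variable names are illustrative:

```perl
use strict;
use warnings;

# No "use utf8" here: both literals use escapes, so the source is pure ASCII.
my $byte_str = "W\xF6rld";       # every ordinal fits in one byte
my $wide_str = "W\x{14D}rld";    # U+014D does not fit in one byte

for my $s ($byte_str, $wide_str) {
    printf "length=%d mode=%s\n",
        length($s),
        utf8::is_utf8($s) ? 'wide' : 'byte';
}
```

Both strings are five characters long; only the second one is stored in wide form, because it contains an ordinal above 0xFF.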
The following will illustrate some of what is going on as Perl, Perl IO and the OS conspire together:

    use strict ;
    use warnings ;

    my $UT8 = 0 ;   # Option to "use utf8" -- may be commented out, below
    my $ECO = 0 ;   # Option to set ":encoding(utf8)" on STDOUT
    my $ECI = 0 ;   # Option to set ":encoding(utf8)" on STDIN
    my $UPG = 0 ;   # Option to "utf8::upgrade($s)" in show()
    my $UC  = 0 ;   # Option to "uc($s)" in show()

    foreach (@ARGV) {
      if    ($_ eq 'eco') { $ECO = 1 ; }
      elsif ($_ eq 'eci') { $ECI = 1 ; }
      elsif ($_ eq 'upg') { $UPG = 1 ; }
      elsif ($_ eq 'uc')  { $UC  = 1 ; }
      else                { die "$_ not known" ; } ;
    } ;

    #use utf8 ; $UT8 = 1 ;

    if ($ECO) { binmode STDOUT, ":encoding(utf8)" ; } ;
    if ($ECI) { binmode STDIN , ":encoding(utf8)" ; } ;

    my $m = '' ;
    if ($UT8) { $m .= " use utf8 ;" ; } ;
    if ($ECO) { $m .= " STDOUT :encoding(utf8) ;" ; } ;
    if ($ECI) { $m .= " STDIN :encoding(utf8) ;" ; } ;
    if ($UPG) { $m .= " utf8::upgrade(\$s) ;" ; } ;
    if ($UC)  { $m .= " uc(\$s) ;" ; } ;
    if ($m)   { print "Options:$m\n" ; } ;

    show(1, "Hello World") ;
    show(2, "Hello W\xF6rld") ;
    show(3, "Hello W\x{14D}rld") ;
    show(4, "Hello Wörld") ;        # ord('ö') is 0xF6
    show(5, "Hello Wōrld") ;        # ord('ō') is 0x14D

    my $n = 'a' ;
    while (my $s = <STDIN>) {
      chomp($s) ;
      show($n++, $s) ;
    } ;

    sub show {
      my ($n, $s) = @_ ;
      if ($UPG) { utf8::upgrade($s) ; } ;
      if ($UC)  { $s = uc($s) ; } ;
      print " $n: '$s' is ", utf8::is_utf8($s) ? "'wide'" : "'byte'",
            " len=", length($s), " \"", peek($s). "\"\n" ;
    } ;

    sub peek {      # Peek at the byte contents of the given string
      my ($s) = @_ ;
      use bytes ;   # Forces the unpack to show the byte contents of $s
      return join ('', map { ($_ >= 0x20) && ($_ < 0x7F) ? chr($_) : sprintf('\\x%02X', $_) }
                         unpack('C*', $s)) ;
    } ;

and the file read via STDIN is

    Hello World
    Hello Wörld
    Hello Wōrld

With all the options off, on my machine the code above gave:

    1: 'Hello World' is 'byte' len=11 "Hello World"
    2: 'Hello W▒rld' is 'byte' len=11 "Hello W\xF6rld"
    Wide character in print at x.pl line 47.
    3: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
    4: 'Hello Wörld' is 'byte' len=12 "Hello W\xC3\xB6rld"
    5: 'Hello Wōrld' is 'byte' len=12 "Hello W\xC5\x8Drld"
    a: 'Hello World' is 'byte' len=11 "Hello World"
    b: 'Hello Wörld' is 'byte' len=12 "Hello W\xC3\xB6rld"
    c: 'Hello Wōrld' is 'byte' len=12 "Hello W\xC5\x8Drld"

YMMV. This shows a number of things:
Before we start to worry about encoding and decoding and other magic, let's see how a character function works with what we have so far. Turning on the "uc($s)" option gives:

    Options: uc($s) ;
    1: 'HELLO WORLD' is 'byte' len=11 "HELLO WORLD"
    2: 'HELLO W▒RLD' is 'byte' len=11 "HELLO W\xF6RLD"
    Wide character in print at x.pl line 47.
    3: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
    4: 'HELLO WöRLD' is 'byte' len=12 "HELLO W\xC3\xB6RLD"
    5: 'HELLO WōRLD' is 'byte' len=12 "HELLO W\xC5\x8DRLD"
    a: 'HELLO WORLD' is 'byte' len=11 "HELLO WORLD"
    b: 'HELLO WöRLD' is 'byte' len=12 "HELLO W\xC3\xB6RLD"
    c: 'HELLO WōRLD' is 'byte' len=12 "HELLO W\xC5\x8DRLD"

showing that uc() is only interested in ASCII (on my machine, anyway) in the byte mode strings, but has done a wonderful Unicode job on the wide string.

Now, how do we convert these strings from byte to wide mode, so that the contents will be treated as Unicode characters? One way is to use utf8::upgrade($s), which gives:

    Options: utf8::upgrade($s) ; uc($s) ;
    1: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
    2: 'HELLO W▒RLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
    Wide character in print at x.pl line 47.
    3: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
    4: 'HELLO WöRLD' is 'wide' len=12 "HELLO W\xC3\x83\xC2\xB6RLD"
    5: 'HELLO WōRLD' is 'wide' len=12 "HELLO W\xC3\x85\xC2\x8DRLD"
    a: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
    b: 'HELLO WöRLD' is 'wide' len=12 "HELLO W\xC3\x83\xC2\xB6RLD"
    c: 'HELLO WōRLD' is 'wide' len=12 "HELLO W\xC3\x85\xC2\x8DRLD"

which is really weird and unpleasant, but it does illustrate one piece of magic, concerning characters "\x80".."\xFF". In a byte mode string these characters appear exactly so. In a wide mode string these characters appear in their UTF-8 encoding, "\xC2\x80".."\xC3\xBF". Both forms, however, map to the same range of character ordinals, \x80..\xFF. So, Perl maps between the character encodings. Hence:
So, now we look at what use utf8 does. If we turn that on, and turn off the other options, the code gives:

    Options: use utf8 ;
    1: 'Hello World' is 'byte' len=11 "Hello World"
    2: 'Hello W▒rld' is 'byte' len=11 "Hello W\xF6rld"
    Wide character in print at x.pl line 47.
    3: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
    4: 'Hello W▒rld' is 'wide' len=11 "Hello W\xC3\xB6rld"
    Wide character in print at x.pl line 47.
    5: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
    a: 'Hello World' is 'byte' len=11 "Hello World"
    b: 'Hello Wörld' is 'byte' len=12 "Hello W\xC3\xB6rld"
    c: 'Hello Wōrld' is 'byte' len=12 "Hello W\xC5\x8Drld"

we can see an improvement. What use utf8 does is to tell Perl to expect the source to be in UTF-8 form, and in particular to interpret UTF-8 sequences in literal strings. As shown above, strings (4) and (5) are now wide mode. So far, so good.

To sort out the printing we have to tell Perl to encode stuff as UTF-8, and we can do that on a per filehandle basis. Turning on the "STDOUT :encoding(utf8)" option, the code gives:

    Options: use utf8 ; STDOUT :encoding(utf8) ;
    1: 'Hello World' is 'byte' len=11 "Hello World"
    2: 'Hello Wörld' is 'byte' len=11 "Hello W\xF6rld"
    3: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
    4: 'Hello Wörld' is 'wide' len=11 "Hello W\xC3\xB6rld"
    5: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
    a: 'Hello World' is 'byte' len=11 "Hello World"
    b: 'Hello Wörld' is 'byte' len=12 "Hello W\xC3\xB6rld"
    c: 'Hello WÅ�rld' is 'byte' len=12 "Hello W\xC5\x8Drld"

Nearly there, most characters are showing as desired -- no more "splodges" -- and if you examine the output you will see that we're getting UTF-8 everywhere. But the lines input from STDIN now look odd.

Note especially string (2). This is in byte form. When printed with ':encoding(utf8)', byte strings are implicitly "upgraded" to UTF-8 -- remembering that this is implicitly treating the byte values as being in Latin-1 character set.
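The implicit "upgrade" on output can be seen in isolation by printing to an in-memory filehandle and looking at the bytes that arrive. This is a sketch under stated assumptions: the helper names bytes_written() and hex_dump() are made up for the illustration and do not appear in the program above.

```perl
use strict;
use warnings;

# Return the bytes that print() emits through an :encoding(UTF-8) layer.
sub bytes_written {
    my ($string) = @_;
    my $buffer = '';
    open(my $fh, '>:encoding(UTF-8)', \$buffer) or die "Cannot open buffer: $!";
    print {$fh} $string;
    close($fh) or die "Cannot close buffer: $!";
    return $buffer;
}

# Render bytes as hex escapes so the result is readable on any terminal.
sub hex_dump {
    my ($bytes) = @_;
    return join '', map { sprintf '\\x%02X', $_ } unpack('C*', $bytes);
}

my $latin1_literal = "\xF6";        # byte string holding Latin-1 ö, like string (2)
my $undecoded_utf8 = "\xC3\xB6";    # UTF-8 bytes for ö that were never decoded

my $dump_literal   = hex_dump(bytes_written($latin1_literal));
my $dump_undecoded = hex_dump(bytes_written($undecoded_utf8));

print "literal:   $dump_literal\n";
print "undecoded: $dump_undecoded\n";
```

The Latin-1 literal comes out as the correct two-byte UTF-8 sequence, while the undecoded UTF-8 bytes are each treated as a Latin-1 character and encoded a second time, giving four bytes.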
The lines (b) & (c) are also in byte form, and those two are implicitly "upgraded" to UTF-8.

Turning on the "STDIN :encoding(utf8)" option, the code gives:

    Options: use utf8 ; STDOUT :encoding(utf8) ; STDIN :encoding(utf8) ;
    1: 'Hello World' is 'byte' len=11 "Hello World"
    2: 'Hello Wörld' is 'byte' len=11 "Hello W\xF6rld"
    3: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
    4: 'Hello Wörld' is 'wide' len=11 "Hello W\xC3\xB6rld"
    5: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
    a: 'Hello World' is 'wide' len=11 "Hello World"
    b: 'Hello Wörld' is 'wide' len=11 "Hello W\xC3\xB6rld"
    c: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"

At last! If we tell Perl that literal strings contain UTF-8 (use utf8), that the input is UTF-8 encoded (:encoding(utf8)) and the output should also be UTF-8 encoded -- then, surprise (!), we appear to get what we want.

Are we there yet? Not quite. If we now try our uc($s) option, we get:

    Options: use utf8 ; STDOUT :encoding(utf8) ; STDIN :encoding(utf8) ; uc($s) ;
    1: 'HELLO WORLD' is 'byte' len=11 "HELLO WORLD"
    2: 'HELLO WöRLD' is 'byte' len=11 "HELLO W\xF6RLD"
    3: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
    4: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
    5: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
    a: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
    b: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
    c: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"

...everything is fine, except for the byte form string -- which came from the literal with the "\xF6" escape.
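The behaviour of that one string can be reproduced in a few lines. A minimal sketch, assuming Perl's default handling of byte-mode strings (no "use locale" and no "unicode_strings" feature in effect); the variable names are illustrative:

```perl
use strict;
use warnings;

# Deliberately no "use feature 'unicode_strings'" and no "use locale":
# the sketch relies on the default handling of byte-mode strings.
my $byte_str = "W\xF6rld";          # byte mode, as produced by the "\xF6" escape

my $wide_str = $byte_str;
utf8::upgrade($wide_str);           # same characters, now stored in wide mode

my $uc_byte = uc($byte_str);        # only the ASCII letters change
my $uc_wide = uc($wide_str);        # U+00F6 is mapped to U+00D6 as well

printf "same text before uc(): %s\n", $byte_str eq $wide_str ? 'yes' : 'no';
printf "byte mode: 2nd char after uc() = 0x%02X\n", ord(substr($uc_byte, 1, 1));
printf "wide mode: 2nd char after uc() = 0x%02X\n", ord(substr($uc_wide, 1, 1));
```

The two strings compare equal before uc() is applied, yet only the wide-mode copy has its second character upper-cased, which is why the storage form matters for character functions.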
We still need the "utf8::upgrade($s)" option, and so:

    Options: use utf8 ; STDOUT :encoding(utf8) ; STDIN :encoding(utf8) ; utf8::upgrade($s) ; uc($s) ;
    1: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
    2: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
    3: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
    4: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
    5: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
    a: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
    b: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
    c: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"

and at last we've succeeded in:
Literal strings such as (2) above are a problem. If at some point they are not "upgraded", then they will not operate as intended. Some things (eg print) will implicitly "upgrade" byte strings. If a wide string and a byte string are processed together, the byte string will be implicitly upgraded. At other times a byte string will be processed as is. You may choose to always "upgrade" such strings as soon as they are assigned, or always "upgrade" everything before running some wide character operation on it. The following may, or may not, appeal:
This area is complicated. Partly because wide character handling is inherently complicated, and generally unfamiliar. Also partly because Perl has to avoid breaking (too much) stuff which depends on the old, familiar byte string handling. In the above I have tried to show how the various parts hang together and which part does what. The conclusion is that if you ensure that all sources and sinks of strings are correctly set to expect UTF-8, then things are pretty straightforward. Along the way, however, I have tried to show why all those are necessary. For more on how to set filehandles to handle UTF-8 (and other) encodings, see open and binmode.
Re: function length() in UTF-8 context
by salva (Canon) on Nov 19, 2008 at 07:36 UTC