PerlMonks  

function length() in UTF-8 context

by didess (Sexton)
on Nov 18, 2008 at 23:52 UTC (#724430, perlquestion)
didess has asked for the wisdom of the Perl Monks concerning the following question:

Hi all!
I really need your help with a hint:
I need the length of strings which are UTF-8 encoded.
I'm using Perl from ActiveState (5.8.8.822 on Windows XP, or 5.10.0.1004 on SUSE Linux 11.0), and both give the same "false" result.
(I think I'm doing something wrong, but I can't find where.)

For example:
length('a') returns 1: that's right.
length('à') returns 2: that's wrong (I think it should return 1, because it is one CHARACTER long, although it takes 2 BYTES in UTF-8).

Looking at the documentation (perlfunc / length), I read:
" Note the characters: if the EXPR is in Unicode, you will get the number of characters, not the number of bytes.
To get the length of the internal string in bytes, use bytes::length(EXPR), see the bytes manpage.
Note that the internal encoding is variable, and the number of bytes usually meaningless.
To get the number of bytes that the string would have when encoded as UTF-8, use length(Encode::encode_utf8(EXPR))"

It seems to do the opposite!
Any idea?
Thanks in advance
Didier

Re: function length() in UTF-8 context
by ikegami (Pope) on Nov 19, 2008 at 00:01 UTC

    Strings are treated as iso-latin-1 by default. Decode them if they're not.

    my $s = "\xC2\x85";
    print(length($s), "\n");   # 2 iso-latin-1 chars
    utf8::decode($s);
    print(length($s), "\n");   # 1 character

    Ways of decoding:

    • use open ':std', ':locale';
    • use open IO => ':encoding(UTF-8)';
    • open(my $fh, '<:encoding(UTF-8)', ...)
    • utf8::decode($s);
    • use Encode qw( decode ); $s = decode('UTF-8', $s);
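
    As a rough sketch of the last approach (not part of the original reply), the snippet below decodes the two UTF-8 bytes of 'à' and compares the lengths before and after. The variable names $bytes and $chars are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

my $bytes = "\xC3\xA0";                 # the two UTF-8 bytes of 'à'
my $chars = decode('UTF-8', $bytes);    # one character, U+00E0

print length($bytes), "\n";                     # 2 (bytes, read as iso-latin-1 chars)
print length($chars), "\n";                     # 1 (character)
print length(encode('UTF-8', $chars)), "\n";    # 2 (bytes again, once re-encoded)
```

    The third line is the "number of bytes when encoded as UTF-8" that the perlfunc entry for length() talks about.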
      Thank you for this answer.
      I implemented it successfully and now get the correct length.
      I also used it to solve the same problem, which I hit afterwards with the "uc()" function (decode($s), then uc($s), then encode($s)):
      so, it seems that we have to "come back" to iso8859-1 previously to use safely "string functions", and afterwards go back to UTF-8?
      The documentation doesn't seem to say so, but it works fine.
      Do you know whether this is "normal and stable" or an "intermediate state"?
      Or: should I just patch some of my scripts for now, or implement something more involved on my own?
      Anyhow, thanks a lot!
      Didier

        so, it seems that we have to "come back" to iso8859-1 previously to use safely "string functions"

        I don't know what you mean by this.

        You decode on input, and encode on output. Leave it decoded for the duration of your program. Unless you're working with binary file formats, you shouldn't have to call encode or decode. Just using an appropriate :encoding when opening files should take care of text files.
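
        A minimal sketch of that "decode on input, encode on output" pattern, using in-memory filehandles so it runs without any real files. The variable names are illustrative only; in a real script the two open() calls would name actual files.

```perl
use strict;
use warnings;

my $raw = "w\xC3\xB6rld\n";              # UTF-8 bytes for "wörld\n"

# Input: the :encoding layer decodes as the line is read.
open(my $in, '<:encoding(UTF-8)', \$raw) or die "open in: $!";
my $line = <$in>;
close($in);
chomp($line);

# Middle: work on characters, with no encode/decode calls.
my $upper = uc($line);

# Output: the :encoding layer encodes as the string is printed.
my $out_bytes = '';
open(my $out, '>:encoding(UTF-8)', \$out_bytes) or die "open out: $!";
print {$out} $upper;
close($out);

printf "%d chars in, %d bytes out\n", length($line), length($out_bytes);
# prints "5 chars in, 6 bytes out"
```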

        It's tricky, isn't it.

        Two things make this tricky: first, it can be difficult to see where bytes and characters are being encoded or decoded; second, Perl handles "old-fashioned" strings of bytes as well as "new-fangled" wide characters.

        NB: the following applies to Perl v5.8.8 or later. There is, apparently, an EBCDIC Perl, which I know nothing about.

        The actors in this drama are:

        1. Perl's string handling -- including encode/decode.

        2. the Perl IO Layers.

        3. the OS and display devices -- ie everything else !

        The following attempts to show, one step at a time, how the input, output and handling of wide character strings can be achieved, starting with byte strings and working up.

        I confess this turned out to be a lot longer than I had expected/intended. I hope somebody will find it useful.


        Starting at the top, Perl handles two forms of string. The first form is, fundamentally, an array of unsigned char (in C terms) -- byte form. Where you ask Perl to interpret these as characters, it will assume at least ASCII (eg uc($s)) -- possibly more if you use deeper magic. The second form is, in effect, an array of wide character ordinals in the range 0..(2^32)-1 (roughly speaking). When you ask Perl to interpret these as characters, it will assume Unicode. Attached to every string is a "mode-bit", telling Perl whether it contains bytes or wide characters.
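
        As a small sketch of the two forms and the "mode-bit", the snippet below asks utf8::is_utf8() which form Perl chose for two literals. The helper mode() is a made-up name for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report which form Perl is using for a string.
sub mode { return utf8::is_utf8($_[0]) ? 'wide' : 'byte' }

my $byte = "W\xF6rld";      # every ordinal <= 0xFF, so held in byte form
my $wide = "W\x{14D}rld";   # an ordinal > 0xFF forces wide form

printf "%s len=%d\n", mode($byte), length($byte);   # byte len=5
printf "%s len=%d\n", mode($wide), length($wide);   # wide len=5
```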

        The following will illustrate some of what is going on as Perl, Perl IO and the OS conspire together:

          use strict ;
          use warnings ;
        
          my $UT8 = 0 ;   # Option to "use utf8" -- may be commented out, below
          my $ECO = 0 ;   # Option to set ":encoding(utf8)" on STDOUT
          my $ECI = 0 ;   # Option to set ":encoding(utf8)" on STDIN
          my $UPG = 0 ;   # Option to "utf8::upgrade($s)" in show()
          my $UC  = 0 ;   # Option to "uc($s)" in show()
        
          foreach (@ARGV) {
            if    ($_ eq 'eco') { $ECO = 1 ; }
            elsif ($_ eq 'eci') { $ECI = 1 ; }
            elsif ($_ eq 'upg') { $UPG = 1 ; }
            elsif ($_ eq 'uc')  { $UC  = 1 ; }
            else  { die "$_ not known" ;     } ;
          } ;
        
          #use utf8 ;  $UT8 = 1 ;
        
          if ($ECO) { binmode STDOUT, ":encoding(utf8)" ; } ;
          if ($ECI) { binmode STDIN , ":encoding(utf8)" ; } ;
        
          my $m = '' ;
          if ($UT8)  { $m .= " use utf8 ;" ;                } ;
          if ($ECO)  { $m .= " STDOUT :encoding(utf8) ;" ;  } ;
          if ($ECI)  { $m .= " STDIN :encoding(utf8) ;" ;   } ;
          if ($UPG)  { $m .= " utf8::upgrade(\$s) ;" ;      } ;
          if ($UC)   { $m .= " uc(\$s) ;" ;                 } ;
          if ($m)    { print "Options:$m\n" ;               } ;
        
          show(1, "Hello World") ;
          show(2, "Hello W\xF6rld") ;
          show(3, "Hello W\x{14D}rld") ;
          show(4, "Hello Wörld") ;        # ord('ö') is 0xF6
          show(5, "Hello Wōrld") ;        # ord('ō') is 0x14D
        
          my $n = 'a' ;
          while (my $s = <STDIN>) {
            chomp($s) ;
            show($n++, $s) ;
          } ;
        
          sub show {
            my ($n, $s) = @_ ;
            if ($UPG) { utf8::upgrade($s) ; } ;
            if ($UC)  { $s = uc($s) ;       } ;
            print " $n: '$s' is ", utf8::is_utf8($s) ? "'wide'" : "'byte'",
                    " len=", length($s), " \"", peek($s). "\"\n" ;
          } ;
        
          sub peek {      # Peek at the byte contents of the given string
            my ($s) = @_ ;
        
            use bytes ;   # Forces the unpack to show the byte contents of $s
        
            return join ('', map { ($_ >= 0x20) && ($_ < 0x7F) ? chr($_) : sprintf('\\x%02X', $_)
                                 } unpack('C*', $s)) ;
          } ;
        
        and the file read via STDIN is
          Hello World
          Hello Wörld
          Hello Wōrld
        


        With all the options off, on my machine the code above gave:

         1: 'Hello World' is 'byte' len=11 "Hello World"
         2: 'Hello W▒rld' is 'byte' len=11 "Hello W\xF6rld"
        Wide character in print at x.pl line 47.
         3: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
         4: 'Hello Wörld' is 'byte' len=12 "Hello W\xC3\xB6rld"
         5: 'Hello Wōrld' is 'byte' len=12 "Hello W\xC5\x8Drld"
         a: 'Hello World' is 'byte' len=11 "Hello World"
         b: 'Hello Wörld' is 'byte' len=12 "Hello W\xC3\xB6rld"
         c: 'Hello Wōrld' is 'byte' len=12 "Hello W\xC5\x8Drld"
        
        YMMV. This shows a number of things:
        • that literal strings are byte strings, unless a character with ordinal > 0xFF forces otherwise.

          (There does not appear to be a way to declare that a given literal string should be held as a "wide" string.)

        • if we looked at the source with our favourite hex editor, we'd see that the characters 'ö' and 'ō' appear in the file as the UTF-8 sequences "\xC3\xB6" and "\xC5\x8D" respectively. Perl happily accepts those byte values, and the string's mode, length and contents reflect that.

        • the string that is "wide" is actually held with characters > 0x7F encoded as UTF-8. This is key: "wide" character strings are actually held in UTF-8 encoded form.

          What I'm calling the "wide" mode is known to Perl as "utf8". This is why.

        • the three lines (a to c) read from STDIN are, by default, also byte strings.

        • when printing these strings, Perl is simply sending the byte contents to the OS. If the output is sent to some device, and the device expects UTF-8, then we will see what we expect (otherwise, not). (In fact, the Perl IO layer is "downgrading" the wide string, but that's covered below.)

          I was lucky. String "2" contains a \xF6 byte, which is not valid UTF-8, so is rendered as a "splodge" (▒), on my machine.

        • you do not need to use utf8 to make the various utf8::xxxx() functions available. In fact, you must not use utf8 for that purpose -- because use utf8 means something else, see below.
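
        The point that a "wide" string is held internally as UTF-8 can be checked without the peek() routine. This is only a sketch: it loads the bytes module with an empty import list, so that bytes::length() is available without switching the whole scope to byte semantics.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling byte semantics

my $wide = "Hello W\x{14D}rld";

print length($wide), "\n";          # 11 characters
print bytes::length($wide), "\n";   # 12 bytes held internally (UTF-8)
```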


        Before we start to worry about encoding and decoding and other magic, let's see how a character function works with what we have so far. Turning on the "uc($s)" option gives:

        Options: uc($s) ;
         1: 'HELLO WORLD' is 'byte' len=11 "HELLO WORLD"
         2: 'HELLO W▒RLD' is 'byte' len=11 "HELLO W\xF6RLD"
        Wide character in print at x.pl line 47.
         3: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
         4: 'HELLO WöRLD' is 'byte' len=12 "HELLO W\xC3\xB6RLD"
         5: 'HELLO WōRLD' is 'byte' len=12 "HELLO W\xC5\x8DRLD"
         a: 'HELLO WORLD' is 'byte' len=11 "HELLO WORLD"
         b: 'HELLO WöRLD' is 'byte' len=12 "HELLO W\xC3\xB6RLD"
         c: 'HELLO WōRLD' is 'byte' len=12 "HELLO W\xC5\x8DRLD"
        
        showing that uc() is only interested in ASCII (on my machine, anyway) in the byte mode strings, but has done a wonderful Unicode job on the wide string.
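
        The same behaviour in isolation, as a sketch. It assumes no locale pragma is in effect and that the unicode_strings feature of later Perls is not enabled; either could change how uc() treats the byte string.

```perl
use strict;
use warnings;

my $byte = "w\xF6rld";       # byte form: 0xF6 is 'ö' only if read as Latin-1
my $wide = "w\x{14D}rld";    # wide form: U+014D is 'ō'

my $uc_byte = uc($byte);
my $uc_wide = uc($wide);

# %vX prints each character's ordinal in hex, joined by dots.
printf "%vX\n", $uc_byte;    # 57.F6.52.4C.44   (0xF6 left untouched)
printf "%vX\n", $uc_wide;    # 57.14C.52.4C.44  (U+014D became U+014C)
```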

        Now, how do we convert these strings from byte to wide mode, so that the contents will be treated as Unicode characters ? One way is to use utf8::upgrade($s), which gives:

        Options: utf8::upgrade($s) ; uc($s) ;
         1: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
         2: 'HELLO W▒RLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
        Wide character in print at x.pl line 47.
         3: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
         4: 'HELLO WöRLD' is 'wide' len=12 "HELLO W\xC3\x83\xC2\xB6RLD"
         5: 'HELLO WōRLD' is 'wide' len=12 "HELLO W\xC3\x85\xC2\x8DRLD"
         a: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
         b: 'HELLO WöRLD' is 'wide' len=12 "HELLO W\xC3\x83\xC2\xB6RLD"
         c: 'HELLO WōRLD' is 'wide' len=12 "HELLO W\xC3\x85\xC2\x8DRLD"
        
        which is really weird and unpleasant, but it does illustrate one piece of magic concerning characters "\x80".."\xFF". In a byte mode string these characters appear exactly so. In a wide mode string these characters appear in their UTF-8 encoding, "\xC2\x80".."\xC3\xBF". Both forms, however, map to the same range of character ordinals, \x80..\xFF. So, Perl maps between the two representations. Hence:
        • utf8::upgrade($s) when given a byte string, takes all byte values 0x80..0xFF and replaces them by the equivalent two byte UTF-8 sequence, and then sets the string to be "wide". utf8::upgrade($s) does nothing when given a string which is already "wide".

          For string (2) this has translated "\xF6" to "\xC3\xB6" which uc($s) recognises, and upshifts to "\xC3\x96". This makes perfect sense if the string was in the Latin-1 character set -- so character 'ö' has been upshifted to 'Ö'. When printed it still shows as a "splodge", but see below.

          Strings (4) & (5) and lines (b) & (c) from STDIN actually contain UTF-8 sequences, but utf8::upgrade($s) doesn't know that, and it translates each byte to its equivalent UTF-8 sequence. Not quite what one had in mind! It so happens that each resulting character is either already uppercase or has no uppercase form, so uc($s) leaves it unchanged.

        • print is still expecting to output bytes. When given a wide mode string, it is happy to take UTF-8 sequences and translate them back to single bytes, except where the UTF-8 sequence gives an ordinal > 0xFF.

          This is why string (2) still shows a "splodge". The IO Layers see "\xC3\x96" in a wide string, and translate that back down to the single byte "\xD6", which isn't a valid UTF-8 sequence, so the device shows a "splodge".

          With string (3), print sees "\xC5\x8C" which cannot be translated to a single byte, so we get a warning message and the bytes "\xC5\x8C" are output unchanged.

          With string (4), print sees "\xC3\x83\xC2\xB6" which translate back to "\xC3\xB6", which is what we started with ! Similarly string (5) and lines (b) & (c).

        The message is: as far as Perl is concerned byte string characters "\x80".."\xFF" are interchangeable with wide string characters with UTF-8 sequences "\xC2\x80".."\xC3\xBF". The utf8::upgrade() and utf8::downgrade() functions do this. It also happens when Perl implicitly forces a string to wide or to byte -- as we've seen print do.
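
        A short sketch of that interchangeability, and of how utf8::upgrade() differs from utf8::decode() when the bytes are already UTF-8. All variable names are illustrative.

```perl
use strict;
use warnings;
use bytes ();

my $latin1 = "\xF6";             # one byte, ordinal 0xF6
my $copy   = $latin1;
utf8::upgrade($copy);            # now stored as "\xC3\xB6" and flagged wide

print length($copy), " ", bytes::length($copy), "\n";    # 1 2
print $copy eq $latin1 ? "same\n" : "different\n";       # same

utf8::downgrade($copy);          # back to the single byte 0xF6
print bytes::length($copy), "\n";                        # 1

my $utf8_bytes = "\xC3\xB6";     # the UTF-8 encoding of 'ö', held as two bytes
my $upgraded   = $utf8_bytes;
my $decoded    = $utf8_bytes;
utf8::upgrade($upgraded);        # treats each byte as a Latin-1 character
utf8::decode($decoded);          # interprets the bytes as UTF-8

print length($upgraded), " ", length($decoded), "\n";    # 2 1
```

        So upgrade and downgrade change only the internal storage, never the characters, whereas decode changes what the string means.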


        So, now we look at what use utf8 does. If we turn that on, and turn off the other options, the code gives:

        Options: use utf8 ;
         1: 'Hello World' is 'byte' len=11 "Hello World"
         2: 'Hello W▒rld' is 'byte' len=11 "Hello W\xF6rld"
        Wide character in print at x.pl line 47.
         3: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
         4: 'Hello W▒rld' is 'wide' len=11 "Hello W\xC3\xB6rld"
        Wide character in print at x.pl line 47.
         5: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
         a: 'Hello World' is 'byte' len=11 "Hello World"
         b: 'Hello Wörld' is 'byte' len=12 "Hello W\xC3\xB6rld"
         c: 'Hello Wōrld' is 'byte' len=12 "Hello W\xC5\x8Drld"
        
        we can see an improvement.

        What use utf8 does is to tell Perl to expect the source to be in UTF-8 form, and in particular to interpret UTF-8 sequences in literal strings. As shown above, strings (4) and (5) are now wide mode.
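
        Because use utf8 is lexically scoped, its effect on literals can be seen side by side. This sketch assumes the script file itself is saved as UTF-8; $with and $without are illustrative names.

```perl
use strict;
use warnings;

my ($with, $without);
{
    use utf8;                 # literals in this block are decoded from UTF-8
    $with = length("Wörld");
}
{
    no utf8;                  # literals in this block are taken byte by byte
    $without = length("Wörld");
}
print "$with $without\n";     # 5 6, assuming this file is saved as UTF-8
```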

        So far, so good. To sort out the printing we have to tell Perl to encode stuff as UTF-8, and we can do that on a per filehandle basis. Turning on the "STDOUT :encoding(utf8)" option, the code gives:

        Options: use utf8 ; STDOUT :encoding(utf8) ;
         1: 'Hello World' is 'byte' len=11 "Hello World"
         2: 'Hello Wörld' is 'byte' len=11 "Hello W\xF6rld"
         3: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
         4: 'Hello Wörld' is 'wide' len=11 "Hello W\xC3\xB6rld"
         5: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
         a: 'Hello World' is 'byte' len=11 "Hello World"
         b: 'Hello Wörld' is 'byte' len=12 "Hello W\xC3\xB6rld"
         c: 'Hello W�rld' is 'byte' len=12 "Hello W\xC5\x8Drld"
        
        Nearly there, most characters are showing as desired -- no more "splodges" -- and if you examine the output you will see that we're getting UTF-8 everywhere. But the lines input from STDIN now look odd.

        Note especially string (2). This is in byte form. When printed with ':encoding(utf8)', byte strings are implicitly "upgraded" to UTF-8 -- remembering that this is implicitly treating the byte values as being in Latin-1 character set.

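        The lines (b) & (c) are also in byte form, so they too are implicitly "upgraded": each byte of their UTF-8 sequences is treated as a Latin-1 character and encoded again, which is why they now look odd.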
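
        The implicit "upgrade" on output can be observed by printing through an :encoding(UTF-8) layer into memory and looking at the bytes that come out. This is a sketch only; encoded_output() is a made-up helper.

```perl
use strict;
use warnings;

# Hypothetical helper: print a string through an :encoding(UTF-8) layer
# into an in-memory file and return the bytes that were written.
sub encoded_output {
    my ($s) = @_;
    my $buf = '';
    open(my $fh, '>:encoding(UTF-8)', \$buf) or die "open: $!";
    print {$fh} $s;
    close($fh);
    return $buf;
}

my $latin1_byte = encoded_output("\xF6");       # byte 0xF6, read as 'ö'
my $utf8_bytes  = encoded_output("\xC3\xB6");   # 'ö' already encoded as UTF-8

printf "%vX\n", $latin1_byte;   # C3.B6        (correct UTF-8 for 'ö')
printf "%vX\n", $utf8_bytes;    # C3.83.C2.B6  (encoded a second time)
```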

        Turning on the "STDIN :encoding(utf8)" option, the code gives:

        Options: use utf8 ; STDOUT :encoding(utf8) ; STDIN :encoding(utf8) ;
         1: 'Hello World' is 'byte' len=11 "Hello World"
         2: 'Hello Wörld' is 'byte' len=11 "Hello W\xF6rld"
         3: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
         4: 'Hello Wörld' is 'wide' len=11 "Hello W\xC3\xB6rld"
         5: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
         a: 'Hello World' is 'wide' len=11 "Hello World"
         b: 'Hello Wörld' is 'wide' len=11 "Hello W\xC3\xB6rld"
         c: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
        
        At last ! If we tell Perl that literal strings contain UTF-8 (use utf8), that the input is UTF-8 encoded (:encoding(utf8)) and the output should also be UTF-8 encoded -- then, surprise (!), we appear to get what we want.


        Are we there yet ? Not quite. If we now try our uc($s) option, we get:

          
        Options: use utf8 ; STDOUT :encoding(utf8) ; STDIN :encoding(utf8) ; uc($s) ;
         1: 'HELLO WORLD' is 'byte' len=11 "HELLO WORLD"
         2: 'HELLO WöRLD' is 'byte' len=11 "HELLO W\xF6RLD"
         3: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
         4: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
         5: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
         a: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
         b: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
         c: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
        
        ...everything is fine, except for string (2), the byte form string that came from the literal with the "\xF6" escape. We still need the "utf8::upgrade($s)" option, and so:
         
        Options: use utf8 ; STDOUT :encoding(utf8) ; STDIN :encoding(utf8) ; utf8::upgrade($s) ; uc($s) ;
         1: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
         2: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
         3: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
         4: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
         5: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
         a: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
         b: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
         c: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
        
        and at last we've succeeded in:
        • reading UTF-8 encoded strings, both from literals and from a file.

        • outputting UTF-8 encoded strings.

        • getting a representative character function to work on all these different strings.

        Literal strings such as (2) above are a problem. If at some point they are not "upgraded", then they will not operate as intended. Some things (eg print) will implicitly "upgrade" byte strings. If a wide string and a byte string are processed together, the byte string will be implicitly upgraded. At other times a byte string will be processed as is. You may choose to always "upgrade" such strings as soon as they are assigned, or always "upgrade" everything before running some wide character operation on it.

        The following may, or may not, appeal:

        sub qu ($) { utf8::upgrade(my $s = $_[0]) ; return $s ; } ;
        show(6, qu "Hello W\xF6rld") ;


        This area is complicated, partly because wide character handling is inherently complicated and generally unfamiliar, and partly because Perl has to avoid breaking (too much) stuff which depends on the old, familiar byte string handling.

        In the above I have tried to show how the various parts hang together and which part does what. The conclusion is that if you ensure that all sources and sinks of strings are correctly set to expect UTF-8, then things are pretty straightforward. Along the way, however, I have tried to show why all those are necessary.

        For more on how to set filehandles to handle UTF-8 (and other) encodings, see open and binmode.

Re: function length() in UTF-8 context
by salva (Monsignor) on Nov 19, 2008 at 07:36 UTC
    To debug scripts that work with Unicode data, the Dump function from Devel::Peek will let you see how perl represents strings internally.
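
    A minimal sketch of that suggestion: Dump() writes its report to STDERR, and the exact layout varies between Perl versions, so no output is reproduced here. The two strings are illustrative.

```perl
use strict;
use warnings;
use Devel::Peek qw( Dump );

my $byte = "W\xF6rld";      # held as bytes
my $wide = "W\x{14D}rld";   # held as UTF-8 internally

Dump($byte);   # the FLAGS line should not mention UTF8
Dump($wide);   # the FLAGS line should include UTF8
```

    The PV line of each dump shows the stored bytes, which is the same information the peek() routine earlier in the thread extracts with use bytes.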

Node Type: perlquestion [id://724430]
Approved by ikegami