[SOLVED] Unicode strings internals

by vsespb (Chaplain)
on May 10, 2013 at 13:16 UTC (#1032959=perlquestion)

vsespb has asked for the wisdom of the Perl Monks concerning the following question:

I have two example files

poc1.pl:

    use strict;
    use warnings;
    use Devel::Peek;
    use Encode;
    use utf8;

    my $string = "123\x{444}\x{444}\x{444}\x{444}";

    binmode STDOUT, ":utf8";
    Dump $string;
    print "UTF IS ON\n" if utf8::is_utf8($string);
    print "LENGTH DIFFERS\n" if length($string) != bytes::length($string);

    open my $f, ">", "test1";
    binmode $f;
    syswrite $f, $string or die;
    print "ALL OK\n";
    __END__
    SV = PV(0x258cb78) at 0x25b7bb0
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x25aad60 "123\321\204\321\204\321\204\321\204"\0 [UTF8 "123\x{444}\x{444}\x{444}\x{444}"]
      CUR = 11
      LEN = 16
    UTF IS ON
    LENGTH DIFFERS
    Wide character in syswrite at poc1.pl line 16.

poc2.pl:

    use strict;
    use warnings;
    use Devel::Peek;
    use Encode;
    use utf8;

    my $utfstring = "123 \x{439}\x{439}\x{439}\x{439}";
    my ($ascii_but_utf, undef) = split ' ', $utfstring;
    my $bytestring = encode ("UTF-8", "\x{444}\x{444}\x{444}\x{444}");
    my $mixedstring = "$ascii_but_utf$bytestring"; # simulate The Unicode Bug here

    binmode STDOUT, ":utf8";
    Dump $mixedstring;
    print "UTF IS ON\n" if utf8::is_utf8($mixedstring);
    print "LENGTH DIFFERS\n" if length($mixedstring) != bytes::length($mixedstring);

    open my $f, ">", "test2";
    binmode $f;
    syswrite $f, $mixedstring or die;
    print "ALL OK\n";
    __END__
    SV = PV(0x1d6eb48) at 0x1c7fab8
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x1d79820 "123\303\221\302\204\303\221\302\204\303\221\302\204\303\221\302\204"\0 [UTF8 "123\x{d1}\x{84}\x{d1}\x{84}\x{d1}\x{84}\x{d1}\x{84}"]
      CUR = 19
      LEN = 24
    UTF IS ON
    LENGTH DIFFERS
    ALL OK

I appended each program's output after its __END__ marker.

Let's ignore for the moment the fact that these strings are completely different and contain different characters, and the fact that at some point one of the strings was interpreted as Latin-1, etc.

So, in both cases the strings have the UTF-8 flag set. Both contain non-ASCII (high-bit) octets. For both, length() and bytes::length() differ. I therefore expect these strings to behave the same way.

The question is why in one case the string was treated as a 'wide character' string and syswrite terminated the program, while in the other case everything worked fine.

P.S. Reproduced on perl 5.10 and perl 5.14 (Linux).

UPD: escaped the UTF characters in the source code, as PerlMonks eats them.

UPD: SOLVED: http://www.perlmonks.org/?node_id=1032996 http://www.perlmonks.org/?node_id=1033006

Replies are listed 'Best First'.
Re: Unicode strings internals
by kennethk (Abbot) on May 10, 2013 at 15:49 UTC
    If I'm reading your code correctly, the issue is that in your first case you have a properly formatted Perl string that contains UTF characters, but in the second you have a UTF-8 byte string, not a character string. The difference is discussed a bit in perluniintro and The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

    Because you explicitly invoke encode("UTF-8", ...), the mixed string contains bytes with the high bit set, but no wide characters (code points above 255). Outputting a byte string as binary is natural, but a Perl string that contains wide characters cannot be output without specifying an encoding.
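A minimal sketch of the "specify an encoding" route, assuming the goal is to write the first script's string with syswrite: encode the characters to UTF-8 octets first, then write the octets. The temporary file is a hypothetical stand-in for the thread's "test1".

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Character string: "123" followed by four U+0444 characters (7 characters).
my $chars = "123\x{444}\x{444}\x{444}\x{444}";

# Turn characters into octets explicitly; the result holds only values 0..255.
my $octets = encode("UTF-8", $chars);

# Scratch file (hypothetical stand-in for the thread's "test1").
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;

my $written = syswrite($fh, $octets);
defined $written or die "syswrite failed: $!";
close $fh or die "close failed: $!";

print "characters: ", length($chars), "\n";   # 7
print "octets written: $written\n";           # 11
```

The byte count matches CUR = 11 in the first Dump output, since each U+0444 takes two octets in UTF-8.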

    Does this clarify? If you describe the task you are trying to accomplish, we can probably help with the appropriate set of I/O specifications.


    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      in your first case you have a properly formatted Perl string that contains UTF characters
      Yes
      but in the second you have a UTF-8 byte string, not a character string.
      No. The second case does look like a UTF-8 character string, because it prints "UTF IS ON" and "LENGTH DIFFERS"
        Note that if you modify line 8 to
        my $ascii_but_utf = '123';
        the output changes to
            SV = PV(0x22ae1d0) at 0x2300b20
              REFCNT = 1
              FLAGS = (PADMY,POK,pPOK)
              PV = 0x22d21f0 "123\321\204\321\204\321\204\321\204"\0
              CUR = 11
              LEN = 16
            ALL OK
        This is because the UTF-8 flag being on is just a historical artifact of your initialization.
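To see how the flag ends up on the mixed string, here is a small sketch of the concatenation step. It assumes utf8::upgrade on a literal "123" is a fair stand-in for the flagged piece that split returned in poc2.pl, and it uses a single U+0444 instead of four to keep the numbers short.

```perl
use strict;
use warnings;
use bytes ();                 # load bytes::length without enabling the pragma
use Encode qw(encode);

# ASCII text with the UTF8 flag on (stand-in for the piece returned by split).
my $flagged_ascii = "123";
utf8::upgrade($flagged_ascii);

# Two octets, 0xD1 0x84, with the flag off.
my $octets = encode("UTF-8", "\x{444}");

# Concatenation upgrades $octets as if each octet were a Latin-1 character.
my $mixed = $flagged_ascii . $octets;

print "flag on\n" if utf8::is_utf8($mixed);
printf "length=%d bytes::length=%d\n",
    length($mixed), bytes::length($mixed);                      # 5 and 7
printf "4th character: U+%04X\n", ord(substr($mixed, 3, 1));    # U+00D1
```

The result holds the characters U+00D1 and U+0084, not U+0444, which is the same pattern the second Dump shows with CUR = 19.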

        If we take a look at the two output files generated by these two cases, you'll note that both contain 11 bytes, despite the fact that a byte dump of the UTF-upgraded case would suggest 19 bytes of output. This is because the internal representation of high-bit, 1-byte characters under Perl's implementation of UTF is multi-byte, even though they map cleanly to 1-byte characters on output. You wouldn't expect these 1-byte characters to trigger a wide-character warning any more than you'd expect an ASCII character to.
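The distinction can be checked without any I/O. The sketch below assumes, as the behavior in this thread suggests, that what matters for a raw handle is whether every character is below 256, not whether the UTF8 flag is set. fits_in_bytes is a hypothetical helper name; it relies on utf8::downgrade with its second argument set so that failure returns false instead of dying.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character of the string is below 256,
# i.e. the string can be represented as plain bytes.
sub fits_in_bytes {
    my ($copy) = @_;                           # copy, so the caller's string is untouched
    return utf8::downgrade($copy, 1) ? 1 : 0;  # 1 = return false instead of dying
}

# Characters 0xD1 and 0x84, stored in the upgraded (UTF8-flagged) form.
my $narrow = "123\x{d1}\x{84}";
utf8::upgrade($narrow);

# A real wide character, U+0444.
my $wide = "123\x{444}";

for my $pair ([narrow => $narrow], [wide => $wide]) {
    my ($name, $str) = @$pair;
    printf "%-6s flag=%d length=%d fits_in_bytes=%d\n",
        $name, utf8::is_utf8($str) ? 1 : 0, length($str), fits_in_bytes($str);
}
# narrow flag=1 length=5 fits_in_bytes=1
# wide   flag=1 length=4 fits_in_bytes=0
```

Both strings report the flag as on, yet only the second one contains a character that has no single-byte form, which corresponds to the case where syswrite complained.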

        The second case does look like a UTF-8 character string,

        You're thinking of that wrong; it could be a UTF character string, or a UTF-8 byte string. When dealing with non-ASCII characters in Perl, rare is the case when you should actually be thinking about Perl's internal representation.


        #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: Unicode strings internals
by Krambambuli (Curate) on May 10, 2013 at 16:49 UTC
    Have a look into the results of
        #!/usr/bin/perl
        use strict;
        use warnings;
        use Devel::Peek;
        use Encode;
        #use utf8;

        binmode STDOUT, ":utf8";

        my $string1 = "123\x{444}\x{444}\x{444}\x{444}";
        _display ($string1, 'STRING1' );

        my $utfstring = "123 \x{439}\x{439}\x{439}\x{439}";
        _display ($utfstring, 'UTF_STRING' );

        my ($ascii_but_utf, undef) = split ' ', $utfstring;
        _display ($ascii_but_utf, 'ASCII_BUT_UTF' );

        #my $bytestring = encode ("UTF-8", "\x{444}\x{444}\x{444}\x{444}");
        my $bytestring = "\x{444}\x{444}\x{444}\x{444}";
        _display ($bytestring, 'BYTESTRING' );

        my $mixedstring = "$ascii_but_utf$bytestring"; # simulate The Unicode Bug here
        _display ($mixedstring, 'MIXEDSTRING' );

        print "MIXEDSTRING and STRING1 are supposed to be identical...\n";

        exit;

        ###############
        sub _display {
            my ($string, $name) = @_;

            print "$name:\n";
            Dump $string;

            my $l1 = length($string);
            my $l2 = bytes::length($string);
            if ($l1 != $l2) {
                print "LENGTHs DIFFERS: length: $l1, bytes: $l2\n"
            }
            print "UTF IS ON\n" if utf8::is_utf8($string);
            print "\n";
        }
    and then check the difference you see for BYTESTRING when running

    my $bytestring = encode ("UTF-8", "\x{444}\x{444}\x{444}\x{444}");

    versus

    my $bytestring = "\x{444}\x{444}\x{444}\x{444}";

    The Encode documentation has a caveat about it:

    CAVEAT: When you run "$octets = encode("utf8", $string)", then $octets might not be equal to $string. Though
    both contain the same data, the UTF8 flag for $octets is always off. When you encode anything, the UTF8 flag
    on the result is always off, even when it contains a completely valid utf8 string. See "The UTF8 flag" below.
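The caveat is easy to observe directly. This short sketch only uses encode and decode from Encode plus utf8::is_utf8; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Two characters above 255, so Perl stores them with the UTF8 flag on.
my $chars  = "\x{444}\x{444}";

# encode() returns octets; the flag on the result is off.
my $octets = encode("UTF-8", $chars);

printf "chars:  flag=%s length=%d\n",
    (utf8::is_utf8($chars)  ? "on" : "off"), length($chars);    # on, 2
printf "octets: flag=%s length=%d\n",
    (utf8::is_utf8($octets) ? "on" : "off"), length($octets);   # off, 4

# The two strings are not equal, but decoding recovers the original characters.
print "equal before decode: ", ($octets eq $chars ? "yes" : "no"), "\n";   # no
my $back = decode("UTF-8", $octets);
print "equal after decode:  ", ($back eq $chars ? "yes" : "no"), "\n";     # yes
```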

      Yes, I understand that result of encode("utf8", ... ) is a byte string with UTF-8 flag off. But that does not answer the question in my post. In my example, both poc1.pl and poc2.pl print strings with UTF-8 on, with length <> bytes::length, but those strings behave differently. Why?
