http://www.perlmonks.org?node_id=742047


in reply to Re^2: Character encoding of microns
in thread Character encoding of microns

As you say, the strings are apparently identical, except that one is a "byte" string while the other is "utf8". Note that in both cases the strings contain the UTF-8 form of micron; this is significant, as we will see...

What you are seeing when you print to STDOUT takes a little explaining...

By default STDOUT will have no encoding associated with it, so Perl will assume that it is LATIN1 (or ISO-8859-1).

When you print the "byte" string, Perl sends the bytes, untouched, to STDOUT -- because Perl treats "byte" strings as if they were LATIN1. The two bytes that make up the UTF-8 for micron are passed all the way to the screen. The screen understands UTF-8, so presto! you see the micron character.

When you print the "utf8" string, however, Perl knows that it should convert the string to LATIN1. So the two-byte UTF-8 sequence 0xC2:0xB5 is converted to the LATIN1 equivalent 0xB5 (!). That is passed all the way to the screen. BUT, since the screen actually understands UTF-8, the lone 0xB5 byte is nonsense to it, so it shows some error character: in your case, apparently '?'; on my screen, something I will describe as a splodge.
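A minimal sketch of the two cases above, assuming an in-memory file-handle behaves like a STDOUT with no encoding layer. The helper name printed_hex is hypothetical; it just captures the bytes that print actually emits:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: print $str to an in-memory handle that has no
# encoding layer, then return the bytes that arrived, as hex.
sub printed_hex {
    my ($str) = @_;
    my $buffer = '';
    open(my $fh, '>', \$buffer) or die "cannot open in-memory handle: $!";
    print {$fh} $str;
    close($fh);
    return join(':', map { sprintf('%02X', $_) } unpack('C*', $buffer));
}

my $clob = "\x{C2}\x{B5}";           # "byte" string holding the UTF-8 for micron
my $conv = decode('UTF-8', $clob);   # "utf8" string holding the character U+00B5

print "clob: ", printed_hex($clob), "\n";   # clob: C2:B5
print "conv: ", printed_hex($conv), "\n";   # conv: B5
```

The "byte" string goes out untouched, while the "utf8" string is reduced to the single LATIN1 byte that a UTF-8 terminal cannot display.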

You can tell STDOUT that it's a UTF-8 file-handle using binmode, so:

use strict ;
use warnings ;

use PerlIO ;
use Encode ;

my $clob = "this is string with \x{C2}\x{B5} in it" ;
my $convertedstr = decode("utf8", $clob) ;

# Show the internal bytes of each string
print "clob: " ; bytes($clob) ;
print "conv: " ; bytes($convertedstr) ;

# Print both strings with STDOUT in its default state
my @layers = PerlIO::get_layers(STDOUT) ;
print "@layers\n" ;

print "clob: '$clob'\n" ;
print "conv: '$convertedstr'\n" ;

# Print both strings again with STDOUT marked as UTF-8
binmode(STDOUT, ":encoding(UTF-8)") ;

@layers = PerlIO::get_layers(STDOUT) ;
print "@layers\n" ;

print "clob: '$clob'\n" ;
print "conv: '$convertedstr'\n" ;

sub bytes {
  my ($s) = @_ ;
  my $w = utf8::is_utf8($s) ? "utf8" : "byte" ;
  use bytes ;
  print join(":", map(sprintf("%02X", $_), unpack('C*', $s))), " -- $w\n" ;
} ;

where the PerlIO::get_layers is returning information about how the file-handle is configured. This produces:
clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte
conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8
unix perlio
clob: 'this is string with µ in it'
conv: 'this is string with ▒ in it'
unix perlio encoding(utf-8-strict) utf8
clob: 'this is string with Âµ in it'
conv: 'this is string with µ in it'
So now you're asking yourself, where the MUMBLE did the 'Âµ' come from? Well... $clob is a byte string, which as far as Perl is concerned contains two LATIN1 characters, 0xC2 and 0xB5. Now that it knows that STDOUT is UTF-8, it spots the 0xC2 and encodes it as its UTF-8 equivalent 0xC3:0x82, and it spots the 0xB5 and encodes it as 0xC2:0xB5. And yes, UTF-8 0xC3:0x82 is 'Â'.
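The double encoding can be seen in isolation with a short sketch, assuming an in-memory file-handle opened with :encoding(UTF-8) stands in for STDOUT after the binmode call. The helper name printed_hex_utf8 is hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: print $str through a :encoding(UTF-8) layer into an
# in-memory buffer and return the resulting bytes as hex.
sub printed_hex_utf8 {
    my ($str) = @_;
    my $buffer = '';
    open(my $fh, '>:encoding(UTF-8)', \$buffer)
        or die "cannot open in-memory handle: $!";
    print {$fh} $str;
    close($fh);
    return join(':', map { sprintf('%02X', $_) } unpack('C*', $buffer));
}

my $clob = "\x{C2}\x{B5}";           # two LATIN1 characters as far as Perl knows
my $conv = decode('UTF-8', $clob);   # one character, U+00B5

print "clob: ", printed_hex_utf8($clob), "\n";   # clob: C3:82:C2:B5
print "conv: ", printed_hex_utf8($conv), "\n";   # conv: C2:B5
```

Each of the two bytes in the "byte" string is encoded separately, which is where the extra character comes from.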

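The message is that you have to be consistent: either keep the UTF-8 in "byte" strings and print them to a file-handle with no encoding layer, or decode to "utf8" strings and print them to a file-handle that has been told it is UTF-8.

But if you try mixing the two, confusion will reign.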
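One way to stay consistent is to decode once on the way in and let the file-handle encode on the way out. This is only a sketch: the sample string is made up, and it assumes the incoming bytes really are UTF-8 and that the terminal expects UTF-8:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Tell STDOUT once that it is a UTF-8 file-handle
binmode(STDOUT, ':encoding(UTF-8)');

# Made-up sample: bytes as they might arrive from a file or a database
my $raw  = "5 \x{C2}\x{B5}m";
my $text = decode('UTF-8', $raw);   # from here on, work with characters only

printf "%d bytes in, %d characters\n", length($raw), length($text);
print "$text\n";
```

With this arrangement only decoded strings ever reach print, so the encoding layer never sees bytes that are already UTF-8.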

See PerlIO::encoding, binmode, open and use open for more on encodings and file-handles, and perluniintro for more on Perl and Unicode.

Re^4: Character encoding of microns
by joec_ (Scribe) on Feb 10, 2009 at 11:16 UTC
    Hi, I'm grateful for your detailed explanation. But I am still having problems.

    If I run your code with the micron encoded as \x{C2}\x{B5}, then just using decode('utf8',$clob) seems to work, as you can see from the first set of clob/conv strings below, after the bytes stuff.

    clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte
    conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8
    unix perlio
    clob: 'this is string with µ in it'
    conv: 'this is string with µ in it'
    unix perlio encoding(utf8) utf8
    clob: 'this is string with õ in it'
    conv: 'this is string with µ in it'
    However, if I actually type a micron into the string using Alt-0181, then I get the following output (note that I turned use diagnostics on):
    clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:B5:20:69:6E:20:69:74 -- byte
    conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:EF:BF:BD:20:69:6E:20:69:74 -- utf8
    unix perlio
    clob: 'this is string with µ in it'
    Wide character in print at 742047.pl line 19 (#1)
        (W utf8) Perl met a wide character (>255) when it wasn't expecting
        one. This warning is by default on for I/O (like print). The easiest
        way to quiet this warning is simply to add the :utf8 layer to the
        output, e.g. binmode STDOUT, ':utf8'. Another way to turn off the
        warning is to add no warnings 'utf8'; but that is often closer to
        cheating. In general, you are supposed to explicitly mark the
        filehandle with an encoding, see open and perlfunc/binmode.
    conv: 'this is string with � in it'
    unix perlio encoding(utf8) utf8
    clob: 'this is string with µ in it'
    conv: 'this is string with � in it'

    That last conv string is, I assume, your splodge? Perhaps then, as no question marks are being output, this is not an encoding problem at all?

    I honestly do appreciate all your time

    Joe.

    Eschew obfuscation, espouse eludication!
      if I actually type a micron into the string using Alt-0181, then I get the following output...

      Apparently, your editor is operating in ISO-Latin1 mode and is entering the micron as a single byte (181 decimal = B5 hex).

      You're then telling Perl that this string is UTF-8 (i.e. the decode("utf8",$clob) statement from oshalla's code), which is incorrect. For this reason, the conversion (silently) fails and the incorrect part (B5 does not start a valid UTF-8 encoding sequence here) is being replaced by the unicode replacement character U+FFFD, which when encoded as UTF-8 produces the three-byte sequence EF BF BD.

      When you interpret/display those three bytes as ISO-Latin1 characters they appear as "ï¿½", i.e. ï = EF, ¿ = BF, ½ = BD. This is how I (and I suppose everyone else, too) see them in your post, because the PM site isn't unicode aware. If your terminal displays those same three characters, this just means it isn't unicode aware either...

      IOW, everything behaves as expected. :)
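The decode failure described above can be reproduced in a few lines. This is a sketch only: the variable names are made up, and it assumes the editor really did store the micron as the single LATIN1 byte 0xB5:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Format the bytes of a byte string as colon-separated hex
sub hex_bytes { join ':', map { sprintf '%02X', $_ } unpack 'C*', $_[0] }

my $typed = "\x{B5}";   # assumed: the single byte an Alt-0181 keystroke stored

my $wrong = decode('UTF-8', $typed);        # malformed input becomes U+FFFD
my $right = decode('ISO-8859-1', $typed);   # U+00B5, the micron

printf "wrong: U+%04X -> %s\n", ord($wrong), hex_bytes(encode('UTF-8', $wrong));
printf "right: U+%04X -> %s\n", ord($right), hex_bytes(encode('UTF-8', $right));
# wrong: U+FFFD -> EF:BF:BD
# right: U+00B5 -> C2:B5

# Asking Encode to croak turns the silent substitution into an error
my $copy = $typed;
my $ok = eval {
    decode('UTF-8', $copy, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print "strict decode ", ($ok ? "succeeded" : "failed"), "\n";   # strict decode failed
```

Decoding with the encoding the bytes are actually in gives the micron; the strict check is one way to catch a wrong guess early instead of finding U+FFFD in the output later.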

        hi,

        So, how would I get round the problem of question marks being displayed in my terminal for microns, and also in any output that is written to a file? When I open my output file in a hex editor, a 3F is displayed for the question mark, indicating that an actual ? is written and it isn't a foreign character. No strange chars like above show up.

        I think I'm hitting a brick wall with this.

        Thanks

        Joe

        Eschew obfuscation, espouse eludication!
Re^4: Character encoding of microns
by punkish (Priest) on Feb 07, 2009 at 13:04 UTC
    What a wonderful and careful explanation. oshalla++. This reply should be front-paged on its own.
    --

    when small people start casting long shadows, it is time to go to bed