Re^3: Character encoding of microns

in reply to Re^2: Character encoding of microns
in thread Character encoding of microns

As you say, the strings are apparently identical, except that one is a "byte" string while the other is "utf8". Note that in both cases the strings contain the UTF-8 form of micron; this is significant, as we will see...

What you are seeing when you print to STDOUT takes a little explaining...

By default STDOUT will have no encoding associated with it, so Perl will assume that it is LATIN1 (or ISO-8859-1).

When you print the "byte" string, Perl sends the bytes, untouched, to STDOUT -- because Perl treats "byte" strings as if they were LATIN1. The two bytes that make up the UTF-8 for micron are passed all the way to the screen. The screen understands UTF-8, so presto! you see the micron character.

When you print the "utf8" string, however, Perl knows that it should convert the string to LATIN1. So the two-byte UTF-8 sequence 0xC2:0xB5 is converted to the LATIN1 equivalent 0xB5 (!). That is passed all the way to the screen. BUT, since the screen actually understands UTF-8, the lone 0xB5 byte is nonsense to it, so it shows some error character: in your case, apparently '?'; on my screen, something I will describe as a splodge.
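The two cases above can be checked without a terminal getting in the way. This is only a minimal sketch: it assumes an in-memory file handle with no layers behaves like a default STDOUT (which would not hold if PERL_UNICODE or -C changes the default layers), and the helper names written_bytes and hex_of are made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Print a string to an in-memory file handle that has no encoding
# layer (a stand-in for a default STDOUT) and return the raw bytes.
sub written_bytes {
    my ($str) = @_;
    my $buf = '';
    open(my $fh, '>', \$buf) or die "cannot open in-memory handle: $!";
    print {$fh} $str;
    close($fh);
    return $buf;
}

# Hex dump helper (hypothetical name), e.g. "C2:B5".
sub hex_of {
    my ($bytes) = @_;
    return join(':', map { sprintf('%02X', ord($_)) } split(//, $bytes));
}

my $byte_str = "\x{C2}\x{B5}";              # UTF-8 bytes for micron, a "byte" string
my $utf8_str = decode('UTF-8', $byte_str);  # one character U+00B5, a "utf8" string

print 'byte string sends: ', hex_of(written_bytes($byte_str)), "\n";  # C2:B5
print 'utf8 string sends: ', hex_of(written_bytes($utf8_str)), "\n";  # B5
```

The byte string goes out untouched, while the utf8 string is downgraded to the single LATIN1 byte that a UTF-8 terminal cannot display.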

You can tell STDOUT that it's a UTF-8 file-handle using binmode, so:

use strict ;
use warnings ;

use PerlIO ;
use Encode;

my $clob = "this is string with \x{C2}\x{B5} in it";

my $convertedstr = decode("utf8",$clob);

print "clob: " ; bytes($clob) ;
print "conv: " ; bytes($convertedstr) ;

my @layers = PerlIO::get_layers(STDOUT) ; 
print "@layers\n" ;

print "clob: '$clob'\n" ;
print "conv: '$convertedstr'\n";

binmode(STDOUT, ":encoding(UTF-8)") ;

@layers = PerlIO::get_layers(STDOUT) ; 
print "@layers\n" ;

print "clob: '$clob'\n" ;
print "conv: '$convertedstr'\n";

sub bytes {
  my ($s) = @_ ;
  my $w = utf8::is_utf8($s) ? "utf8" : "byte" ;
  use bytes ;
  print join(":", map(sprintf("%02X", $_), unpack('C*', $s))), " -- $w\n" ;
} ;

where PerlIO::get_layers returns information about how the file-handle is configured. This produces:

clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte
conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8
unix perlio
clob: 'this is string with µ in it'
conv: 'this is string with ▒ in it'
unix perlio encoding(utf-8-strict) utf8
clob: 'this is string with Âµ in it'
conv: 'this is string with µ in it'

So now you're asking yourself, where the MUMBLE did the 'Âµ' come from. Well... $clob is a byte string, which as far as Perl is concerned contains two LATIN1 characters, 0xC2 and 0xB5. Now that it knows that STDOUT is UTF-8, it spots the 0xC2 and encodes it as its UTF-8 equivalent 0xC3:0x82, and it spots the 0xB5 and encodes it as 0xC2:0xB5. And yes, UTF-8 0xC3:0x82 is 'Â'.
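The double encoding can be reproduced in isolation. The following is a sketch rather than anything from the script above: it assumes an in-memory handle opened with :encoding(UTF-8) treats the string the same way STDOUT does after binmode, and hex_of is again a made-up helper.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex dump helper (hypothetical name), e.g. "C2:B5".
sub hex_of {
    my ($bytes) = @_;
    return join(':', map { sprintf('%02X', ord($_)) } split(//, $bytes));
}

# A byte string: to Perl, two LATIN1 characters, 0xC2 and 0xB5.
my $clob = "\x{C2}\x{B5}";

# Send it through a handle that has been told it is UTF-8.
my $buf = '';
open(my $fh, '>:encoding(UTF-8)', \$buf) or die "cannot open in-memory handle: $!";
print {$fh} $clob;
close($fh);

# Encoding the same string explicitly gives the same bytes.
my $explicit = encode('UTF-8', $clob);

print 'via layer:  ', hex_of($buf),      "\n";  # C3:82:C2:B5
print 'via encode: ', hex_of($explicit), "\n";  # C3:82:C2:B5
```

Each of the two LATIN1 characters becomes its own two-byte UTF-8 sequence, which is where the extra character on the screen comes from.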

The message is that you have to be consistent:

You can operate with byte strings that contain UTF-8 sequences and, provided you leave your file handles with no explicit encoding, those UTF-8 sequences will pass through untouched. Which is fine if the target device expects UTF-8 sequences.

But, of course, those UTF-8 sequences will look like two (or more) LATIN1 characters if you process the strings.

You can operate with utf8 strings that contain "wide characters" (held internally as UTF-8 sequences, as it happens) and, provided you set your file handles to :encoding(UTF-8), those wide characters will be encoded/decoded as they are output/input.

You can also operate with byte strings that contain LATIN1 characters, and file handles set to :encoding(UTF-8) will encode the characters as they are output.

Or you can leave your file handles with no explicit encoding, and encode/decode strings explicitly before output and after input.

But if you try mixing the two, confusion will reign.
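As an illustration of the "encode/decode explicitly" option, here is a small sketch. The sample string is invented for the example, and the input is assumed to have arrived as raw bytes from a handle with no encoding layer.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might be read from a handle with no layer:
# "5 ", the UTF-8 sequence for micron, then "m".
my $input_bytes = "5 \x{C2}\x{B5}m";

# Processed as a byte string, the micron counts as two characters.
print length($input_bytes), "\n";   # 5

# Decode once, straight after input ...
my $text = decode('UTF-8', $input_bytes);

# ... so that processing sees one character per micron ...
print length($text), "\n";          # 4

# ... and encode once, just before output to a handle with no layer.
my $output_bytes = encode('UTF-8', $text);
print $output_bytes eq $input_bytes ? "round trip ok\n" : "mismatch\n";
```

The point of the design is that every string inside the program is in one known state (decoded characters), and the conversion happens only at the edges.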

See PerlIO::encoding, binmode, open and use open for more on encodings and file-handles, and perluniintro for more on Perl and Unicode.

Re^4: Character encoding of microns by joec_ (Scribe) on Feb 10, 2009 at 11:16 UTC
Hi, I'm grateful for your detailed explanation, but I am still having problems. If I run your code with the micron encoded as `\x{C2}\x{B5}`, then just using `decode('utf8',$clob)` seems to work, as you can see from the first set of clob/conv strings below, after the bytes stuff.

clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte
conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8
unix perlio
clob: 'this is string with Âµ in it'
conv: 'this is string with µ in it'
unix perlio encoding(utf8) utf8
clob: 'this is string with ÃÂµ in it'
conv: 'this is string with Âµ in it'

However, if I actually type a micron into the string using Alt-0181, then I get the following output (note that I turned `use diagnostics` on):

clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:B5:20:69:6E:20:69:74 -- byte
conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:EF:BF:BD:20:69:6E:20:69:74 -- utf8
unix perlio
clob: 'this is string with µ in it'
Wide character in print at 742047.pl line 19 (#1)
    (W utf8) Perl met a wide character (>255) when it wasn't expecting
    one. This warning is by default on for I/O (like print). The easiest
    way to quiet this warning is simply to add the :utf8 layer to the
    output, e.g. binmode STDOUT, ':utf8'. Another way to turn off the
    warning is to add no warnings 'utf8'; but that is often closer to
    cheating. In general, you are supposed to explicitly mark the
    filehandle with an encoding, see open and perlfunc/binmode.

conv: 'this is string with ï¿½ in it'
unix perlio encoding(utf8) utf8
clob: 'this is string with Âµ in it'
conv: 'this is string with ï¿½ in it'

That last conv string is, I assume, your splodge? Perhaps then, as no question marks are being output, this is not an encoding problem at all?

I honestly do appreciate all your time.

Joe

Eschew obfuscation, espouse eludication!
Re^5: Character encoding of microns by almut (Canon) on Feb 10, 2009 at 14:56 UTC
"if i actually type a micron into the string using Alt-0181 then i get the following output..."

Apparently, your editor is operating in ISO-Latin1 mode and is entering the micron as a single byte (181 decimal = B5 hex). You're then telling Perl that this string is UTF-8 (i.e. the `decode("utf8",$clob)` statement from oshalla's code), which is incorrect. For this reason, the conversion (silently) fails and the incorrect part (B5 does not start a valid UTF-8 encoding sequence here) is being replaced by the unicode replacement character U+FFFD, which when encoded as UTF-8 produces the three-byte sequence EF BF BD.

When you interpret/display those three bytes as ISO-Latin1 characters they appear as "ï¿½", i.e. ï = EF, ¿ = BF, ½ = BD. This is how I (and I suppose everyone else, too) see them in your post, because the PM site isn't unicode aware. If your terminal displays those same three characters, this just means it isn't unicode aware either...

IOW, everything behaves as expected. :)
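The replacement described here can be reproduced with a few lines. This is a sketch, not code from the thread: hex_of is a made-up helper, and decoding as ISO-8859-1 is shown only on the assumption that the editor really did save the file in that encoding.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hex dump helper (hypothetical name), e.g. "C2:B5".
sub hex_of {
    my ($bytes) = @_;
    return join(':', map { sprintf('%02X', ord($_)) } split(//, $bytes));
}

my $latin1 = "\x{B5}";   # micron as a single ISO-8859-1 byte (Alt-0181)

# Wrong: claiming the byte is UTF-8. The malformed byte is silently
# replaced by U+FFFD, the replacement character.
my $wrong = decode('UTF-8', $latin1);
printf "U+%04X\n", ord($wrong);                 # U+FFFD
print hex_of(encode('UTF-8', $wrong)), "\n";    # EF:BF:BD

# Right, if the input is ISO-8859-1: decode it as what it really is.
my $right = decode('ISO-8859-1', $latin1);
printf "U+%04X\n", ord($right);                 # U+00B5
print hex_of(encode('UTF-8', $right)), "\n";    # C2:B5

# Optionally make the bad decode fail loudly instead of silently.
my $copy = $latin1;      # decode may modify its argument when CHECK is set
my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
print $ok ? "decoded\n" : "not valid UTF-8\n";  # not valid UTF-8
```

Passing a CHECK value such as Encode::FB_CROAK is one way to find out early that the input is not in the encoding you assumed.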
Re^6: Character encoding of microns by joec_ (Scribe) on Feb 12, 2009 at 09:27 UTC
Hi,

So, how would I get round the problem of question marks being displayed both in my terminal for microns and in any output that is written to a file? When I open my output file in a hex editor, a 3F is displayed for the question mark, indicating that an actual ? is written and it isn't a foreign character. No strange chars like above show up.

I think I'm hitting a brick wall with this.

Thanks

Joe

Eschew obfuscation, espouse eludication!
Re^7: Character encoding of microns by graff (Chancellor) on Feb 13, 2009 at 06:10 UTC
Re^4: Character encoding of microns by punkish (Priest) on Feb 07, 2009 at 13:04 UTC
What a wonderful and careful explanation. oshalla++. This reply should be front-paged on its own.

-- when small people start casting long shadows, it is time to go to bed

In Section Seekers of Perl Wisdom