Re: Character encoding of microns

Re^2: Character encoding of microns by joec_ (Scribe) on Feb 06, 2009 at 21:58 UTC
Hi, I tried your bytes code with this data:

    use Encode;

    $clob = "this is string with [micro sign here] in it";
    $convertedstr = decode("utf8", $clob);

    print $clob;
    print $convertedstr;

    bytes($clob);
    bytes($convertedstr);

    # Dump a string as hex and report whether Perl has it flagged as utf8.
    sub bytes {
        my ($s) = @_;
        my $w = utf8::is_utf8($s) ? "utf8" : "byte";
        use bytes;
        print join(":", map(sprintf("%02X", $_), unpack('C*', $s))), " -- $w\n";
    }

The output of which was:

    74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte
    74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8

So, as you can see, it all matches up. Note that I tried your code on my Mac at home, so I will have to try it again at work.

I printed the text before and after conversion. It prints OK (with the micro symbol) before, but after using decode it displays ? on my Mac. What does this mean then?

Like I said, I will try your code at work, but currently the text there displays ? both before and after conversion. At work I use 'more' on Linux and Notepad++ on Windows; both display ?.

Thanks
Joe

Eschew obfuscation, espouse eludication!
Re^3: Character encoding of microns by gone2015 (Deacon) on Feb 07, 2009 at 00:37 UTC
As you say, the strings are apparently identical, except that one is a "byte" string while the other is "utf8". Note that in both cases the strings contain the UTF-8 form of micron; this is significant, as we will see...

What you are seeing when you print to `STDOUT` takes a little explaining...

By default `STDOUT` will have no encoding associated with it, so Perl will assume that it is LATIN1 (or ISO-8859-1).

When you print the "byte" string, Perl sends the bytes, untouched, to `STDOUT` -- because Perl treats "byte" strings as if they were LATIN1. The two bytes that make up the UTF-8 for micron are passed all the way to the screen. The screen understands UTF-8, so presto! you see the micron character.

When you print the "utf8" string, however, Perl knows that it should convert the string to LATIN1. So the two-byte UTF-8 sequence 0xC2:0xB5 is converted to the LATIN1 equivalent 0xB5 (!). That is passed all the way to the screen. BUT, since the screen actually understands UTF-8, the lone 0xB5 byte is nonsense to it, so it shows some error character -- in your case, apparently '?'; on my screen, something I will describe as a splodge.

You can tell `STDOUT` that it's a UTF-8 file-handle using `binmode`, so:

    use strict ;
    use warnings ;

    use PerlIO ;
    use Encode ;

    my $clob = "this is string with \x{C2}\x{B5} in it" ;
    my $convertedstr = decode("utf8", $clob) ;

    print "clob: " ; bytes($clob) ;
    print "conv: " ; bytes($convertedstr) ;

    # First pass: STDOUT has no explicit encoding layer.
    my @layers = PerlIO::get_layers(STDOUT) ;
    print "@layers\n" ;
    print "clob: '$clob'\n" ;
    print "conv: '$convertedstr'\n" ;

    # Second pass: STDOUT is marked as a UTF-8 file-handle.
    binmode(STDOUT, ":encoding(UTF-8)") ;
    @layers = PerlIO::get_layers(STDOUT) ;
    print "@layers\n" ;
    print "clob: '$clob'\n" ;
    print "conv: '$convertedstr'\n" ;

    # Dump a string as hex and report whether Perl has it flagged as utf8.
    sub bytes {
      my ($s) = @_ ;
      my $w = utf8::is_utf8($s) ? "utf8" : "byte" ;
      use bytes ;
      print join(":", map(sprintf("%02X", $_), unpack('C*', $s))), " -- $w\n" ;
    } ;

where `PerlIO::get_layers` returns information about how the file-handle is configured.
This produces:

    clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte
    conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8
    unix perlio
    clob: 'this is string with µ in it'
    conv: 'this is string with ▒ in it'
    unix perlio encoding(utf-8-strict) utf8
    clob: 'this is string with Âµ in it'
    conv: 'this is string with µ in it'

So now you're asking yourself, where the MUMBLE did the '`Âµ`' come from. Well... `$clob` is a byte string, which as far as Perl is concerned contains two LATIN1 characters, 0xC2 and 0xB5. Now that it knows that `STDOUT` is UTF-8, it spots the 0xC2 and encodes it as its UTF-8 equivalent 0xC3:0x82, and it spots the 0xB5 and encodes it as 0xC2:0xB5. And yes, UTF-8 0xC3:0x82 is '`Â`'.

The message is that you have to be consistent:

- You can operate with byte strings that contain UTF-8 sequences, and provided you leave your file handles with no explicit encoding, those UTF-8 sequences will pass through untouched. Which is fine if the target device expects UTF-8 sequences. But, of course, those UTF-8 sequences will look like two (or more) LATIN1 characters if you process the strings.

- You can operate with utf8 strings that contain "wide characters" (held internally as UTF-8 sequences, as it happens), and provided you set your file handles to `:encoding(UTF-8)` those wide characters will be encoded/decoded as they are output/input.

- You can also operate with byte strings that contain LATIN1 characters, and file handles set to `:encoding(UTF-8)` will encode those characters as they are output.

- Or you can leave your file handles with no explicit encoding, and encode/decode strings explicitly before output and after input.

But if you try mixing the two, confusion will reign.

See PerlIO::encoding, binmode, open and use open for more on encodings and file-handles, and perluniintro for more on Perl and Unicode.
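For anyone who wants to check the combinations above without a terminal getting in the way, here is a minimal sketch that writes to an in-memory file-handle and dumps the bytes that actually come out. The helper names `hex_bytes` and `written_bytes` are made up for this illustration, and the in-memory handle is only a stand-in for `STDOUT`, on the assumption that the layers behave the same way on both.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Render a byte string as colon-separated hex.
sub hex_bytes {
    my ($s) = @_;
    return join ':', map { sprintf '%02X', $_ } unpack 'C*', $s;
}

# Print $string to an in-memory file-handle opened with the given layer
# ('' for no explicit encoding) and return the bytes that were written.
sub written_bytes {
    my ($layer, $string) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "open failed: $!";
    print {$fh} $string;
    close $fh or die "close failed: $!";
    return hex_bytes($buffer);
}

my $bytes = "\xC2\xB5";                # byte string holding UTF-8 for micron
my $chars = decode('UTF-8', $bytes);   # utf8 string holding one character

for my $case (
    [ 'no layer, byte string',        '',                 $bytes ],   # C2:B5
    [ 'no layer, utf8 string',        '',                 $chars ],   # B5
    [ 'UTF-8 layer, byte string',     ':encoding(UTF-8)', $bytes ],   # C3:82:C2:B5
    [ 'UTF-8 layer, utf8 string',     ':encoding(UTF-8)', $chars ],   # C2:B5
    [ 'no layer, explicitly encoded', '', encode('UTF-8', $chars) ],  # C2:B5
) {
    my ($label, $layer, $string) = @$case;
    printf "%-30s %s\n", $label, written_bytes($layer, $string);
}
```

The second and third cases are the two "mixed" ones: a utf8 string sent to a handle with no encoding comes out as the lone LATIN1 byte, and a byte string of UTF-8 sent to a UTF-8 handle comes out encoded twice.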
Re^4: Character encoding of microns by joec_ (Scribe) on Feb 10, 2009 at 11:16 UTC
Hi,

I'm grateful for your detailed explanation, but I am still having problems.

If I run your code with the micron encoded as `\x{C2}\x{B5}`, then just using `decode('utf8',$clob)` seems to work, as you can see from the first set of clob/conv strings below, after the bytes stuff.

    clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte
    conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8
    unix perlio
    clob: 'this is string with Âµ in it'
    conv: 'this is string with µ in it'
    unix perlio encoding(utf8) utf8
    clob: 'this is string with ÃÂµ in it'
    conv: 'this is string with Âµ in it'

However, if I actually type a micron into the string using Alt-0181, then I get the following output (note that I turned `use diagnostics` on):

    clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:B5:20:69:6E:20:69:74 -- byte
    conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:EF:BF:BD:20:69:6E:20:69:74 -- utf8
    unix perlio
    clob: 'this is string with µ in it'
    Wide character in print at 742047.pl line 19 (#1)
        (W utf8) Perl met a wide character (>255) when it wasn't expecting
        one. This warning is by default on for I/O (like print). The easiest
        way to quiet this warning is simply to add the :utf8 layer to the
        output, e.g. binmode STDOUT, ':utf8'. Another way to turn off the
        warning is to add no warnings 'utf8'; but that is often closer to
        cheating. In general, you are supposed to explicitly mark the
        filehandle with an encoding, see open and perlfunc/binmode.

    conv: 'this is string with ï¿½ in it'
    unix perlio encoding(utf8) utf8
    clob: 'this is string with Âµ in it'
    conv: 'this is string with ï¿½ in it'

That last conv string is, I assume, your splodge?

Perhaps then, as no question marks are being output, this is not an encoding problem at all?

I honestly do appreciate all your time.

Joe.

Eschew obfuscation, espouse eludication!
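One possible reading of the second output, offered as an assumption rather than a confirmed diagnosis: if Alt-0181 put the single byte 0xB5 into the source file (as it would in a LATIN1 or Windows-1252 file), then that byte is not valid UTF-8 on its own, and decoding it as UTF-8 substitutes the replacement character U+FFFD, whose UTF-8 form is EF:BF:BD. The sketch below compares that with decoding the same byte as ISO-8859-1; `hex_bytes` is a helper made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Render a byte string as colon-separated hex.
sub hex_bytes {
    my ($s) = @_;
    return join ':', map { sprintf '%02X', $_ } unpack 'C*', $s;
}

# Assumption: the typed micron reached Perl as the single byte 0xB5.
my $typed = "\xB5";

# Decoded as UTF-8, the lone byte is malformed and becomes U+FFFD.
my $as_utf8 = decode('UTF-8', $typed);

# Decoded as ISO-8859-1, the same byte becomes the micron, U+00B5.
my $as_latin1 = decode('ISO-8859-1', $typed);

printf "decoded as UTF-8:      U+%04X, re-encoded %s\n",
    ord $as_utf8, hex_bytes(encode('UTF-8', $as_utf8));
printf "decoded as ISO-8859-1: U+%04X, re-encoded %s\n",
    ord $as_latin1, hex_bytes(encode('UTF-8', $as_latin1));
```

If that assumption holds, the input would need to be decoded with the encoding it was actually saved in before being written out as UTF-8.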
Re^5: Character encoding of microns by almut (Canon) on Feb 10, 2009 at 14:56 UTC
Re^6: Character encoding of microns by joec_ (Scribe) on Feb 12, 2009 at 09:27 UTC
Re^4: Character encoding of microns by punkish (Priest) on Feb 07, 2009 at 13:04 UTC
What a wonderful and careful explanation. oshalla++. This reply should be front-paged on its own.

--
when small people start casting long shadows, it is time to go to bed


	PerlMonks