in reply to Character encoding of microns
Have a look at what $clob and $convertedstr contain:

    bytes($clob) ;
    bytes($convertedstr) ;

    sub bytes {
      my ($s) = @_ ;
      my $w = utf8::is_utf8($s) ? "utf8" : "byte" ;
      use bytes ;
      print join(":", map(sprintf("%02X", $_), unpack('C*', $s))), " -- $w\n" ;
    } ;

which may tell you where your microns are getting lost.
Re^2: Character encoding of microns
by joec_ (Scribe) on Feb 06, 2009 at 21:58 UTC | |
The output of which was:

    74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte
    74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8

so, as you can see, it all matches up. Note that I tried your code on my Mac at home, so I will have to try it again at work.

I printed the text before and after conversion: it prints OK (with the micro symbol) before, but after using decode it displays ? on my Mac. What does this mean, then?

Like I said, I will try your code at work, but currently the text displays ? before and after conversion. I use 'more' on Linux and Notepad++ on Windows at work; both display ?.

Thanks,
Joe
Eschew obfuscation, espouse eludication!
by gone2015 (Deacon) on Feb 07, 2009 at 00:37 UTC | |
As you say, the strings are apparently identical, except that one is a "byte" string while the other is "utf8". Note that in both cases the strings contain the UTF-8 form of micron; this is significant, as we will see...

What you are seeing when you print to STDOUT takes a little explaining...

By default STDOUT will have no encoding associated with it, so Perl will assume that it is LATIN1 (or ISO-8859-1).

When you print the "byte" string, Perl sends the bytes, untouched, to STDOUT -- because Perl treats "byte" strings as if they were LATIN1. The two bytes that make up the UTF-8 for micron are passed all the way to the screen. The screen understands UTF-8, so presto! you see the micron character.

When you print the "utf8" string, however, Perl knows that it should convert the string to LATIN1. So the two-byte UTF-8 sequence 0xC2:0xB5 is converted to the LATIN1 equivalent 0xB5 (!). That is passed all the way to the screen. BUT, since the screen actually understands UTF-8, the lone 0xB5 byte is nonsense to it, so it shows some error character -- in your case, apparently '?'; on my screen, something I will describe as a splodge.

You can tell STDOUT that it's a UTF-8 file-handle using binmode, and PerlIO::get_layers returns information about how the file-handle is configured. Printing both strings before and after that binmode produces:

    clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte
    conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8
    unix perlio
    clob: 'this is string with µ in it'
    conv: 'this is string with ▒ in it'
    unix perlio encoding(utf-8-strict) utf8
    clob: 'this is string with Âµ in it'
    conv: 'this is string with µ in it'

So now you're asking yourself, where the MUMBLE did the 'Â' come from. Well... $clob is a byte string, which as far as Perl is concerned contains two LATIN1 characters, 0xC2 and 0xB5.
Now that it knows that STDOUT is UTF-8, it spots the 0xC2 and encodes it as its UTF-8 equivalent 0xC3:0x82, and it spots the 0xB5 and encodes it as 0xC2:0xB5. And yes, UTF-8 0xC3:0x82 is 'Â'. The message is that you have to be consistent:
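One reading of "consistent" here, sketched under my own assumptions rather than quoted from the post: either send undecoded octets to a handle with no encoding layer, or send decoded characters to a handle with a UTF-8 layer. The helper name `emitted` and the in-memory file-handle are illustrative choices, used so the bytes that reach the handle can be inspected directly.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: print $string to an in-memory handle, optionally
# configured with a PerlIO layer, and return the bytes that arrived there
# as colon-separated hex.
sub emitted {
    my ($string, $layer) = @_;
    my $buffer = '';
    open(my $fh, '>', \$buffer) or die "cannot open in-memory handle: $!";
    binmode($fh, $layer) if defined $layer;
    print {$fh} $string;
    close($fh) or die "close failed: $!";
    return join(':', map { sprintf('%02X', $_) } unpack('C*', $buffer));
}

my $octets = "\xC2\xB5";                 # the UTF-8 octets for micron ("byte" string)
my $chars  = decode('UTF-8', $octets);   # one character, U+00B5 ("utf8" string)

# Consistent: octets to a handle with no encoding layer.
print "octets, no layer:    ", emitted($octets), "\n";                     # C2:B5
# Consistent: characters to a handle with a UTF-8 layer.
print "chars,  UTF-8 layer: ", emitted($chars, ':encoding(UTF-8)'), "\n";  # C2:B5

# Mixed: characters to a handle with no layer lose the UTF-8 form.
print "chars,  no layer:    ", emitted($chars), "\n";                      # B5
# Mixed: octets to a UTF-8 handle are encoded a second time.
print "octets, UTF-8 layer: ", emitted($octets, ':encoding(UTF-8)'), "\n"; # C3:82:C2:B5
```

The two consistent combinations both deliver 0xC2:0xB5; the two mixed ones give the lone 0xB5 and the 'Â' prefix described above.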
But if you try mixing the two, confusion will reign.

See PerlIO::encoding, binmode, open and use open for more on encodings and file-handles, and perluniintro for more on Perl and Unicode.
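For anyone wanting to try the binmode and PerlIO::get_layers calls mentioned above, here is a minimal sketch. It is not the code from the original post, and the helper name `layers_of` is made up for illustration. On a typical Unix perl the two layer lists should look like the "unix perlio" lines in the output quoted earlier, though the exact layers depend on platform and environment.

```perl
use strict;
use warnings;

# Hypothetical helper: the PerlIO layers on a file-handle, space separated.
sub layers_of {
    my ($fh) = @_;
    return join(' ', PerlIO::get_layers($fh));
}

# Layers on STDOUT before any encoding is requested.
my $before = layers_of(\*STDOUT);

# Tell Perl that STDOUT expects UTF-8.
binmode(STDOUT, ':encoding(UTF-8)') or die "binmode failed: $!";
my $after = layers_of(\*STDOUT);

print "before: $before\n";
print "after:  $after\n";

# A character string containing U+00B5 is now encoded as 0xC2:0xB5 on output.
print "this is string with \x{B5} in it\n";
```

Putting the binmode call near the top of a script, before anything is printed, keeps all output on STDOUT under the same encoding.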
by joec_ (Scribe) on Feb 10, 2009 at 11:16 UTC | |
If I run your code with the micron encoded as \x{C2}\x{B5}, then just using decode('utf8', $clob) seems to work, as you can see from the first set of clob/conv strings below, after the bytes stuff. However, if I actually type a micron into the string using Alt-0181, then I get the following output (note that I turned use diagnostics on):
That last conv string is, I assume, your splodge? Perhaps then, as no question marks are being output, this is not an encoding problem at all? I honestly do appreciate all your time. Joe.
Eschew obfuscation, espouse eludication!
by almut (Canon) on Feb 10, 2009 at 14:56 UTC | |
by joec_ (Scribe) on Feb 12, 2009 at 09:27 UTC | |
by punkish (Priest) on Feb 07, 2009 at 13:04 UTC | |
--
when small people start casting long shadows, it is time to go to bed