PerlMonks  

Re^2: Character encoding of microns

by joec_ (Scribe)
on Feb 06, 2009 at 21:58 UTC (#742016)


in reply to Re: Character encoding of microns
in thread Character encoding of microns

Hi, i tried your bytes code with this data:

    use Encode;

    $clob = "this is string with [micro sign here] in it";
    $convertedstr = decode("utf8",$clob);

    print $clob;
    print $convertedstr;

    bytes($clob) ;
    bytes($convertedstr) ;

    sub bytes {
      my ($s) = @_ ;
      my $w = utf8::is_utf8($s) ? "utf8" : "byte" ;
      use bytes ;
      print join(":", map(sprintf("%02X", $_), unpack('C*', $s))), " -- $w\n" ;
    } ;

The output of which was:

74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte

74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8

So, as you can see, it all matches up. Interestingly, I tried your code on my Mac at home, so I will have to try it at work as well. I printed the text before and after conversion: it prints OK (with the micro symbol) before, but after using decode it displays ? on my Mac.

What does this mean, then? Like I said, I will try your code at work, but currently the text displays ? before and after conversion. At work I use 'more' on Linux and Notepad++ on Windows, and both display ?.

Thanks

Joe

Eschew obfuscation, espouse eludication!


Re^3: Character encoding of microns
by oshalla (Deacon) on Feb 07, 2009 at 00:37 UTC

    As you say, the strings are apparently identical, except that one is a "byte" string while the other is "utf8". Note that in both cases the strings contain the UTF-8 form of micron. This is significant, as we will see...

    What you are seeing when you print to STDOUT takes a little explaining...

    By default STDOUT will have no encoding associated with it, so Perl will assume that it is LATIN1 (or ISO-8859-1).

    When you print the "byte" string, Perl sends the bytes, untouched, to STDOUT -- because Perl treats "byte" strings as if they were LATIN1. The two bytes that make up the UTF-8 for micron are passed all the way to the screen. The screen understands UTF-8, so presto! you see the micron character.

    When you print the "utf8" string, however, Perl knows that it should convert the string to LATIN1. So the two-byte UTF-8 sequence 0xC2:0xB5 is converted to the LATIN1 equivalent 0xB5 (!). That is passed all the way to the screen. BUT, since the screen actually understands UTF-8, the lone 0xB5 byte is nonsense to it, so it shows some error character: apparently '?' in your case, and on my screen something I will describe as a splodge.
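    The two cases above can be checked without a terminal getting in the way. What follows is only a minimal sketch: it assumes an in-memory file handle with no encoding layer behaves like a default STDOUT, and the helper names hex_bytes and printed_raw are made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hex-dump the bytes held in a byte string.
sub hex_bytes { join ':', map { sprintf '%02X', $_ } unpack 'C*', $_[0] }

# Hypothetical helper: print $string to an in-memory handle that has no
# encoding layer (standing in for a default STDOUT) and return the bytes
# that were actually written.
sub printed_raw {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return hex_bytes($buffer);
}

my $byte_string = "\xC2\xB5";                     # UTF-8 bytes for the micro sign
my $char_string = decode('UTF-8', $byte_string);  # one character, U+00B5

my $raw_from_bytes = printed_raw($byte_string);   # bytes pass through untouched
my $raw_from_chars = printed_raw($char_string);   # character is written as LATIN1

print "byte string -> $raw_from_bytes\n";
print "utf8 string -> $raw_from_chars\n";
```

    The byte string should arrive as C2:B5 and the utf8 string as the lone B5 that a UTF-8 terminal cannot display.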

    You can tell STDOUT that it's a UTF-8 file-handle using binmode, so:

        use strict ;
        use warnings ;

        use PerlIO ;
        use Encode ;

        my $clob = "this is string with \x{C2}\x{B5} in it" ;
        my $convertedstr = decode("utf8", $clob) ;

        print "clob: " ; bytes($clob) ;
        print "conv: " ; bytes($convertedstr) ;

        my @layers = PerlIO::get_layers(STDOUT) ;
        print "@layers\n" ;
        print "clob: '$clob'\n" ;
        print "conv: '$convertedstr'\n" ;

        binmode(STDOUT, ":encoding(UTF-8)") ;

        @layers = PerlIO::get_layers(STDOUT) ;
        print "@layers\n" ;
        print "clob: '$clob'\n" ;
        print "conv: '$convertedstr'\n" ;

        sub bytes {
          my ($s) = @_ ;
          my $w = utf8::is_utf8($s) ? "utf8" : "byte" ;
          use bytes ;
          print join(":", map(sprintf("%02X", $_), unpack('C*', $s))), " -- $w\n" ;
        } ;

    where the PerlIO::get_layers is returning information about how the file-handle is configured. This produces:
    clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte
    conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8
    unix perlio
    clob: 'this is string with µ in it'
    conv: 'this is string with ▒ in it'
    unix perlio encoding(utf-8-strict) utf8
    clob: 'this is string with Âµ in it'
    conv: 'this is string with µ in it'
    
    So now you're asking yourself, where the MUMBLE did the 'Âµ' come from? Well... $clob is a byte string, which as far as Perl is concerned contains two LATIN1 characters, 0xC2 and 0xB5. Now that it knows that STDOUT is UTF-8, it spots the 0xC2 and encodes it as its UTF-8 equivalent 0xC3:0x82, and it spots the 0xB5 and encodes it as 0xC2:0xB5. And yes, UTF-8 0xC3:0x82 is 'Â'.
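    The double encoding can be made visible by looking at the bytes rather than the glyphs. This is a sketch under the same assumption as before, namely that an in-memory handle opened with :encoding(UTF-8) stands in for STDOUT after the binmode call; hex_bytes and printed_utf8 are names invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hex-dump the bytes held in a byte string.
sub hex_bytes { join ':', map { sprintf '%02X', $_ } unpack 'C*', $_[0] }

# Hypothetical helper: print $string to an in-memory handle that carries
# an :encoding(UTF-8) layer and return the bytes that were written.
sub printed_utf8 {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:encoding(UTF-8)', \$buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return hex_bytes($buffer);
}

my $byte_string = "\xC2\xB5";                     # two "LATIN1 characters" to Perl
my $char_string = decode('UTF-8', $byte_string);  # one character, U+00B5

my $utf8_from_bytes = printed_utf8($byte_string); # each byte is encoded separately
my $utf8_from_chars = printed_utf8($char_string); # the character is encoded once

print "byte string -> $utf8_from_bytes\n";
print "utf8 string -> $utf8_from_chars\n";
```

    The byte string should come out as C3:82:C2:B5 (the 'Âµ' above) and the utf8 string as C2:B5.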

    The message is that you have to be consistent:

    • you can operate with byte strings that contain UTF-8 sequences, and provided you leave your file handles with no explicit encoding, those UTF-8 sequences will pass through untouched. Which is fine if the target device expects UTF-8 sequences.

      But, of course, those UTF-8 sequences will look like two (or more) LATIN1 characters if you process the strings.

    • you can operate with utf8 strings that contain "wide characters" (held internally as UTF-8 sequences, as it happens), and provided you set your file handles to :encoding(UTF-8) those wide characters will be encoded/decoded as they are output/input.

      You can also operate with byte strings that contain LATIN1 characters, and file handles set to :encoding(UTF-8) will encode those characters as they are output.

      Or you can leave your file handles with no explicit encoding, and encode/decode strings explicitly before output and after input.

    But if you try mixing the two, confusion will reign.
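    As one illustration of the explicit route (no encoding on the file handles, decode after input, encode before output), here is a small sketch. The sample text and variable names are invented for the example, and the bytes are assumed to have arrived from a raw handle.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a handle with no encoding layer:
# "10 ", the UTF-8 form of the micro sign, then "m".
my $input_bytes = "10 \xC2\xB5m";

# Decode once, right after input; from here on work with characters.
my $text = decode('UTF-8', $input_bytes);

my $byte_length = length $input_bytes;   # counts bytes
my $char_length = length $text;          # counts characters

# Encode once, right before output to a handle with no encoding layer.
my $output_bytes = encode('UTF-8', $text);

print "bytes: $byte_length, characters: $char_length\n";
print "round trip intact\n" if $output_bytes eq $input_bytes;
```

    The byte string is six bytes long but the decoded text is five characters, which is the "two LATIN1 characters" effect mentioned above. Encoding on the way out restores the original bytes.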

    See PerlIO::encoding, binmode, open and use open for more on encodings and file-handles, and perluniintro for more on Perl and Unicode.

      what a wonderful and careful explanation. oshalla++. This reply should be front-paged on its own.
      --

      when small people start casting long shadows, it is time to go to bed
      Hi, I'm grateful for your detailed explanation, but I am still having problems.

      If I run your code with the micron encoded as \x{C2}\x{B5}, then just using decode('utf8',$clob) seems to work, as you can see from the first set of clob/conv strings below, after the bytes stuff.

      clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte
      conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8
      unix perlio
      clob: 'this is string with µ in it'
      conv: 'this is string with in it'
      unix perlio encoding(utf8) utf8
      clob: 'this is string with µ in it'
      conv: 'this is string with µ in it'

      However, if I actually type a micron into the string using Alt-0181, I get the following output (note that I turned use diagnostics on):

      clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:B5:20:69:6E:20:69:74 -- byte
      conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:EF:BF:BD:20:69:6E:20:69:74 -- utf8
      unix perlio
      clob: 'this is string with in it'
      Wide character in print at 742047.pl line 19 (#1)
          (W utf8) Perl met a wide character (>255) when it wasn't expecting
          one. This warning is by default on for I/O (like print). The easiest
          way to quiet this warning is simply to add the :utf8 layer to the
          output, e.g. binmode STDOUT, ':utf8'. Another way to turn off the
          warning is to add no warnings 'utf8'; but that is often closer to
          cheating. In general, you are supposed to explicitly mark the
          filehandle with an encoding, see open and perlfunc/binmode.

      conv: 'this is string with � in it'
      unix perlio encoding(utf8) utf8
      clob: 'this is string with µ in it'
      conv: 'this is string with � in it'

      That last conv string is, I assume, your splodge? Perhaps, then, as no question marks are being output, this is not an encoding problem at all?

      I honestly do appreciate all your time

      Joe.

      Eschew obfuscation, espouse eludication!
        if i actually type a micron into the string using Alt-0181 then i get the following output...

        Apparently, your editor is operating in ISO-Latin1 mode and is entering the micron as a single byte (181 decimal = B5 hex).

        You're then telling Perl that this string is UTF-8 (i.e. the decode("utf8",$clob) statement from oshalla's code), which is incorrect. For this reason, the conversion (silently) fails, and the incorrect part (B5 does not start a valid UTF-8 encoding sequence here) is replaced by the Unicode replacement character U+FFFD, which when encoded as UTF-8 produces the three-byte sequence EF BF BD.
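        The silent replacement can be reproduced, and turned into a hard error, with Encode's CHECK argument. This is only a sketch: the variable names are invented, and it assumes the editor really stored the micron as the single ISO-8859-1 byte B5.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $latin1_bytes = "\xB5";   # micro sign as one ISO-8859-1 byte (Alt-0181)

# Default behaviour: the malformed byte is silently replaced with U+FFFD.
my $lenient = decode('UTF-8', $latin1_bytes);
my $replacement_hex =
    join ':', map { sprintf '%02X', $_ } unpack 'C*', encode('UTF-8', $lenient);

# With FB_CROAK the same call dies instead of substituting;
# LEAVE_SRC keeps decode from modifying $latin1_bytes.
my $strict_ok = eval {
    decode('UTF-8', $latin1_bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
my $strict_error = $strict_ok ? '' : $@;

# Decoding with the encoding the editor actually used gives U+00B5.
my $correct = decode('ISO-8859-1', $latin1_bytes);

print "lenient decode, re-encoded: $replacement_hex\n";
print "strict decode failed: $strict_error" unless $strict_ok;
printf "ISO-8859-1 decode: U+%04X\n", ord $correct;
```

        The lenient decode re-encodes to EF:BF:BD, the strict decode dies, and decoding as ISO-8859-1 yields U+00B5, the micro sign.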

        When you interpret/display those three bytes as ISO-Latin1 characters they appear as "ï¿½", i.e. ï = EF, ¿ = BF, ½ = BD. This is how I (and I suppose everyone else, too) see them in your post, because the PM site isn't unicode aware. If your terminal displays those same three characters, this just means it isn't unicode aware either...
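        To see which three ISO-Latin1 characters those bytes turn into, one can decode them the "wrong" way on purpose. A short sketch, with variable names made up for the example:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 encoding of the replacement character U+FFFD.
my $replacement_bytes = encode('UTF-8', "\x{FFFD}");

# Read those bytes as ISO-8859-1 instead: each byte becomes one character.
my $as_latin1 = decode('ISO-8859-1', $replacement_bytes);

# List the resulting code points.
my @code_points = map { sprintf 'U+%04X', ord } split //, $as_latin1;

print "@code_points\n";
```

        This should list U+00EF, U+00BF and U+00BD, the three characters a display that isn't Unicode-aware shows in place of the replacement character.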

        IOW, everything behaves as expected. :)
