Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

As you say, the strings are apparently identical, except that one is a "byte" string while the other is "utf8". Note that in both cases the strings contain the UTF-8 form of micron, this is significant as we will see...

What you are seeing when you print to STDOUT takes a little explaining...

By default STDOUT will have no encoding associated with it, so Perl will assume that it is LATIN1 (or ISO-8859-1).

When you print the "byte" string, Perl sends the bytes, untouched, to STDOUT -- because Perl treats "byte" strings as if they were LATIN1. The two bytes that make up the UTF-8 for micron are passed all the way to the screen. The screen understands UTF-8, so presto! you see the micron character.

When you print the "utf8" string, however, Perl knows that it should convert the string to LATIN1. So the two byte UTF-8 sequence 0xC2:0xB5 is converted to the LATIN1 equivalent 0xB5 (!). That is passed all the way to the screen. BUT, since the screen actually understands UTF-8, the lone 0xB5 byte is nonsense to it, so it shows some error character -- in your case, apparently '?', on my screen, something I will describe as a splodge.

You can tell STDOUT that it's a UTF-8 file-handle using binmode, so:

use strict ; use warnings ; use PerlIO ; use Encode; my $clob = "this is string with \x{C2}\x{B5} in it"; my $convertedstr = decode("utf8",$clob); print "clob: " ; bytes($clob) ; print "conv: " ; bytes($convertedstr) ; my @layers = PerlIO::get_layers(STDOUT) ; print "@layers\n" ; print "clob: '$clob'\n" ; print "conv: '$convertedstr'\n"; binmode(STDOUT, ":encoding(UTF-8)") ; @layers = PerlIO::get_layers(STDOUT) ; print "@layers\n" ; print "clob: '$clob'\n" ; print "conv: '$convertedstr'\n"; sub bytes { my ($s) = @_ ; my $w = utf8::is_utf8($s) ? "utf8" : "byte" ; use bytes ; print join(":", map(sprintf("%02X", $_), unpack('C*', $s))), " -- $w +\n" ; } ;
where the PerlIO::get_layers is returning information about how the file-handle is configured. This produces:
clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte
conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8
unix perlio
clob: 'this is string with  in it'
conv: 'this is string with ▒ in it'
unix perlio encoding(utf-8-strict) utf8
clob: 'this is string with µ in it'
conv: 'this is string with  in it'
So now you're asking yourself, where the MUMBLE did the 'µ' come from. Well... $clob is a byte string, which as far as Perl is concerned contains two LATIN1 characters, 0xC2 and 0xB5. Now that it knows that STDOUT is UTF-8, it spots the 0xC2 and encodes it as its UTF-8 equivalent 0xC3:0x82, and it spots the 0xB5 and encodes it as 0xC2:0xB5. And yes, UTF-8 0xC3:0x82 is ''.

The message is that you have to be consistent:

  • you can operate with byte strings that contain UTF-8 sequences, and provided you leave your file handles with no explicit encoding, those UTF-8 sequences will pass through untouched. Which is fine if the target device expects UTF-8 sequences.

    But, of course, those UTF-8 sequences will look like two (or more) LATIN1 characters if you process the strings.

  • you can operate with utf8 strings that contain "wide characters" (held internally as UTF-8 sequences, as it happens), and provided you set your file handles to :encoding(UTF-8) those wide characters will be encoded/decoded as they are output/input.

    You can also operate with byte strings that contain LATIN1 characters, and file handles set to :encoding(UTF-8) will encoded characters as they are output.

    Or you can leave you file handles with no explicit encoding, and encode/decode strings explicitly before output and after input.

But if you try mixing the two, confusion will reign.

See PerlIO::encoding, binmode, open and use open for more on encodings and file-handles, and perluniintro for more on Perl and Unicode.


In reply to Re^3: Character encoding of microns by oshalla
in thread Character encoding of microns by joec_

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others romping around the Monastery: (7)
    As of 2014-07-11 10:39 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      When choosing user names for websites, I prefer to use:








      Results (224 votes), past polls