 
PerlMonks  

Re: Potential bug in chr

by roboticus (Chancellor)
on Feb 05, 2018 at 00:58 UTC


in reply to Potential bug in chr

perlboy_emeritus:

The use utf8; pragma only tells perl that your source code file is encoded in UTF-8. It doesn't tell perl to perform any automatic conversions on input or output.
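A minimal sketch of that point, assuming the script file itself is saved as UTF-8 (the variable names are made up for illustration). The same literal is compiled once without the pragma and once with it, and only the character count of the literal changes:

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8.
# No "use utf8" is in effect here, so perl sees the two bytes of the
# degree sign as two separate characters.
my $bytes_literal = "°";

my $char_literal;
{
    use utf8;               # lexically scoped: literals in this block are decoded
    $char_literal = "°";    # a single character, U+00B0
}

printf "without use utf8: length %d\n", length($bytes_literal);
printf "with use utf8:    length %d\n", length($char_literal);
```

Neither version changes how print behaves; that is governed by the layers on the filehandle.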

The documentation (perldoc -f chr) explicitly states that chr doesn't encode the characters 128..255 (which includes 0xB0) as UTF-8 internally for backward compatibility reasons.

However, if you tell perl to add the :utf8 layer to the output stream, then the 0xB0 character will be encoded on output as you want:

$ perl -e 'binmode STDOUT,":utf8"; print chr(0xb0),"\n"'
°
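To see what an encoding layer would put on the wire for that character, here is a rough sketch using the core Encode module instead of a filehandle layer. It assumes the layer produces the same octets as encode('UTF-8', ...), which should hold for ordinary characters like this one:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $degree = chr(0xB0);                  # one character, U+00B0
my $octets = encode('UTF-8', $degree);   # the bytes a UTF-8 layer would emit

printf "characters: %d\n", length($degree);
printf "octets:     %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
```

The single character becomes the two octets C2 B0, which is what a UTF-8 terminal needs in order to draw the degree sign.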

Update: I'm not really all that comfortable with Unicode stuff, so reaching for Devel::Peek, I fabricobbled this little thing together:

$ cat pm1208450.pl
use strict;
use warnings;
use Devel::Peek;

my $a = chr(0xb0);
my $b = chr(0x2032);
Dump($a);
Dump($b);

# Combining a byte string and a unicode string converts to unicode
my $c = $a . $b;
Dump($c);

$ perl pm1208450.pl
SV = PV(0x60002c270) at 0x600079168
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x600069e70 "\260"\0
  CUR = 1
  LEN = 10
SV = PV(0x60002c310) at 0x600079048
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x60008f670 "\342\200\262"\0 [UTF8 "\x{2032}"]
  CUR = 3
  LEN = 10
SV = PV(0x60002c340) at 0x6000ed1c8
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x600093a10 "\302\260\342\200\262"\0 [UTF8 "\x{b0}\x{2032}"]
  CUR = 5
  LEN = 10

This shows that if you happen to join a byte-oriented string with a unicode string in perl, the result will be a unicode string.
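The same observation can be sketched without Devel::Peek by asking utf8::is_utf8 about the internal flag. The variable names below are my own, and the flag is an implementation detail rather than something code should normally branch on:

```perl
use strict;
use warnings;

my $byte_str = chr(0xB0);      # stored as the single byte 0xB0
my $wide_str = chr(0x2032);    # stored internally as UTF-8
my $joined   = $byte_str . $wide_str;

printf "byte_str flagged UTF8: %s\n", utf8::is_utf8($byte_str) ? 'yes' : 'no';
printf "wide_str flagged UTF8: %s\n", utf8::is_utf8($wide_str) ? 'yes' : 'no';
printf "joined   flagged UTF8: %s\n", utf8::is_utf8($joined)   ? 'yes' : 'no';

# The internal upgrade changes the storage, not the characters.
printf "joined has %d characters, first is U+%04X\n",
    length($joined), ord($joined);

my $upgraded = $byte_str;
utf8::upgrade($upgraded);
print "upgraded copy still equals the original\n" if $upgraded eq $byte_str;
```

The joined string still has two characters, and an upgraded copy of the byte string compares equal to the original.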

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Replies are listed 'Best First'.
Re^2: Potential bug in chr
by perlboy_emeritus (Scribe) on Feb 05, 2018 at 01:25 UTC
    Thanks. Got it. I'll include that binmode expression whenever I'm working with external UTF-8 data and need an accurate debugging environment. I already know to assert "<:encoding(UTF-8)" on input file handles but overlooked STDOUT.
      use open ':std', ':encoding(UTF-8)';
      makes far more sense than
      binmode STDOUT, ':utf8';

      The use open line applies binmode to STDIN, STDOUT and STDERR (with the safer :encoding(UTF-8) layer). It also sets the default layers for every open in its lexical scope, which makes an explicit :encoding(UTF-8) in those open calls redundant.
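A small sketch of the default-layer effect, using a temporary file from File::Temp. The file and the variable names are only for illustration. The open inside the block names no layer, yet the character is encoded on the way out:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the experiment; removed automatically at exit.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh or die "close failed: $!";

{
    # Default layers for every open() in this lexical scope;
    # ':std' also applies the layer to STDIN, STDOUT and STDERR.
    use open ':std', ':encoding(UTF-8)';

    open my $out, '>', $path or die "cannot write $path: $!";
    print {$out} chr(0xB0), "\n";
    close $out or die "close failed: $!";
}

# Read the file back with an explicit :raw layer to see the bytes on disk.
open my $in, '<:raw', $path or die "cannot read $path: $!";
my $raw = do { local $/; <$in> };
close $in or die "close failed: $!";

printf "bytes on disk: %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, $raw;
```

The bytes on disk should be C2 B0 0A, even though the open call itself never mentions an encoding.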


      "This shows that if you happen to join a byte-oriented string with a unicode string in perl, the result will be a unicode string."

      Which is irrelevant to the question at hand.

      The first print worked because the string contained non-bytes (chars outside of 0..255), which can't be printed without encoding. perl guessed that you meant to encode them using UTF-8, and warned you about that guess ("Wide character in...").

      perl had no way of knowing the second print was wrong because the string only contained bytes (chars in 0..255), so it printed the string unaltered.
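Both cases can be reproduced with a rough sketch that prints to an in-memory filehandle with no encoding layer and records any warnings. print_raw is a made-up helper name, and the sketch assumes perl is run without -C or PERL_UNICODE:

```perl
use strict;
use warnings;

# Print a string to a layer-less in-memory handle and report the bytes
# written plus the number of warnings raised. (print_raw is a made-up name.)
sub print_raw {
    my ($string) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };

    my $buffer = '';
    open my $fh, '>', \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "close failed: $!";

    my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
    return ($hex, scalar @warnings);
}

my ($bytes_only, $warned_bytes) = print_raw(chr(0xB0));
my ($with_wide,  $warned_wide)  = print_raw(chr(0xB0) . chr(0x2032));

print "chr(0xB0) alone:         $bytes_only ($warned_bytes warning(s))\n";
print "chr(0xB0) . chr(0x2032): $with_wide ($warned_wide warning(s))\n";
```

With only chr(0xB0), the single byte B0 goes out silently. With the wide character appended, perl writes the UTF-8 form of the whole string, C2 B0 included, and warns.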
