PerlMonks  

Re^4: Character in 'b' format wrapped in unpack

by BrowserUk (Pope)
on Mar 29, 2015 at 20:49 UTC


in reply to Re^3: Character in 'b' format wrapped in unpack
in thread Character in 'b' format wrapped in unpack

I'm sorry to disagree

I guess we'll have to agree to differ; but the fact that Perl allows me to replace an (8-bit) character, in the middle of a string of 8-bit characters, with some (random*) wide character is just broken.

"chr()" is - and rightly should be - intended to serve the (dominant) linguistic sense of "character" (what the perl docs call "character semantics")

To what possible end?

When you do my $thing = chr( 12345 ); what does that "character" represent?

Is it a Chinese character? Or Sanskrit? Or Cyrillic?

Is it utf-8; utf16; utf32?

Is it big-endian or little-endian?

What if I append another character to it: $thing .= chr( $i );. What does the string contain now? Can Perl ever decide what encoding $thing contains?

And the answer to all of those questions is: it is impossible to ever know. Thus, chr's ability to construct wide characters is entirely useless.

So, you break with clearly defined semantics for undefined and undefinable semantics, for what purpose?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Replies are listed 'Best First'.
Re^5: Character in 'b' format wrapped in unpack
by choroba (Bishop) on Mar 29, 2015 at 22:23 UTC
    When you do my $thing = chr( 12345 ); what does that "character" represent?

    Is it a Chinese character? Or Sanskrit? Or Cyrillic?

    Is it utf-8; utf16; utf32?

    Is it big-endian or little-endian?

    It's Unicode. It's HANGZHOU NUMERAL TWENTY (U+3039), in fact. UTF-8 and UTF-16 both represent Unicode code points, but encode them differently.
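To make the code point versus encoding distinction concrete, here is a minimal sketch using the core charnames and Encode modules. The variable names are illustrative only; the point is that the character is a single number, and the byte layout is chosen only when the string is encoded.

```perl
use strict;
use warnings;
use charnames ();            # provides charnames::viacode
use Encode qw(encode);

# chr() builds a one-character string from a code point number.
my $thing = chr(12345);      # 12345 == 0x3039

# Look up the Unicode name assigned to that code point.
my $name = charnames::viacode(12345);

# Serialise the same character under three different encodings
# and record the resulting bytes as hex.
my %bytes = map { $_ => unpack('H*', encode($_, $thing)) }
            qw(UTF-8 UTF-16BE UTF-16LE);

printf "U+%04X %s\n", ord($thing), $name;    # U+3039 HANGZHOU NUMERAL TWENTY
printf "%-8s %s\n", $_, $bytes{$_} for sort keys %bytes;
# UTF-16BE 3039
# UTF-16LE 3930
# UTF-8    e380b9
```

Endianness and encoding width only appear in the output of encode(); the string held in $thing has neither.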

    When you concatenate a different string to it, the result might depend on the version of Perl. See unicode_strings.
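As a rough illustration of what concatenation does to the character values, the sketch below appends a wide character to a string that so far held only an 8-bit character. It assumes a plain script with no `use bytes`, locale, or feature pragmas in effect, and it does not exercise the version-dependent behaviour that unicode_strings controls; it only shows that the element values survive while the internal storage changes.

```perl
use strict;
use warnings;

# One character whose value (233) fits in a single byte.
my $thing = chr(0xE9);
my $flag_before = utf8::is_utf8($thing) ? 1 : 0;

# Append a character whose value is above 255.
$thing .= chr(12345);
my $flag_after = utf8::is_utf8($thing) ? 1 : 0;

# Read every element back as a number.
my @ords = map { ord } split //, $thing;

print "length: ", length($thing), "\n";                      # length: 2
print "ords: @ords\n";                                       # ords: 233 12345
print "internal UTF8 flag: $flag_before -> $flag_after\n";   # 0 -> 1
```

The flag reported by utf8::is_utf8 describes how Perl stores the string internally, not which encoding the data will have when it is written out.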

    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      It's Unicode.

      Great! Then this must be unicode also:

      perl -MDevel::Peek -E"$x = chr(129).chr(130).chr(42).chr(131).chr(132); Dump($x); substr( $x, 2, 1 ) = chr(~0); Dump($x); print $x" | od -t x1
      SV = PV(0xbadc0) at 0x2c5aa8
        REFCNT = 1
        FLAGS = (POK,pPOK)
        PV = 0xb67a8 "\201\202*\203\204"\0
        CUR = 5
        LEN = 8
      SV = PVMG(0x2b1078) at 0x2c5aa8
        REFCNT = 2
        FLAGS = (SMG,POK,pPOK,UTF8)
        IV = 0
        NV = 0
        PV = 0x2b3008 "\302\201\302\202\377\200\217\277\277\277\277\277\277\277\277\277\277\302\203\302\204"\0 [UTF8 "\x{81}\x{82}\x{ffffffffffffffff}\x{83}\x{84}"]
        CUR = 21
        LEN = 24
        MAGIC = 0x3177f8
          MG_VIRTUAL = &PL_vtbl_utf8
          MG_TYPE = PERL_MAGIC_utf8(w)
          MG_LEN = -1
      Wide character in print at -e line 1.
      0000000 c2 81 c2 82 ff 80 8f bf bf bf bf bf bf bf bf bf
      0000020 bf c2 83 c2 84
      0000025

      Broken!


        Yes, that code is indeed garbage. Either 0xffffffffffffffff shouldn't have been added to the string, or the string shouldn't have been passed to print. Without an encoding layer, print expects every character to be a byte (0..255).
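A minimal sketch of the alternative, assuming UTF-8 is the encoding wanted on output: the handle is opened with an explicit :encoding(UTF-8) layer, so print is given a rule for turning characters above 255 into bytes. An in-memory file stands in for STDOUT here so the written bytes can be inspected; the variable names are illustrative.

```perl
use strict;
use warnings;

# A string containing one character above 255.
my $text = "x" . chr(12345);

# Open an in-memory file with an explicit encoding layer.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
print {$out} $text;          # no "Wide character" warning: the layer encodes
close $out or die "close failed: $!";

# Inspect the bytes that actually reached the "file".
my $hex = unpack 'H*', $buffer;
print "bytes written: $hex\n";   # bytes written: 78e380b9
```

For a real script the same effect is usually obtained with binmode(STDOUT, ':encoding(UTF-8)') before printing.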
Re^5: Character in 'b' format wrapped in unpack
by ikegami (Pope) on Mar 29, 2015 at 23:07 UTC

    "Character" just means "string element". In C, they are usually 8 bits wide (but 9 bits and other widths are possible). In Perl, they are far bigger. In both languages, they are numbers devoid of intrinsic meaning. They can be all of the things you specified, or something completely different.
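The "string element is just a number" view can be sketched with pack and unpack. This assumes Perl 5.10 or later, where the 'W' template (an unsigned character value, which may exceed 255) is available; the numbers are arbitrary.

```perl
use strict;
use warnings;

# Three arbitrary numbers; nothing here says which script or
# encoding, if any, they are meant to belong to.
my @numbers = (65, 233, 12345);

# pack 'W*' stores each number as one string element.
my $string = pack 'W*', @numbers;

# unpack 'W*' reads the elements back as plain unsigned numbers.
my @round_trip = unpack 'W*', $string;

print "elements: ", length($string), "\n";   # elements: 3
print "values: @round_trip\n";               # values: 65 233 12345
```

Whether 12345 means a Hangzhou numeral, a pixel value, or a record ID is decided by the code that consumes the string, not by the string itself.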

      but 9 bits and other widths are possible

      Oh yeah! Then why didn't they just create a 20.09-bit character type to hold the 1,114,112 possible code points?


