Re^2: Character in 'b' format wrapped in unpack

by BrowserUk (Pope)
on Mar 29, 2015 at 06:06 UTC (#1121698)


in reply to Re: Character in 'b' format wrapped in unpack
in thread Character in 'b' format wrapped in unpack

when you pass a numeric value greater than 255 to chr, it must return a wide character.

There is no "must" about it. It should be the case that unless I specifically ask for Unicrap, characters should be assumed to be 8-bits.

I'm afraid I don't quite understand the reason(s) for what happens when the "use bytes" pragma is added -- if I've done it right, the only difference is to eliminate the warning message about the "wrapped character in unpack"

You're right. It does just enough to lull you into a false sense of security; then sneaks around behind and kicks you in the nuts!

Rant:

People complain about the effects that the inclusion of threads has on code for those that don't use them -- a modest increase in executable size and a few single digits of outright performance in tests deliberately designed to show it -- but the deleterious effects of the inclusion of Unicrap are far more pervasive and damaging.

Not only does it bloat the source code and executable, and hit the performance of just about every operation even when you're not using it; it subtly (and often silently) changes the semantics of code that isn't even text processing, let alone Unicrap processing.

The use utf8 pragma was the right way to go. Without it, byte semantics; with it, you made your own bed, so lie in it.

But then some bright spark came along and decided he could make it transparent; and now we're all f*****!

Is it the case that you got the particular pattern of zeros and ones you expected, and were just complaining about the warning message?

No. I wanted the shift to discard the high bit, as it does with integers:

    $n <<= 1; print unpack 'B*', pack 'N', $n;;
    10101011010101001010101101010100
    $n <<= 1; print unpack 'B*', pack 'N', $n;;
    01010110101010010101011010101000
    $n <<= 1; print unpack 'B*', pack 'N', $n;;
    10101101010100101010110101010000

Unfortunately, Unicrap (and Perl's implementation of Unicrap) conspire such that you can no longer rely upon simple byte semantics.

The idea that a string (a good old array of bytes) can suddenly contain a random Unicrap character in a program that doesn't (and doesn't want to) use any Unicrap, is a farce!
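
For readers following along, here is a minimal sketch of the behaviour under discussion, assuming a single-element string holding 0xAA. Shifting its ordinal left gives 0x154, which chr turns into a wide character; masking with 0xFF first keeps the result within one byte. The helper name shl_byte is hypothetical and not from the thread.

```perl
use strict;
use warnings;

# shl_byte is a hypothetical helper (not from the thread): shift one
# byte left by one bit and explicitly discard anything above bit 7.
sub shl_byte {
    my ($byte) = @_;
    return chr( ( ord($byte) << 1 ) & 0xFF );
}

my $x      = "\xAA";
my $wide   = chr( ord($x) << 1 );   # code point 0x154: no longer fits in a byte
my $narrow = shl_byte($x);          # code point 0x54: high bit discarded

printf "%#x %#x\n", ord($wide), ord($narrow);   # 0x154 0x54
print unpack( 'B*', $narrow ), "\n";            # 01010100
```

The mask is the only difference between the two results; without it, chr returns whatever code point the arithmetic produced.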


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Replies are listed 'Best First'.
Re^3: Character in 'b' format wrapped in unpack
by graff (Chancellor) on Mar 29, 2015 at 20:08 UTC
    With all due respect for your justifiable anger, I'm sorry to disagree; "chr()" is - and rightly should be - intended to serve the (dominant) linguistic sense of "character" (what the perl docs call "character semantics"), rather than the strictly-typed, C-centric sense of "char" (what the perl docs call "octets" or "byte semantics").

    In other words, when you want to do low-level, C-like bit twiddling, just use pack and unpack - that's what those are for - and give up on pretending that higher-level, linguistically oriented functions (chr and ord) can do the same thing.
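
As one possible illustration of the pack/unpack route suggested above, the sketch below shifts a whole byte string left by one bit by going through its bit-string form. It assumes every element of the input is in the range 0..255 (otherwise unpack 'B*' emits the "wrapped" warning this thread is about). The helper name shl_bits is hypothetical.

```perl
use strict;
use warnings;

# shl_bits is a hypothetical helper: shift a byte string left by one
# bit, dropping the topmost bit and filling with a zero on the right.
sub shl_bits {
    my ($bytes) = @_;
    return $bytes unless length $bytes;        # nothing to shift
    my $bits = unpack 'B*', $bytes;            # e.g. "1010101010101010"
    return pack 'B*', substr( $bits, 1 ) . '0';
}

my $shifted = shl_bits("\xAA\xAA");
print unpack( 'B*', $shifted ), "\n";          # 0101010101010100
```

Because the result is rebuilt with pack 'B*', it always has the same number of bytes as the input and never contains an element above 255.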

    I agree how sad it is that every user must pay the performance cost of unicode support, whether or not they actually need or use it. But then, it's also sad that every script must pay the overhead for untyped variables, no matter how much of that flexibility is actually needed or used.

    UPDATE: Having said that, I realize I'm probably still deficient in my understanding of your particular example. You said you "wanted the shift to discard the high bit, as it does with integers", and if I'm not mistaken (am I?), that's actually what happens, with or without the "use bytes" pragma (i.e. with or without the warning). Here's a simpler example - am I missing something?

        #use bytes;
        $x = "\xAA";
        print unpack 'B*', $x;
        print " --> ";
        $x = chr( ord( $x ) << 1 );
        print unpack 'B*', $x;
        print "\n---\n";
        $x = pack( "C*", 0xAA );
        print unpack 'B*', $x;
        print " --> ";
        $x = pack( "C", unpack( "C", $x ) << 1 );
        print unpack 'B*', $x;
        print "\n---\n";

    When I run that, I get:

        Character in 'B' format wrapped in unpack at /tmp/j1.pl line 7.
        10101010 --> 01010100
        ---
        Character in 'C' format wrapped in pack at /tmp/j1.pl line 14.
        10101010 --> 01010100
        ---

    (Note the warning from using the "C" format on pack.) Looks to me like the high bit got shifted off in both cases - no difference. When I uncomment the "use bytes", the only difference I see is that the "B" format warning goes away (but the "C" format warning still shows up.) Is there a problem I'm not seeing?

    In case it matters, I'm using perl 5.18 on macosx 10.10.2 ("Yosemite").

      I'm sorry to disagree

      I guess we'll have to agree to differ; but the fact that Perl allows me to replace an (8-bit) character, in the middle of a string of 8-bit characters, with some (random*) wide character is just broken.

      "chr()" is - and rightly should be - intended to serve the (dominant) linguistic sense of "character" (what the perl docs call "character semantics")

      To what possible end?

      When you do my $thing = chr( 12345 ); what does that "character" represent?

      Is it a Chinese character? Or Sanskrit? Or Cyrillic?

      Is it utf-8; utf16; utf32?

      Is it big-endian or little-endian?

      What if I append another character to it: $thing .= chr( $i );. What does the string contain now? Can Perl ever decide what encoding $thing contains?

      And the answer to all of those questions is: it is impossible to ever know. Thus, chr's ability to construct wide characters is entirely useless.

      So, you break with clearly defined semantics for undefined and undefinable semantics, for what purpose?


        When you do my $thing = chr( 12345 ); what does that "character" represent?

        Is it a Chinese character? Or Sanskrit? Or Cyrillic?

        Is it utf-8; utf16; utf32?

        Is it big-endian or little-endian?

        It's Unicode. It's HANGZHOU NUMERAL TWENTY, in fact. UTF-8 and UTF-16 both represent Unicode code points, but encode them differently.
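
A short sketch of that distinction, using the core Encode and charnames modules: chr(12345) yields a single code point (U+3039), and the byte sequence only appears once an encoding is chosen. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);
use charnames ();

my $thing = chr(12345);                          # one element, code point U+3039
my $name  = charnames::viacode( ord $thing );    # look up the Unicode name

# The same code point, serialised under three different encodings.
my $utf8    = encode( 'UTF-8',    $thing );
my $utf16be = encode( 'UTF-16BE', $thing );
my $utf16le = encode( 'UTF-16LE', $thing );

# Render a byte string as space-separated hex pairs.
sub hex_bytes { join ' ', map { sprintf '%02X', $_ } unpack 'C*', $_[0] }

print "$name\n";                                 # HANGZHOU NUMERAL TWENTY
print hex_bytes($utf8),    "\n";                 # E3 80 B9
print hex_bytes($utf16be), "\n";                 # 30 39
print hex_bytes($utf16le), "\n";                 # 39 30
```

Encoding and byte order are therefore properties of the serialised bytes, not of the string element that chr returns.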

        When you concatenate a different string to it, the result might depend on the version of Perl. See unicode_strings.
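
As a hedged illustration of the concatenation case, the sketch below appends chr(12345) to a string holding the single element 0xE9 and inspects the result. It only shows that the element values (code points) survive the concatenation; how operations such as case mapping or regex matching treat elements in the range 128..255 is what the unicode_strings feature governs, and that is not demonstrated here.

```perl
use strict;
use warnings;

my $bytes = "\xE9";                  # one element with value 0xE9
my $thing = $bytes . chr(12345);     # append a wide character

# The string now has two elements; each keeps its numeric value.
my $len    = length($thing);
my $first  = ord( substr( $thing, 0, 1 ) );
my $second = ord( substr( $thing, 1, 1 ) );

printf "length=%d first=%#x second=%d\n", $len, $first, $second;
# length=2 first=0xe9 second=12345
```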

        لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        "Character" just means "string element". In C, they are usually 8 bits wide (though 9 bits and other widths are possible). In Perl, they are far bigger. In both languages, they are numbers devoid of intrinsic meaning. They can be all of the things you specified, or something completely different.
