http://www.perlmonks.org?node_id=749882


in reply to Understanding pack and unpack changes for binary data between 5.8 and 5.10

I think you either have to 'use bytes', or make sure you don't use variables that have their utf8 flag set.

I've been bitten by one of the changes in perl 5.10:
pack('V/a*',$a) returns a value with the utf8 flag set if $a has it, unless you "use bytes". It didn't do that in perl 5.8. Am I the only one who finds this new behavior very strange? The value returned by pack('V/a*',$a) is binary; interpreting it as utf8 makes no sense :(
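One way to sidestep the flag question is to encode the text to octets explicitly before packing, so that pack never sees a utf8-flagged argument. This is only a minimal sketch: it assumes the payload is meant to travel as UTF-8, which the post does not state, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $text   = "bj\x{f6}rk";             # 5 characters, one of them non-ASCII
my $octets = encode('UTF-8', $text);   # 6 octets, utf8 flag off (assumed wire encoding)
my $packed = pack('V/a*', $octets);    # 4-byte little-endian count followed by the payload

# Report the length of the packed string and whether it carries the utf8 flag.
printf "%d %s\n", length($packed), utf8::is_utf8($packed) ? 'flagged' : 'unflagged';
```

With the encoding step made explicit, length() on the packed value counts octets, because the string holds nothing but octets.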

Re^2: Understanding pack and unpack changes for binary data between 5.8 and 5.10
by almut (Canon) on Mar 11, 2009 at 15:31 UTC
    ...It didn't do that in perl 5.8

    Another difference to be aware of is this:

    my $s = "\x{1234}\x{5678}";  # string with utf8 flag on
    print unpack("H*", $s), "\n";

    With 5.8 this prints a hexdump of the internal (UTF-8) representation of the string — e.g. useful when debugging encoding issues

    e188b4e599b8

    while with 5.10, you'd get

    3478

    i.e. the low-byte values of the codepoints, with the high-byte part being truncated. With warnings enabled, you also get "Character in 'H' format wrapped in unpack at...".

    With use bytes, or when explicitly turning off the utf8 flag (update: as shown below), you get the old behaviour.  And specifically for debugging encoding issues, Devel::Peek is the recommended alternative since 5.10, because of this difference.
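For debugging, a hex dump of the UTF-8 encoding can also be obtained without touching the flag, by encoding a copy of the string. The sketch below is an illustration rather than part of the original post: `utf8_hex` is a hypothetical helper name, and the Devel::Peek call shows the alternative mentioned above. For a string that is stored downgraded, the dump shows its UTF-8 encoding, not the internal buffer.

```perl
use strict;
use warnings;
use Devel::Peek qw( Dump );

# Hypothetical helper: hex dump of the UTF-8 encoding of a string.
sub utf8_hex {
    my ($copy) = @_;        # work on a copy; the caller's string is untouched
    utf8::encode($copy);    # characters -> UTF-8 octets, in place
    return unpack('H*', $copy);
}

my $s = "\x{1234}\x{5678}";
print utf8_hex($s), "\n";   # e188b4e599b8, the same dump 5.8 produced
Dump($s);                   # flags and raw buffer, written to STDERR
```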

      with 5.10, you'd get [...] the low-byte values of the codepoints, with the high-byte part being truncated. With warnings enabled, you also get "Character in 'H' format wrapped in unpack at...".

      It's odd that it doesn't warn or croak with "Wide character in ...".

      If you want to dump the internal buffer,

      use Encode qw( _utf8_off );

      sub internal {
          _utf8_off( my $s = shift );
          return $s;
      }

      my $s = "\x{1234}\x{5678}";  # string with utf8 flag on
      print unpack("H*", internal($s)), "\n";

      Update: Fixed error identified in reply.

        utf8::_utf8_off( my $s = shift );

        I think you meant Encode::_utf8_off(...).

      I don't see the problem.

      use strict;
      use warnings;
      use Data::Dumper qw( Dumper );
      $Data::Dumper::Useqq = 1;
      $Data::Dumper::Terse = 1;
      $Data::Dumper::Indent = 0;

      my $s = chr(0xC9);
      utf8::downgrade($s);
      print(Dumper(unpack('H*', $s)), "\n");
      utf8::upgrade($s);
      print(Dumper(unpack('H*', $s)), "\n");
      print(Dumper(unpack('H*', "\x{C9}\x{2660}")), "\n");

      5.10.0:

      "c9"    # Ok
      "c9"    # Ok
      Character in 'H' format wrapped in unpack at 750077.pl line 16.
      "c960"  # GIGO

      The internal representation is and should be irrelevant.

      If you want to see the internal representation, it stands to reason that you should have to explicitly fetch it.

        I don't see the problem...

        I don't see a problem either.  I just pointed out a difference, i.e. that something which people might have gotten used to, no longer behaves the way it did before...

Re^2: Understanding pack and unpack changes for binary data between 5.8 and 5.10
by ikegami (Patriarch) on Mar 12, 2009 at 04:28 UTC

    It's a bit strange, but the internal representation of the string shouldn't* matter.

    What I do find very strange is that it doesn't croak when passed non-bytes.

    use strict;
    use warnings;
    use Data::Dumper qw( Dumper );
    $Data::Dumper::Useqq = 1;
    $Data::Dumper::Terse = 1;
    $Data::Dumper::Indent = 0;

    my $s = chr(0xC9);
    utf8::downgrade($s);
    print(Dumper(pack('V/a*', $s)), "\n");
    utf8::upgrade($s);
    print(Dumper(pack('V/a*', $s)), "\n");
    print(Dumper(pack('V/a*', "\x{C9}\x{2660}")), "\n");

    5.10.0:

    "\1\0\0\0\311"             # Ok
    "\1\0\0\0\x{c9}"           # Ok
    "\2\0\0\0\x{c9}\x{2660}"   # Does this make sense???

    On the other hand, 5.8.8 was very broken:

    "\1\0\0\0\311"       # Ok
    "\1\0\0\0\303"       # XXX
    "\2\0\0\0\303\242"   # XXX

    * I realize it matters all too often, but that's getting fixed. In places where it does matter, you can use utf8::upgrade and utf8::downgrade to control the internal format.
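To illustrate the footnote, the sketch below shows that utf8::upgrade and utf8::downgrade change only how a string is stored, not what it contains. The variable names are illustrative and not taken from the thread.

```perl
use strict;
use warnings;

my $down = chr(0xC9);
utf8::downgrade($down);     # stored as the single byte 0xC9

my $up = $down;
utf8::upgrade($up);         # stored as the two UTF-8 bytes 0xC3 0x89

# The flag differs, but the strings compare equal and have the same length.
print utf8::is_utf8($down) ? 1 : 0, utf8::is_utf8($up) ? 1 : 0, "\n";
print $down eq $up ? "equal\n" : "different\n";
print length($down), length($up), "\n";
```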
      The problem shows up when I take the length of the return value. Of course I should have used "bytes", but as I said, the return value is a binary string, so returning a length in utf8 characters is strange.
      And what's great with this bug, is that you only see it when the original string has multi-bytes characters or when it is long enough. :)
      use Encode qw/_utf8_on/;
      use bytes ();  # make bytes::length available without enabling byte semantics

      my $a = "bj\xc3\xb6rk";
      _utf8_on($a);
      my $binarystring = pack("V/a*", $a);
      warn length $binarystring;
      warn bytes::length $binarystring;

      my $b = "b" x 1000;
      _utf8_on($b);
      my $binarystring2 = pack("V/a*", $b);
      warn length $binarystring2;
      warn bytes::length $binarystring2;

        $a is 5 bytes long and pack("V") is 4 bytes long, so $binarystring should hold 9 bytes. length($binarystring) confirms the length, and utf8::downgrade would confirm that they are bytes.

        $b is 1000 bytes long and pack("V") is 4 bytes long, so $binarystring2 should hold 1004 bytes. length($binarystring2) confirms the length, and utf8::downgrade would confirm that they are bytes.

        And what's great with this bug, is that you only see it when the original string has multi-bytes characters or when it is long enough. :)

        I don't see the problem. Are you expecting something other than 9 and 1004? Yes, the length of the internal representation is different (as reported by bytes::length), but why are you mucking with the internals?
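The check described above can be sketched as follows, assuming perl 5.10 or later. The setup mirrors the earlier post, including its use of _utf8_on, and the variable names are illustrative. The second argument to utf8::downgrade makes it return false instead of dying when a character does not fit in a byte.

```perl
use strict;
use warnings;
use Encode qw( _utf8_on );

my $name = "bj\xc3\xb6rk";
_utf8_on($name);                              # as in the post above: now 5 characters
my $packed = pack('V/a*', $name);

my $chars    = length($packed);               # 4-byte count plus 5 characters
my $all_byte = utf8::downgrade($packed, 1);   # true if every character fits in a byte

print "$chars characters, downgrade ", ($all_byte ? "succeeded" : "failed"), "\n";
```

Once the downgrade has succeeded, the stored size and the character count are the same number.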

        Speaking of mucking with internals, utf8::decode should normally be used instead of _utf8_on.
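A short sketch of why utf8::decode is the safer choice: it validates the octets and reports failure, whereas _utf8_on only flips the flag. The sample strings are illustrative.

```perl
use strict;
use warnings;

my $good = "bj\xc3\xb6rk";             # valid UTF-8 octets
my $bad  = "bj\xf6rk";                 # Latin-1 octets, not valid UTF-8

my $good_ok = utf8::decode($good);     # decodes in place and returns true
my $bad_ok  = utf8::decode($bad);      # returns false and leaves $bad as it was

print length($good), " ", ($good_ok ? "ok" : "failed"), "\n";
print length($bad),  " ", ($bad_ok  ? "ok" : "failed"), "\n";
```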

        so returning a length in utf8 characters is strange.

        It's a bit odd, but only because it's a bit inefficient.