http://www.perlmonks.org?node_id=750193


in reply to Re^2: Understanding pack and unpack changes for binary data between 5.8 and 5.10
in thread Understanding pack and unpack changes for binary data between 5.8 and 5.10

The problem shows up when I call length on the return value. Of course I should have used "bytes", but as I said, the return value is a binary string, so returning a length in utf8 characters is strange.
And what's great with this bug is that you only see it when the original string has multi-byte characters or when it is long enough. :)
use Encode qw/_utf8_on/;
use bytes ();    # makes bytes::length available without enabling byte semantics

my $a = "bj\xc3\xb6rk";
_utf8_on($a);
my $binarystring = pack("V/a*", $a);
warn length $binarystring;
warn bytes::length($binarystring);

my $b = "b" x 1000;
_utf8_on($b);
my $binarystring2 = pack("V/a*", $b);
warn length $binarystring2;
warn bytes::length($binarystring2);

Re^4: Understanding pack and unpack changes for binary data between 5.8 and 5.10
by ikegami (Patriarch) on Mar 12, 2009 at 16:59 UTC

    $a is 5 bytes long and pack("V") is 4 bytes long, so $binarystring should hold 9 bytes. length($binarystring) confirms the length, and utf8::downgrade would confirm that they are bytes.

    $b is 1000 bytes long and pack("V") is 4 bytes long, so $binarystring2 should hold 1004 bytes. length($binarystring2) confirms the length, and utf8::downgrade would confirm that they are bytes.
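
    The arithmetic above can be checked directly. This is only a sketch, assuming Perl 5.10 or later; the variable names are illustrative. It packs the five-character string with 'V/a*' and then splits the record back into its 4-byte little-endian length prefix and its payload.

```perl
use strict;
use warnings;

# Five characters, all below 0x100.
my $s = "bj\x{f6}rk";

# V/a* writes the item count as a 32-bit little-endian integer,
# followed by the string itself.
my $packed = pack('V/a*', $s);

# Read the prefix and the remaining payload back out.
my ($count, $payload) = unpack('V a*', $packed);

printf "total=%d prefix=%d payload=%d\n",
    length($packed), $count, length($payload);
# prints "total=9 prefix=5 payload=5"
```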

    And what's great with this bug is that you only see it when the original string has multi-byte characters or when it is long enough. :)

    I don't see the problem. Are you expecting something other than 9 and 1004? Yes, the length of the internal representation is different (as reported by bytes::length), but why are you mucking with the internals?

    Speaking of mucking with internals, utf8::decode should normally be used instead of _utf8_on.
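
    As a rough illustration of that advice (a sketch, assuming the input really is well-formed UTF-8): utf8::decode validates the bytes and converts them to characters in place, and reports failure through its return value, whereas _utf8_on only flips the internal flag without checking anything.

```perl
use strict;
use warnings;

# Six bytes: the UTF-8 encoding of "bj", U+00F6, "rk".
my $bytes = "bj\xc3\xb6rk";

# Decode a copy in place; a false return means malformed input.
my $text = $bytes;
utf8::decode($text)
    or die "input is not well-formed UTF-8";

printf "%d bytes -> %d characters\n", length($bytes), length($text);
printf "third character is U+%04X\n", ord(substr($text, 2, 1));
# prints "6 bytes -> 5 characters" and "third character is U+00F6"
```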

    so returning a length in utf8 characters is strange.

    It's a bit odd, but only because it's a bit inefficient.

      I needed the length of the string to write the string and its length in a binary file.

      I'm only using _utf8_on in this example; in the original code, the string already had its utf8 flag on (it was coming from gtk2, which uses utf8 everywhere), so I was expecting it to be utf8-encoded.

      I understand that my code was ambiguous because it depends on the internal representation. I wrote it a long time ago, when I didn't have much experience in Perl and didn't really know how utf8 was handled.

      But I don't think using a string in pack should result in something that depends on the internal representation of the string: the internal representation should be internal :)

      Honestly, I don't like how utf8 is handled in perl: it tries to do everything automagically, but this makes things less clear.

        I'm only using _utf8_on in this example; in the original code, the string already had its utf8 flag on

        utf8::upgrade and utf8::downgrade are the proper way to convert between internal encodings.
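
        A small sketch of what that means in practice (variable names are illustrative): upgrading changes only how the string is stored, so the two copies still compare equal and have the same length, and only the internal flag reported by utf8::is_utf8 differs.

```perl
use strict;
use warnings;

my $native   = "bj\x{f6}rk";   # stored one byte per character
my $upgraded = $native;
utf8::upgrade($upgraded);      # same characters, UTF-8 storage

printf "equal: %s\n", ($native eq $upgraded ? 'yes' : 'no');
printf "lengths: %d and %d\n", length($native), length($upgraded);
printf "flags: %d and %d\n",
    (utf8::is_utf8($native)   ? 1 : 0),
    (utf8::is_utf8($upgraded) ? 1 : 0);
# prints "equal: yes", "lengths: 5 and 5", "flags: 0 and 1"

# Converting back succeeds because every character fits in one byte.
utf8::downgrade($upgraded);
```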

        But I don't think using a string in pack should result in something that depends on the internal representation of the string: the internal representation should be internal :)

        Exactly. In 5.10.0, exactly the same string is produced no matter what the internal encoding is.

        use strict;
        use warnings;

        use Carp qw( croak );

        sub avoid_utf8_internally {
            my ($s) = @_;
            utf8::downgrade($s, 1)
                or croak("Non-bytes found in input");
            return $s;
        }

        sub use_utf8_internally {
            my ($s) = @_;
            utf8::upgrade($s);
            return $s;
        }

        my $file_num;
        for my $s (
            avoid_utf8_internally("bj\x{f6}rk"),
            use_utf8_internally("bj\x{f6}rk"),
            "b" x 1000,
        ) {
            my $packed = pack("V/a*", $s);
            printf("%s -> %s\n", length($s), length($packed));
            open(my $fh, '>', 'packed'.++$file_num) or die;
            binmode $fh;  # No crlf mucking.
            print $fh $packed;
        }
        >perl script.pl
        5 -> 9
        5 -> 9
        1000 -> 1004

        >debug packed1
        -rcx
        CX 0009
        :
        -d100 l9
        0B14:0100  05 00 00 00 62 6A F6 72-6B                        ....bj.rk
        -q

        >debug packed2
        -rcx
        CX 0009
        :
        -d100 l9
        0B14:0100  05 00 00 00 62 6A F6 72-6B                        ....bj.rk
        -q

        >debug packed3
        -rcx
        CX 03EC
        :
        -d100 3EC
        0B14:0100  E8 03 00 00 62 62 62 62-62 62 62 62 62 62 62 62   ....bbbbbbbbbbbb
        0B14:0110  62 62 62 62 62 62 62 62-62 62 62 62 62 62 62 62   bbbbbbbbbbbbbbbb
        ...
        0B14:04D0  62 62 62 62 62 62 62 62-62 62 62 62 62 62 62 62   bbbbbbbbbbbbbbbb
        0B14:04E0  62 62 62 62 62 62 62 62-62 62 62 62               bbbbbbbbbbbb
        -q

        5.8.8, on the other hand, *is* dependent on the internal encoding.

        >debug packed1
        -rcx
        CX 0009
        :
        -d100 l9
        0B14:0100  05 00 00 00 62 6A F6 72-6B                        ....bj.rk
        -q

        >debug packed2
        -rcx
        CX 0009
        :
        -d100 l9
        0B14:0100  05 00 00 00 62 6A C3 B6-72                        ....bj..r
        -q

        I don't like how utf8 is handled in perl,

        You shouldn't even have to know about the internal encoding. Fixing this is an ongoing process, and that's precisely why pack was changed in 5.10.0. Why are you complaining about such a fix?!

        A proper test:
        use strict;
        use warnings;

        use Test::More tests => 2 * ( 2 + 6 + 6 );

        use Carp qw( croak );

        sub avoid_utf8 {
            my ($s) = @_;
            utf8::downgrade($s, 1)
                or croak("Input not a string of bytes");
            return $s;
        }

        sub use_utf8 {
            my ($s) = @_;
            utf8::upgrade($s);
            return $s;
        }

        diag("Perl $]");

        for (
            [ "bj\x{f6}rk", 5,    'hibit' ],
            [ "b" x 1000,   1000, 'long'  ],
        ) {
            my ($s, $length, $test_name) = @$_;

            # length
            my %length;
            $length{'0'} = length avoid_utf8 $s;
            $length{'1'} = length use_utf8   $s;
            for my $enc (qw( 0 1 )) {
                is($length{$enc}, $length, "length $test_name $enc");
            }

            # pack 'V/a*'
            my $expected = pack('V', $length) . $s;
            my %packed;
            $packed{'?0'} =            pack "V/a*", avoid_utf8 $s;
            $packed{'?1'} =            pack "V/a*", use_utf8   $s;
            $packed{'00'} = avoid_utf8 pack "V/a*", avoid_utf8 $s;
            $packed{'01'} = avoid_utf8 pack "V/a*", use_utf8   $s;
            $packed{'10'} = use_utf8   pack "V/a*", avoid_utf8 $s;
            $packed{'11'} = use_utf8   pack "V/a*", use_utf8   $s;
            for my $enc (qw( ?0 ?1 00 01 10 11 )) {
                ok($packed{$enc} eq $expected, "pack $test_name $enc");
            }

            # print
            for my $enc (qw( ?0 ?1 00 01 10 11 )) {
                my $buf = '';
                {
                    open(my $fh, '>', \$buf);
                    binmode $fh;  # No mucking with crlf
                    print $fh $packed{$enc};
                }
                ok($buf eq $expected, "print $test_name $enc");
            }
        }
        >c:\progs\perl588\bin\perl test.pl
        1..28
        # Perl 5.008008
        ok 1 - length hibit 0
        ok 2 - length hibit 1
        ok 3 - pack hibit ?0
        not ok 4 - pack hibit ?1
        #   Failed test 'pack hibit ?1'
        #   at test.pl line 54.
        ok 5 - pack hibit 00
        not ok 6 - pack hibit 01
        #   Failed test 'pack hibit 01'
        #   at test.pl line 54.
        ok 7 - pack hibit 10
        not ok 8 - pack hibit 11
        #   Failed test 'pack hibit 11'
        #   at test.pl line 54.
        ok 9 - print hibit ?0
        not ok 10 - print hibit ?1
        #   Failed test 'print hibit ?1'
        #   at test.pl line 67.
        ok 11 - print hibit 00
        not ok 12 - print hibit 01
        #   Failed test 'print hibit 01'
        #   at test.pl line 67.
        ok 13 - print hibit 10
        not ok 14 - print hibit 11
        #   Failed test 'print hibit 11'
        #   at test.pl line 67.
        ok 15 - length long 0
        ok 16 - length long 1
        ok 17 - pack long ?0
        ok 18 - pack long ?1
        ok 19 - pack long 00
        ok 20 - pack long 01
        ok 21 - pack long 10
        ok 22 - pack long 11
        ok 23 - print long ?0
        ok 24 - print long ?1
        ok 25 - print long 00
        ok 26 - print long 01
        ok 27 - print long 10
        ok 28 - print long 11
        # Looks like you failed 6 tests of 28.

        >c:\progs\perl5100\bin\perl test.pl
        1..28
        # Perl 5.010000
        ok 1 - length hibit 0
        ok 2 - length hibit 1
        ok 3 - pack hibit ?0
        ok 4 - pack hibit ?1
        ok 5 - pack hibit 00
        ok 6 - pack hibit 01
        ok 7 - pack hibit 10
        ok 8 - pack hibit 11
        ok 9 - print hibit ?0
        ok 10 - print hibit ?1
        ok 11 - print hibit 00
        ok 12 - print hibit 01
        ok 13 - print hibit 10
        ok 14 - print hibit 11
        ok 15 - length long 0
        ok 16 - length long 1
        ok 17 - pack long ?0
        ok 18 - pack long ?1
        ok 19 - pack long 00
        ok 20 - pack long 01
        ok 21 - pack long 10
        ok 22 - pack long 11
        ok 23 - print long ?0
        ok 24 - print long ?1
        ok 25 - print long 00
        ok 26 - print long 01
        ok 27 - print long 10
        ok 28 - print long 11

        Internal encoding surfaces in 5.8.8, but not in 5.10.0 (for the functions tested).
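
        If the goal is a file that holds UTF-8 bytes preceded by their byte count, one way to make that explicit, rather than relying on how the string happens to be stored, is to encode before packing. This is a sketch only, assuming Perl 5.10 or later and that UTF-8 is the wanted file encoding; the variable names are illustrative and not taken from the original code.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

my $text   = "bj\x{f6}rk";              # 5 characters
my $octets = encode('UTF-8', $text);    # 6 bytes: U+00F6 becomes C3 B6

# The length prefix now counts bytes of the encoded form.
my $record = pack('V/a*', $octets);     # 4-byte prefix + 6 bytes

# Reading back: V/a* returns the counted payload, which is then decoded.
my ($payload) = unpack('V/a*', $record);
my $round     = decode('UTF-8', $payload);

printf "%d chars, %d octets, %d record bytes, round trip %s\n",
    length($text), length($octets), length($record),
    ($round eq $text ? 'ok' : 'FAILED');
# prints "5 chars, 6 octets, 10 record bytes, round trip ok"
```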