PerlMonks  

Re^5: Understanding pack and unpack changes for binary data between 5.8 and 5.10

by squentin (Sexton)
on Mar 13, 2009 at 13:59 UTC ( #750418=note )


in reply to Re^4: Understanding pack and unpack changes for binary data between 5.8 and 5.10
in thread Understanding pack and unpack changes for binary data between 5.8 and 5.10

I needed the length of the string to write the string and its length in a binary file.

I'm only using _utf8_on in this example; in the original code, the string already had its utf8 flag on (it was coming from gtk2, which uses utf8 everywhere), so I was expecting it to be utf8-encoded.

I understand that my code was ambiguous because it depends on the internal representation. I wrote it a long time ago, when I didn't have much experience in perl and didn't really know how utf8 was handled.

But I don't think using a string in pack should result in something that depends on the internal representation of the string: the internal representation should be internal :)

Honestly, I don't like how utf8 is handled in perl: it tries to do everything automagically, but this makes things less clear.


Re^6: Understanding pack and unpack changes for binary data between 5.8 and 5.10
by ikegami (Pope) on Mar 13, 2009 at 15:10 UTC

    I'm only using _utf8_on in this example; in the original code, the string already had its utf8 flag on

    utf8::upgrade and utf8::downgrade are the proper way to convert between internal encodings.
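
To make the point above concrete, here is a minimal sketch (not from the original post) of what changes and what doesn't when a string is converted between the two internal encodings. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $bytes = "bj\x{f6}rk";   # five characters, all below 256
my $wide  = $bytes;         # separate copy, so each scalar has its own storage

utf8::downgrade($bytes);    # stored internally as one byte per character
utf8::upgrade($wide);       # stored internally as UTF-8

# Only the internal flag differs; the string value is the same.
printf "flag: %d vs %d\n",
    (utf8::is_utf8($bytes) ? 1 : 0), (utf8::is_utf8($wide) ? 1 : 0);
printf "equal: %s\n", ($bytes eq $wide ? "yes" : "no");
printf "length: %d vs %d\n", length($bytes), length($wide);
```

Both scalars compare equal and have length 5; only utf8::is_utf8 tells them apart, which is why code should not normally care about it.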

    But I don't think using a string in pack should result in something that depends on the internal representation of the string: the internal representation should be internal :)

    Exactly. In 5.10.0, exactly the same string is produced no matter what the internal encoding is.

    use strict;
    use warnings;

    use Carp qw( croak );

    sub avoid_utf8_internally {
        my ($s) = @_;
        utf8::downgrade($s, 1)
            or croak("Non-bytes found in input");
        return $s;
    }

    sub use_utf8_internally {
        my ($s) = @_;
        utf8::upgrade($s);
        return $s;
    }

    my $file_num;
    for my $s (
        avoid_utf8_internally("bj\x{f6}rk"),
        use_utf8_internally("bj\x{f6}rk"),
        "b" x 1000,
    ) {
        my $packed = pack("V/a*", $s);
        printf("%s -> %s\n", length($s), length($packed));
        open(my $fh, '>', 'packed'.++$file_num) or die;
        binmode $fh;  # No crlf mucking.
        print $fh $packed;
    }

    >perl script.pl
    5 -> 9
    5 -> 9
    1000 -> 1004

    >debug packed1
    -rcx
    CX 0009
    :
    -d100 l9
    0B14:0100  05 00 00 00 62 6A F6 72-6B                         ....bj.rk
    -q

    >debug packed2
    -rcx
    CX 0009
    :
    -d100 l9
    0B14:0100  05 00 00 00 62 6A F6 72-6B                         ....bj.rk
    -q

    >debug packed3
    -rcx
    CX 03EC
    :
    -d100 3EC
    0B14:0100  E8 03 00 00 62 62 62 62-62 62 62 62 62 62 62 62   ....bbbbbbbbbbbb
    0B14:0110  62 62 62 62 62 62 62 62-62 62 62 62 62 62 62 62   bbbbbbbbbbbbbbbb
    ...
    0B14:04D0  62 62 62 62 62 62 62 62-62 62 62 62 62 62 62 62   bbbbbbbbbbbbbbbb
    0B14:04E0  62 62 62 62 62 62 62 62-62 62 62 62               bbbbbbbbbbbb
    -q

    5.8.8, on the other hand, *is* dependent on the internal encoding.

    >debug packed1
    -rcx
    CX 0009
    :
    -d100 l9
    0B14:0100  05 00 00 00 62 6A F6 72-6B                         ....bj.rk
    -q

    >debug packed2
    -rcx
    CX 0009
    :
    -d100 l9
    0B14:0100  05 00 00 00 62 6A C3 B6-72                         ....bj..r
    -q

    I don't like how utf8 is handled in perl,

    You shouldn't even have to know about the internal encoding. Fixing this is an ongoing process, and that's precisely why pack was changed in 5.10.0. Why are you complaining about such a fix?!

Re^6: Understanding pack and unpack changes for binary data between 5.8 and 5.10
by ikegami (Pope) on Mar 13, 2009 at 16:10 UTC
    A proper test:
    use strict;
    use warnings;

    use Test::More tests => 2 * ( 2 + 6 + 6 );

    use Carp qw( croak );

    sub avoid_utf8 {
        my ($s) = @_;
        utf8::downgrade($s, 1)
            or croak("Input not a string of bytes");
        return $s;
    }

    sub use_utf8 {
        my ($s) = @_;
        utf8::upgrade($s);
        return $s;
    }

    diag("Perl $]");

    for (
        [ "bj\x{f6}rk", 5,    'hibit' ],
        [ "b" x 1000,   1000, 'long'  ],
    ) {
        my ($s, $length, $test_name) = @$_;

        # length

        my %length;
        $length{'0'} = length avoid_utf8 $s;
        $length{'1'} = length use_utf8   $s;

        for my $enc (qw( 0 1 )) {
            is($length{$enc}, $length, "length $test_name $enc");
        }

        # pack 'V/a*'

        my $expected = pack('V', $length) . $s;

        my %packed;
        $packed{'?0'} =            pack "V/a*", avoid_utf8 $s;
        $packed{'?1'} =            pack "V/a*", use_utf8   $s;
        $packed{'00'} = avoid_utf8 pack "V/a*", avoid_utf8 $s;
        $packed{'01'} = avoid_utf8 pack "V/a*", use_utf8   $s;
        $packed{'10'} = use_utf8   pack "V/a*", avoid_utf8 $s;
        $packed{'11'} = use_utf8   pack "V/a*", use_utf8   $s;

        for my $enc (qw( ?0 ?1 00 01 10 11 )) {
            ok($packed{$enc} eq $expected, "pack $test_name $enc");
        }

        # print

        for my $enc (qw( ?0 ?1 00 01 10 11 )) {
            my $buf = '';
            {
                open(my $fh, '>', \$buf);
                binmode $fh;  # No mucking with crlf
                print $fh $packed{$enc};
            }

            ok($buf eq $expected, "print $test_name $enc");
        }
    }

    >c:\progs\perl588\bin\perl test.pl
    1..28
    # Perl 5.008008
    ok 1 - length hibit 0
    ok 2 - length hibit 1
    ok 3 - pack hibit ?0
    not ok 4 - pack hibit ?1
    #   Failed test 'pack hibit ?1'
    #   at test.pl line 54.
    ok 5 - pack hibit 00
    not ok 6 - pack hibit 01
    #   Failed test 'pack hibit 01'
    #   at test.pl line 54.
    ok 7 - pack hibit 10
    not ok 8 - pack hibit 11
    #   Failed test 'pack hibit 11'
    #   at test.pl line 54.
    ok 9 - print hibit ?0
    not ok 10 - print hibit ?1
    #   Failed test 'print hibit ?1'
    #   at test.pl line 67.
    ok 11 - print hibit 00
    not ok 12 - print hibit 01
    #   Failed test 'print hibit 01'
    #   at test.pl line 67.
    ok 13 - print hibit 10
    not ok 14 - print hibit 11
    #   Failed test 'print hibit 11'
    #   at test.pl line 67.
    ok 15 - length long 0
    ok 16 - length long 1
    ok 17 - pack long ?0
    ok 18 - pack long ?1
    ok 19 - pack long 00
    ok 20 - pack long 01
    ok 21 - pack long 10
    ok 22 - pack long 11
    ok 23 - print long ?0
    ok 24 - print long ?1
    ok 25 - print long 00
    ok 26 - print long 01
    ok 27 - print long 10
    ok 28 - print long 11
    # Looks like you failed 6 tests of 28.

    >c:\progs\perl5100\bin\perl test.pl
    1..28
    # Perl 5.010000
    ok 1 - length hibit 0
    ok 2 - length hibit 1
    ok 3 - pack hibit ?0
    ok 4 - pack hibit ?1
    ok 5 - pack hibit 00
    ok 6 - pack hibit 01
    ok 7 - pack hibit 10
    ok 8 - pack hibit 11
    ok 9 - print hibit ?0
    ok 10 - print hibit ?1
    ok 11 - print hibit 00
    ok 12 - print hibit 01
    ok 13 - print hibit 10
    ok 14 - print hibit 11
    ok 15 - length long 0
    ok 16 - length long 1
    ok 17 - pack long ?0
    ok 18 - pack long ?1
    ok 19 - pack long 00
    ok 20 - pack long 01
    ok 21 - pack long 10
    ok 22 - pack long 11
    ok 23 - print long ?0
    ok 24 - print long ?1
    ok 25 - print long 00
    ok 26 - print long 01
    ok 27 - print long 10
    ok 28 - print long 11

    Internal encoding surfaces in 5.8.8, but not in 5.10.0 (for the functions tested).

      Ok, I'll try to be clear this time :)
      What I wanted was to write the string encoded in utf8, along with the length, in bytes, of the binary string resulting from pack. So I was using:

      my $p = pack "V/a*", $s;
      my $l = length $p;

      When I should have been using:

      use Encode qw/encode/;
      use bytes ();  # loads bytes::length without turning on byte semantics
      my $p = pack "V/a*", encode('utf8', $s);
      my $l = bytes::length($p);  # using bytes::length just to be sure: $p shouldn't have its utf8 flag on, but in case it does...

      Thinking about it a little more, I think what is disturbing me is that the 'a' in the pack format can be a multi-byte character. And more generally, the idea that utf8 strings are strings of multi-byte characters, rather than strings of bytes in utf8 encoding.
      perl 5.10's pack behavior does seem to make more sense now.
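
A minimal sketch of the encode-then-pack approach described above, assuming Perl 5.10 or later. The helper name pack_utf8_string is hypothetical and only there for illustration; the point is that once the text has been turned into bytes explicitly, the 'V' prefix written by pack is a byte count.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical helper: a 32-bit little-endian byte count followed by
# the UTF-8 bytes of $text.
sub pack_utf8_string {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);   # characters -> bytes, explicitly
    return pack('V/a*', $octets);
}

my $text   = "bj\x{f6}rk";                 # 5 characters, 6 bytes in UTF-8
my $packed = pack_utf8_string($text);

my ($count)  = unpack('V', $packed);       # the length prefix
my ($octets) = unpack('V/a*', $packed);    # the payload bytes
my $decoded  = decode('UTF-8', $octets);

printf "prefix says %d bytes, record is %d bytes\n", $count, length($packed);
printf "round trip ok: %s\n", ($decoded eq $text ? "yes" : "no");
```

Here the prefix is 6 (the encoded size) and the whole record is 10 bytes, regardless of how $text happened to be stored internally.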

        I think what is disturbing me is that the 'a' in the pack format can be a multi-byte character.

        Me too. You've gotta wonder what's going to happen more often: someone wanting to pack non-encoded characters or someone accidentally packing non-encoded characters. I would say the latter, so I find it weird that it doesn't croak ("Wide char in ...") when passed non-encoded characters.

        It could be a side effect of allowing pack and unpack to work with fixed-width fields, where the width is in characters rather than bytes.

        my $rec_format = 'a4a5a1';
        my $rec_size   = 10;

        binmode $fh_out, ':encoding(UTF-8)';
        print $fh_out pack($rec_format, @fields);

        ...

        binmode $fh_in, ':encoding(UTF-8)';
        read($fh_in, my $rec = '', $rec_size);
        @fields = unpack($rec_format, $rec);
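
The fragment above can be fleshed out into something runnable. This is only a sketch under assumptions: it needs Perl 5.10 or later, it uses an in-memory file in place of real file handles, and the sample field values are made up. It shows fixed-width records whose widths are counted in characters, with the encoding layer doing the conversion to bytes.

```perl
use strict;
use warnings;

my $rec_format = 'a4a5a1';
my $rec_size   = 10;                        # 4 + 5 + 1 characters

# Made-up sample data; the second field has a character above 255.
my @fields = ("caf\x{e9}", "\x{263a}moon", "x");

# Write one record through an encoding layer into an in-memory file.
my $file = '';
open(my $fh_out, '>:encoding(UTF-8)', \$file) or die "open for write: $!";
print $fh_out pack($rec_format, @fields);
close($fh_out) or die "close: $!";

# Read it back. With the layer in place, read() counts characters.
open(my $fh_in, '<:encoding(UTF-8)', \$file) or die "open for read: $!";
read($fh_in, my $rec = '', $rec_size);
close($fh_in);

my @got = unpack($rec_format, $rec);

printf "%d encoded bytes, %d characters read\n", length($file), length($rec);
printf "fields match: %s\n", ("@got" eq "@fields" ? "yes" : "no");
```

The record is 10 characters long but takes more than 10 bytes once encoded, so a reader that counted bytes instead of characters would split the fields in the wrong place.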
