http://www.perlmonks.org?node_id=11128489


in reply to What does utf8::upgrade actually do.

Perl's strings are formally a list of codepoints. So the following applies:
@codepoints = map ord($_), split //, $s1;
$s2 = join '', map chr($_), @codepoints;
ok($s1 eq $s2);
ok(length($s1) == length($s2));
How perl internally stores those codepoints is up to perl, and perl-level code mostly needn't care about the difference. XS code, on the other hand, needs to know about it if it's going to start rummaging around accessing the individual bytes making up the string's storage. Currently perl uses two storage formats: the traditional one byte per codepoint, and the utf8 variable-length encoding. The encoding is indicated by the SVf_UTF8 flag. You can't guarantee which encoding will be used - that's up to perl. For example, at the moment:
$s1 = "abc\x80";
$s2 = $s1;          # currently SVf_UTF8 not set; string uses 4 bytes of storage
$s2 .= "\x{100}";   # currently perl upgrades to SVf_UTF8 and converts the 0x80 and 0x100 into multi-byte representations
chop($s2);          # currently perl doesn't downgrade; the 0x80 codepoint still stored as 2 bytes
ok($s1 eq $s2);
ok(length($s1) == length($s2));
utf8::upgrade() and utf8::downgrade() are just ways of forcing the encoding format of the internal representation - useful for demonstrating bugs in modules which make assumptions. Note that they don't change the semantics of the string - perl thinks they still have the same length and codepoints. To continue the example above:
utf8::downgrade($s2); # the 0x80 codepoint now stored as 1 byte
ok(length($s1) == length($s2));
ok($s1 eq $s2);
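The difference between the two storage formats can be observed from Perl itself. The following is only a rough sketch: it uses the core functions utf8::is_utf8() and bytes::length() to peek at the flag and the storage size, and it forces the format with utf8::upgrade() rather than relying on what perl currently happens to do after a concatenation and chop().

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length() without enabling byte semantics

my $s1 = "abc\x80";
my $s2 = $s1;
utf8::upgrade($s2);    # force the variable-length internal format

# Same string as far as Perl-level code is concerned ...
print "equal\n"       if $s1 eq $s2;
print "same length\n" if length($s1) == length($s2);

# ... but the flag and the storage size differ.
printf "s1: flag=%d storage=%d bytes\n",
    utf8::is_utf8($s1) ? 1 : 0, bytes::length($s1);
printf "s2: flag=%d storage=%d bytes\n",
    utf8::is_utf8($s2) ? 1 : 0, bytes::length($s2);

utf8::downgrade($s2);  # back to one byte per codepoint
printf "s2: flag=%d storage=%d bytes\n",
    utf8::is_utf8($s2) ? 1 : 0, bytes::length($s2);
```

Peeking like this is for diagnosis only; ordinary Perl code should not need to branch on the flag.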
What an XS module does when it wants to process a list of bytes is of course up to the XS module's author. However, just using the current bytes of the internal representation is probably a poor choice - two strings which are semantically identical at the perl level but have different internal representations will give different results (e.g. the $s1 above initially and the $s2 after the chop()). If there is no sensible interpretation of the meaning of codepoints > 0xff, then I would suggest the XS code should check the SVf_UTF8 flag and, if it is set, try to downgrade the string, and croak if that is not possible.
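The "check the flag, try to downgrade, else croak" suggestion can be sketched at the Perl level, as a stand-in for what the XS code would do. The helper name bytes_or_croak is hypothetical; utf8::is_utf8() and utf8::downgrade() with a true second argument (fail quietly instead of dying) are core functions.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper: return the string in one-byte-per-codepoint form,
# or croak if some codepoint is above 0xff.
sub bytes_or_croak {
    my ($str) = @_;    # a copy, so the caller's scalar is left untouched
    if (utf8::is_utf8($str)) {
        # A true second argument makes downgrade return false on failure.
        utf8::downgrade($str, 1)
            or croak "string contains a codepoint > 0xff";
    }
    return $str;
}

my $s = "abc\x80";
utf8::upgrade($s);
my $bytes = bytes_or_croak($s);                     # downgraded copy
my $ok    = eval { bytes_or_croak("\x{100}"); 1 };  # croaks, so $ok is undef

printf "length=%d flag=%d\n", length($bytes), utf8::is_utf8($bytes) ? 1 : 0;
print "wide string rejected\n" unless $ok;
```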

Dave.

Re^2: What does utf8::upgrade actually do.
by syphilis (Archbishop) on Feb 17, 2021 at 12:34 UTC
    What an XS module does when it wants to process a list of bytes is of course up to the XS module's author

    Ok - but I guess the module author (me) should probably document the procedure that the module takes. (The lack of any such documentation seems to have been a part of ribasushi's objection, and I think that's fair enough.)

    For simplicity, let's stick to a single-byte string:
    use Math::GMPz qw(:mpz);
    my $z = Math::GMPz->new();
    my $v = 255;
    $str = chr($v);
    Rmpz_import($z, 1, 1, 1, 0, 0, $str);
    print $z; # prints the value assigned to $v (ie 255).
    But let's say the user instead does a utf8::upgrade of the string, as per the following:
    use Math::GMPz qw(:mpz);
    my $z = Math::GMPz->new();
    my $v = 255;
    $str = chr($v);
    utf8::upgrade($str);
    Rmpz_import($z, 1, 1, 1, 0, 0, $str);
    print $z; # now prints 195.
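Where the 195 comes from can be shown without Math::GMPz. This is only an illustration: it assumes the upgraded string is stored as the UTF-8 encoding of U+00FF, and uses the core function utf8::encode() on a copy to obtain those same bytes.

```perl
use strict;
use warnings;

my $str     = chr(255);
my $storage = $str;
utf8::encode($storage);    # $storage now holds the UTF-8 bytes of U+00FF

my @bytes = map { ord($_) } split //, $storage;
print "@bytes\n";          # 195 191; a 1-byte import reads only the first
```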
    The crux of the issue is "what do I (the module author) conclude regarding the expectation of the user that wrote that second block of code ? "

    As I see it, I have only 3 choices:
    a) conclude that the user's expected result is to see an output of "255";
    b) conclude that the user's expected result is to see an output of "195";
    c) conclude that I have insufficient information to know what output the user expects (except that the user will be expecting either "255" or "195").

    Which is the correct conclusion for me to reach ?
    I can accommodate either 'a)', 'b)', or 'c)' and I think the answer is probably 'c)', but I'd just like an informed opinion on that.

    Cheers,
    Rob
      I would very strongly suggest that the user should expect Rmpz_import() to process the series of base-256 "digits" obtained by (map ord($_), split //, $str), regardless of the internal encoding of the string. So (a) is the correct result. (b) is just horrible, and is repeating the broken Unicode model that appeared in perl 5.6 and was (mostly) fixed by perl 5.8.

      Your only real decision needs to be what to do for a codepoint > 0xff. Three obvious choices are: croak; treat each codepoint modulo 256; or carry the overflow into the next digit. With the last of these, the string "\x40\x{150}\x60" would yield the integer value 0x615040. (I haven't looked at what endianness the function works to, but that should give you the general idea of what I mean.)
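The three choices can be compared with a small pure-Perl sketch. The helper digits_to_int and its mode names are hypothetical, and it assumes the least significant digit comes first, which is what the 0x615040 example implies. Math::BigInt is a core module.

```perl
use strict;
use warnings;
use Math::BigInt;

# Hypothetical helper: treat each codepoint as a base-256 "digit",
# least significant first (an assumption).
sub digits_to_int {
    my ($str, $mode) = @_;
    my $n     = Math::BigInt->new(0);
    my $place = Math::BigInt->new(1);
    for my $ch (split //, $str) {
        my $digit = ord $ch;
        if ($digit > 0xff) {
            die "codepoint > 0xff\n" if $mode eq 'croak';
            $digit %= 256            if $mode eq 'modulo';
            # in 'carry' mode the excess spills into the higher digits
        }
        $n     += $place * $digit;
        $place *= 256;
    }
    return $n;
}

my $str = "\x40\x{150}\x60";
print digits_to_int($str, 'carry')->as_hex,  "\n";   # 0x615040
print digits_to_int($str, 'modulo')->as_hex, "\n";   # 0x605040
print "croaked\n" unless eval { digits_to_int($str, 'croak'); 1 };
```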

      Dave.

        Hi Dave,

        If we accept that croaking is acceptable whenever there's a codepoint > 0xff, then I believe that simply replacing SvPV_nolen() with SvPVbyte_nolen() takes care of the points you've raised.
        It looks to me that SvPVbyte_nolen() croaks with "Wide character in subroutine entry" whenever there's a codepoint > 0xff.
        I also considered using SvPVutf8_nolen() for when a codepoint > 0xff is encountered but, with the string "\x40\x{150}\x60", that leads to an integer value of 0x6090c540. It's not apparent to me that there's any value in going down that particular path.
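The 0x6090c540 figure can be checked in pure Perl. This sketch assumes the bytes handed back by SvPVutf8_nolen() are the UTF-8 encoding of the string, which utf8::encode() reproduces on a copy, and that the bytes are read least significant first.

```perl
use strict;
use warnings;

my $str  = "\x40\x{150}\x60";
my $utf8 = $str;
utf8::encode($utf8);    # bytes 40 c5 90 60

# Read the bytes least significant first.
my $hex = join '', map { sprintf('%02x', ord($_)) } reverse split //, $utf8;
print "0x$hex\n";       # 0x6090c540

# A downgrade, by contrast, fails for this string.
my $copy = $str;
my $downgraded = utf8::downgrade($copy, 1);
print "cannot downgrade\n" unless $downgraded;
```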

        As roboticus pointed out, there's also the matter of warnings and documentation to attend to.
        I did consider simply croaking if the UTF8 flag is set. Given the mpz_import() spec, I think that could be justified ... but where's the challenge in adopting such a wise and practical solution ;-)

        Anyway ... I think I've got the information I need. It's now just a matter of thinking it through in a sane and orderly fashion.

        Thanks dave_the_m, roboticus and Tux.
        I appreciate not only the fact that you replied, but also the time and effort that was put into composing those replies.

        Cheers,
        Rob

      syphilis:

      I'd expect to see 255, but I wouldn't object to seeing a warning if the SVf_UTF8 flag was set on the input variable. The GMP manual gives enough information for an experienced programmer to see that GMP is expecting a binary vector of fixed-length words to process, and the internal UTF codepoints of perl are clearly not that. So handing a UTF encoded string to that function is at least suspicious.

      I think I'd add a chunk to the module's POD to tell users how to handle UTF strings, and make the module issue a warning if it's presented with a UTF string, so they'd be directed to look at that part of the documentation. You might also modify the $order and/or $endian parameters to give a combination that would let them indicate that you should do the decode for them if they see the UTF string.

      My reasoning is essentially that the GMP documentation for import clearly indicates that we should be treating the data as a vector of fixed-length words, and UTF encoding is *not* that. If we see a UTF flag on a string, I'd expect that *some* conversion happened somewhere (whether intentional or unintentional) such that the oft-assumed (see Note 1) bytes == characters assumption does not necessarily hold true.

      I often wish that we had a flag on the variables that would let us specify that the buffer holds an exact representation of the bytes that came from the data source, so we could tell when the data was munged. But of course, I have no idea how to define appropriate semantics, as there's no way to get people to agree on the set of cases where we could change the string without turning that flag off (chop, chomp, s///, tr, ....), and/or how to create a string with the flag set appropriately without too much fuss and bother.

      Note 1: Sure it's a bad assumption in many contexts, but many perl-mongers (myself included) do much more binary-processing than processing involving unicode.

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.