Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation

Re^2: What does utf8::upgrade actually do.

by syphilis (Archbishop)
on Feb 17, 2021 at 12:34 UTC ( #11128492=note: print w/replies, xml ) Need Help??

in reply to Re: What does utf8::upgrade actually do.
in thread What does utf8::upgrade actually do.

What an XS module does when it wants to process a list of bytes is of course up to the XS module's author

Ok - but I guess the module author (me) should probably document the procedure that the module takes. (The lack of any such documentation seems to have been a part of ribasushi's objection, and I think that's fair enough.)

For simplicity, let's stick to a single-byte string:
use Math::GMPz qw(:mpz); my $z = Math::GMPz->new(); my $v = 255; $str = chr(ord $v); Rmpz_import($z, 1, 1, 1, 0, 0, $str); print $z; # prints the value assigned to $v (ie 255).
But let's say the user instead does a utf8::upgrade of the string, as per the following:
use Math::GMPz qw(:mpz); my $z = Math::GMPz->new(); my $v = 255; $str = chr($v); utf8::upgrade($str); Rmpz_import($z, 1, 1, 1, 0, 0, $str); print $z; # now prints 195.
The crux of the issue is "what do I (the module author) conclude regarding the expectation of the user that wrote that second block of code ? "

As I see it, I have only 3 choices:
a) conclude that the user's expected result is to see an output of "255";
b) conclude that the user's expected result is to see an output of "195";
c) conclude that I have insufficient information to know what output the user expects (except that the user will be expecting either "255" or "195").

Which is the correct conclusion for me to reach ?
I can accommodate either 'a)', 'b)', or 'c)' and I think the answer is probably 'c)', but I'd just like an informed opinion on that.


Replies are listed 'Best First'.
Re^3: What does utf8::upgrade actually do.
by dave_the_m (Monsignor) on Feb 17, 2021 at 13:58 UTC
    I would very strongly suggest that the user should expect Rmpz_import() to process the series of base-256 "digits" obtained by (map ord($_), split //, $str), regardless of the internal encoding of the string. So (a) is the correct result. (b) is just horrible, and is repeating the broken Unicode model that appeared in perl 5.6 and was (mostly) fixed by perl 5.8.

    Your only real decision needs to be what to do for a codepoint > 0xff. Three obvious choices are: croak; treat each codepoint modulo 256, or carry the overflow into the next digit. So the string "\x40\x{150}\x60" would yield the integer value 0x615040. (I haven't looked at what endedness the function works to, but that should give you the general idea of what I mean.)


      Hi Dave,

      If we accept that croaking is acceptable whenever there's a codepoint > 0xff, then I believe that simply replacing SvPV_nolen() with SvPVbyte_nolen() takes care of the points you've raised.
      It looks to me that SvPVbyte_nolen() croaks with "Wide character in subroutine entry" whenever there's a codepoint > 0xff.
      I also considered using SvPVutf8_nolen() for when a codepoint > 0xff is encountered but, with the string "\x40\x{150}\x60", that leads to an integer value of 0x6090c540. It's not apparent to me that there's any value in going down that particular path.

      As roboticus pointed out, there's also the matter of warnings and documentation to attend to.
      I did consider simply croaking if the UTF8 flag is set. Given the mpz_import() spec, I think that could be justified ... but where's the challenge in adopting such a wise and practical solution ;-)

      Anyway ... I think I've got the information I need. It's now just a matter of thinking it through in a sane and orderly fashion.

      Thanks dave_the_m, roboticus and Tux.
      I appreciate not only the fact that you replied, but also the time and effort that was put into composing those replies.

Re^3: What does utf8::upgrade actually do.
by roboticus (Chancellor) on Feb 17, 2021 at 17:20 UTC


    I'd expect to see 255, but I wouldn't object to seeing a warning if the SVf_UTF8 flag was set on the input variable. The GMP manual gives enough information for an experienced programmer to see that GMP is expecting a binary vector of fixed-length words to process, and the internal UTF codepoints of perl are clearly not that. So handing a UTF encoded string to that function is at least suspicious.

    I think I'd add a chunk to the modules POD to tell users how to handle UTF strings, and make the module issue a warning if it's presented with a UTF string, so they'd be directed to look at that part of the documentation. You might also modify the $order and/or $endian parameters to give a combination that would let them indicate that you should do the decode for them if they see the UTF string.

    My reasoning is essentially that the GMP documentation for import clearly indicates that we should be treating the data as a vector of fixed-length words, and UTF encoding is *not* that. If we see a UTF flag on a string, I'd expect that *some* conversion happened somewhere (whether intentional or unintentional) such that the oft-assumed1 bytes == characters assumption does not necessarily hold true.

    I often wish that we had a flag on the variables that would let us specify that the buffer holds an exact representation of the bytes that came from the data source, so we could tell when the data was munged. But of course, I have no idea how to define appropriate semantics, as there's no way to get people to agree on the set of cases where we could change the string without turning that flag off (chop, chomp, s///, tr, ....), and/or how to create a string with the flag set appropriately without too much fuss and bother.

    Note 1: Sure it's a bad assumption in many contexts, but many perl-mongers (myself included) do much more binary-processing than processing involving unicode.


    When your only tool is a hammer, all problems look like your thumb.

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11128492]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others studying the Monastery: (5)
As of 2022-05-23 13:50 GMT
Find Nodes?
    Voting Booth?
    Do you prefer to work remotely?

    Results (82 votes). Check out past polls.