PerlMonks  

Re^3: What does utf8::upgrade actually do.

by dave_the_m (Monsignor)
on Feb 17, 2021 at 13:58 UTC ( [id://11128494]=note )


in reply to Re^2: What does utf8::upgrade actually do.
in thread What does utf8::upgrade actually do.

I would very strongly suggest that the user should expect Rmpz_import() to process the series of base-256 "digits" obtained by (map ord($_), split //, $str), regardless of the internal encoding of the string. So (a) is the correct result. (b) is just horrible: it repeats the broken Unicode model that appeared in perl 5.6 and was (mostly) fixed by perl 5.8.
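As a rough illustration of "regardless of the internal encoding", the sketch below compares the ord()-per-character digits of a string before and after utf8::upgrade(). The helper name digits() and the sample string are made up for the example; nothing here calls Rmpz_import() itself.

```perl
use strict;
use warnings;

# Hypothetical helper: the base-256 "digits" of a string as seen at the
# Perl level, one ord() per character, independent of internal storage.
sub digits {
    my ($str) = @_;
    return map { ord($_) } split //, $str;
}

my $bytes    = "\x40\xe9\x60";   # stored internally as single bytes (UTF8 flag off)
my $upgraded = $bytes;
utf8::upgrade($upgraded);        # same characters, now stored as UTF-8 (flag on)

my @d1 = digits($bytes);
my @d2 = digits($upgraded);

# Both lines print the same digits: 40 e9 60
printf "flag off: %s\n", join ' ', map { sprintf '%02x', $_ } @d1;
printf "flag on:  %s\n", join ' ', map { sprintf '%02x', $_ } @d2;
```

The internal byte sequences differ (the upgraded copy stores \xe9 as two bytes), but the character-level view that the user sees does not.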

Your only real decision is what to do with a codepoint > 0xff. Three obvious choices are: croak; treat each codepoint modulo 256; or carry the overflow into the next digit. Under the carry option, the string "\x40\x{150}\x60" would yield the integer value 0x615040. (I haven't looked at what endianness the function works to, but that should give you the general idea of what I mean.)
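The three choices can be sketched in pure Perl as follows. This is only an illustration under stated assumptions: str_to_int() and its policy names are hypothetical, not part of Math::GMPz, and the least-significant-digit-first order is assumed simply because it matches the 0x615040 example.

```perl
use strict;
use warnings;
use Carp qw(croak);
use Math::BigInt;

# Hypothetical helper: treat each character's ord() as a base-256 digit,
# least significant digit first (assumed order), under one of three policies.
sub str_to_int {
    my ($str, $policy) = @_;
    my $value = Math::BigInt->new(0);
    my $place = Math::BigInt->new(1);
    for my $ch (split //, $str) {
        my $digit = ord($ch);
        if ($digit > 0xff) {
            croak sprintf('Wide character U+%04X in input', $digit)
                if $policy eq 'croak';
            $digit %= 256 if $policy eq 'modulo';
            # 'carry': keep the digit as-is, so the excess spills into the next place
        }
        $value += $place * $digit;
        $place *= 256;
    }
    return $value;
}

my $str = "\x40\x{150}\x60";
print str_to_int($str, 'carry')->as_hex,  "\n";   # 0x615040
print str_to_int($str, 'modulo')->as_hex, "\n";   # 0x605040
print "croak policy died\n" unless eval { str_to_int($str, 'croak'); 1 };
```

With the carry policy, 0x150 contributes 0x50 to its own digit and 1 to the next, which is how 0x60 becomes 0x61 in the result.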

Dave.

Re^4: What does utf8::upgrade actually do.
by syphilis (Archbishop) on Feb 18, 2021 at 14:25 UTC
    Hi Dave,

    If we agree that croaking is acceptable whenever there's a codepoint > 0xff, then I believe that simply replacing SvPV_nolen() with SvPVbyte_nolen() takes care of the points you've raised.
    It looks to me as though SvPVbyte_nolen() croaks with "Wide character in subroutine entry" whenever there's a codepoint > 0xff.
    I also considered using SvPVutf8_nolen() for when a codepoint > 0xff is encountered but, with the string "\x40\x{150}\x60", that leads to an integer value of 0x6090c540. It's not apparent to me that there's any value in going down that particular path.
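The two behaviours can be approximated at the Perl level without writing any XS. The sketch below uses utf8::downgrade() and utf8::encode() as loose stand-ins for SvPVbyte_nolen() and SvPVutf8_nolen() respectively; that correspondence, and the little-endian reading of the bytes, are assumptions made for illustration.

```perl
use strict;
use warnings;

my $str = "\x40\x{150}\x60";

# Byte view: utf8::downgrade() dies when a character does not fit in one byte.
my $copy = $str;
my $ok   = eval { utf8::downgrade($copy); 1 };
my $err  = $ok ? '' : $@;
print "downgrade failed: $err" unless $ok;

# UTF-8 view: utf8::encode() replaces the characters with their UTF-8 bytes.
# U+0150 becomes the two bytes c5 90, so three characters become four bytes.
my $encoded = $str;
utf8::encode($encoded);
my $hex_bytes = join ' ', map { sprintf '%02x', ord($_) } split //, $encoded;

# Read the four bytes as a little-endian 32-bit integer (byte order assumed).
my $as_int = sprintf '0x%08x', unpack 'V', $encoded;

print "bytes: $hex_bytes\n";   # 40 c5 90 60
print "int:   $as_int\n";      # 0x6090c540
```

The UTF-8 route silently changes the number of digits, which is why the resulting value bears no obvious relation to the original codepoints.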

    As roboticus pointed out, there's also the matter of warnings and documentation to attend to.
    I did consider simply croaking if the UTF8 flag is set. Given the mpz_import() spec, I think that could be justified ... but where's the challenge in adopting such a wise and practical solution ;-)

    Anyway ... I think I've got the information I need. It's now just a matter of thinking it through in a sane and orderly fashion.

    Thanks dave_the_m, roboticus and Tux.
    I appreciate not only the fact that you replied, but also the time and effort that was put into composing those replies.

    Cheers,
    Rob
