PerlMonks  

What does utf8::upgrade actually do.

by syphilis (Archbishop)
on Feb 17, 2021 at 06:02 UTC ( #11128486=perlquestion )

syphilis has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

A few days ago I noticed https://rt.cpan.org/Ticket/Display.html?id=123268 which had been sitting there for over 3 years.
I don't have any experience with utf8, but I fiddled around with utf8::upgrade() and utf8::downgrade() and decided that utf8::upgrade was actually altering the bytes of the string - or, to be more precise, altering those bytes whose value was greater than 0x7f.
But now I'm wondering if I was wrong - if, instead, all of the bytes of the string remain unchanged, and it's just the encoding that changes.
If that's so, then the person who filed the report is quite right to be surprised that the utf8::upgrade changed the returned value - and Rmpz_import() should probably change to reading the string in with SvPVbyte_nolen instead of SvPV_nolen.

Explanations that might reduce my confusion are most welcome ;-)

Cheers,
Rob

Replies are listed 'Best First'.
Re: What does utf8::upgrade actually do.
by dave_the_m (Monsignor) on Feb 17, 2021 at 09:04 UTC
    Perl's strings are formally lists of codepoints. So the following applies:
        @codepoints = map ord($_), split //, $s1;
        $s2 = join '', map chr($_), @codepoints;
        ok($s1 eq $s2);
        ok(length($s1) == length($s2));
    How perl internally stores those codepoints is up to perl, and perl-level code mostly needn't care about the difference. XS code on the other hand needs to know about it if it's going to start rummaging around accessing the individual bytes making up the string's storage. Currently perl uses two storage formats - traditional one byte per codepoint and utf8 variable length encoding. The encoding is indicated by the SVf_UTF8 flag. You can't guarantee which encoding will be used - that's up to perl. For example at the moment:
        $s1 = "abc\x80";
        $s2 = $s1;          # currently SVf_UTF8 not set; string uses 4 bytes of storage
        $s2 .= "\x{100}";   # currently perl upgrades to SVf_UTF8 and converts the 0x80 and 0x100 into multi-byte representations
        chop($s2);          # currently perl doesn't downgrade; the 0x80 codepoint still stored as 2 bytes
        ok($s1 eq $s2);
        ok(length($s1) == length($s2));
    utf8::upgrade() and utf8::downgrade() are just ways of forcing the encoding format of the internal representation - useful for demonstrating bugs in modules which make assumptions. Note that they don't change the semantics of the string - perl thinks they still have the same length and codepoints. To continue the example above:
        utf8::downgrade($s2);   # the 0x80 codepoint now stored as 1 byte
        ok(length($s1) == length($s2));
        ok($s1 eq $s2);
    What an XS module does when it wants to process a list of bytes is of course up to the XS module's author. However, just using the current bytes of the internal representation is probably a poor choice - two strings which are semantically identical at the perl level but have different internal representations will give different results (e.g. the $s1 above initially and the $s2 after the chop()). If there is no sensible interpretation of the meaning of codepoints > 0xff then I would suggest the XS code should check the SVf_UTF8 flag and, if present, try to downgrade the string, and if that is not possible, croak.
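
The "try to downgrade, else croak" suggestion can be sketched at the Perl level. This is only an illustration, not Math::GMPz code: bytes_or_croak is a hypothetical helper name, and a real fix would presumably live in the XS layer.

```perl
use strict;
use warnings;
use Carp qw( croak );

# Hypothetical helper (not part of Math::GMPz): return the string in the
# downgraded (one byte per codepoint) format, or croak if that is impossible.
sub bytes_or_croak {
    my ($str) = @_;    # work on a copy; the caller's scalar is left untouched
    # The second argument (FAIL_OK) makes utf8::downgrade return false
    # instead of dying when some codepoint is greater than 0xff.
    utf8::downgrade($str, 1)
        or croak "String contains a codepoint > 0xff";
    return $str;
}

my $s = chr(255);
utf8::upgrade($s);                      # internally stored as two bytes now
my $bytes = bytes_or_croak($s);
print length($bytes), " ", ord($bytes), "\n";                    # 1 255
print utf8::is_utf8($bytes) ? "upgraded\n" : "downgraded\n";     # downgraded
```

Working on a copy means the caller's string keeps whatever internal format it had; only the value handed on to byte-oriented code is normalised.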

    Dave.

      What an XS module does when it wants to process a list of bytes is of course up to the XS module's author

      Ok - but I guess the module author (me) should probably document the procedure that the module takes. (The lack of any such documentation seems to have been a part of ribasushi's objection, and I think that's fair enough.)

      For simplicity, let's stick to a single-byte string:
          use Math::GMPz qw(:mpz);
          my $z = Math::GMPz->new();
          my $v = 255;
          my $str = chr($v);
          Rmpz_import($z, 1, 1, 1, 0, 0, $str);
          print $z; # prints the value assigned to $v (ie 255).
      But let's say the user instead does a utf8::upgrade of the string, as per the following:
          use Math::GMPz qw(:mpz);
          my $z = Math::GMPz->new();
          my $v = 255;
          $str = chr($v);
          utf8::upgrade($str);
          Rmpz_import($z, 1, 1, 1, 0, 0, $str);
          print $z; # now prints 195.
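
Where the 195 comes from can be shown without Math::GMPz. The sketch below assumes Rmpz_import reads the first byte of the string's internal buffer; upgraded_bytes is a hypothetical helper that lists the bytes perl would hold for a string in the upgraded format.

```perl
use strict;
use warnings;

# Hypothetical helper: the byte values of $str as stored in the upgraded
# (utf8) internal format.
sub upgraded_bytes {
    my ($str) = @_;
    utf8::encode($str);    # replaces the copy with its utf8 octets
    return map { ord($_) } split //, $str;
}

my @bytes = upgraded_bytes(chr(255));
print "@bytes\n";    # 195 191
```

Codepoint 255 is stored as the two bytes 0xC3 0xBF once upgraded, and 0xC3 is 195, which matches the value printed in the second block.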
      The crux of the issue is "what do I (the module author) conclude regarding the expectation of the user that wrote that second block of code ? "

      As I see it, I have only 3 choices:
      a) conclude that the user's expected result is to see an output of "255";
      b) conclude that the user's expected result is to see an output of "195";
      c) conclude that I have insufficient information to know what output the user expects (except that the user will be expecting either "255" or "195").

      Which is the correct conclusion for me to reach ?
      I can accommodate either 'a)', 'b)', or 'c)' and I think the answer is probably 'c)', but I'd just like an informed opinion on that.

      Cheers,
      Rob
        I would very strongly suggest that the user should expect Rmpz_import() to process the series of base-256 "digits" obtained by (map ord($_), split //, $str), regardless of the internal encoding of the string. So (a) is the correct result. (b) is just horrible, and is repeating the broken Unicode model that appeared in perl 5.6 and was (mostly) fixed by perl 5.8.

        Your only real decision needs to be what to do for a codepoint > 0xff. Three obvious choices are: croak; treat each codepoint modulo 256; or carry the overflow into the next digit. With the carry option, the string "\x40\x{150}\x60" would yield the integer value 0x615040. (I haven't looked at what endianness the function works to, but that should give you the general idea of what I mean.)
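
The three choices can be compared in a few lines of Perl. This is a sketch only: digits_to_int and its mode names are hypothetical, the first character is taken as the least significant digit to match the example above, and native integers are used, so it is only meaningful for short strings.

```perl
use strict;
use warnings;
use Carp qw( croak );

# Hypothetical helper: treat the characters of $str as base-256 digits,
# least significant first. $mode is 'croak', 'modulo' or 'carry'.
sub digits_to_int {
    my ($str, $mode) = @_;
    my ($n, $place) = (0, 1);
    for my $cp (map { ord($_) } split //, $str) {
        if ($cp > 0xff) {
            croak sprintf("codepoint 0x%x does not fit in a byte", $cp)
                if $mode eq 'croak';
            $cp %= 256 if $mode eq 'modulo';
            # 'carry': leave $cp alone so the excess spills into the next digit
        }
        $n     += $cp * $place;
        $place *= 256;
    }
    return $n;
}

printf "0x%x\n", digits_to_int("\x40\x{150}\x60", 'carry');     # 0x615040
printf "0x%x\n", digits_to_int("\x40\x{150}\x60", 'modulo');    # 0x605040
```

Because the helper only looks at ord() of each character, it gives the same answer whether the input string is stored upgraded or downgraded.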

        Dave.

        syphilis:

        I'd expect to see 255, but I wouldn't object to seeing a warning if the SVf_UTF8 flag was set on the input variable. The GMP manual gives enough information for an experienced programmer to see that GMP is expecting a binary vector of fixed-length words to process, and the internal UTF codepoints of perl are clearly not that. So handing a UTF encoded string to that function is at least suspicious.

        I think I'd add a chunk to the module's POD to tell users how to handle UTF strings, and make the module issue a warning if it's presented with a UTF string, so they'd be directed to look at that part of the documentation. You might also modify the $order and/or $endian parameters to give a combination that would let them indicate that you should do the decode for them if they see the UTF string.

        My reasoning is essentially that the GMP documentation for import clearly indicates that we should be treating the data as a vector of fixed-length words, and UTF encoding is *not* that. If we see a UTF flag on a string, I'd expect that *some* conversion happened somewhere (whether intentional or unintentional) such that the oft-assumed[1] bytes == characters assumption does not necessarily hold true.

        I often wish that we had a flag on the variables that would let us specify that the buffer holds an exact representation of the bytes that came from the data source, so we could tell when the data was munged. But of course, I have no idea how to define appropriate semantics, as there's no way to get people to agree on the set of cases where we could change the string without turning that flag off (chop, chomp, s///, tr, ....), and/or how to create a string with the flag set appropriately without too much fuss and bother.

        Note 1: Sure it's a bad assumption in many contexts, but many perl-mongers (myself included) do much more binary-processing than processing involving unicode.

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

Re: What does utf8::upgrade actually do.
by Tux (Canon) on Feb 18, 2021 at 09:21 UTC

    Also keep in mind that a perl "string" does not need to be a single encoding for all of its content.

    Think XML and CSV where parts can be real binary and parts can be encoded.

    Upgrading/downgrading the complete string before processing (either in pure perl or in XS) will cause data-corruption.

    One more thing to keep in mind with codepoints is that Unicode allows a lot.

    e.g. U+001e2f (LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE) can be encoded in UTF-8 as e1 b8 af, c3 af cc 81, c3 ad cc 88, 69 cc 81 cc 88, or 69 cc 88 cc 81, all representing the same glyph. At the moment of writing, perl does not alter any of that, but if Unicode Normalization rules were applied on a semantic level, your world view would change:

    (update: I meanwhile learned that the order of the diacriticals can be meaningful, which is why two of the examples below do not normalize to U+001e2f)

        #!/usr/bin/perl

        use 5.18.2;
        use warnings;

        use Data::Peek;
        use Unicode::Normalize qw( normalize );
        use Encode qw( encode decode );
        use charnames qw(:full);

        sub dp {
            my ($tag, $dta) = @_;
            my $dp = DPeek ($dta);
            printf "%-6s: %-52s", $tag, $dp =~ s{^(\S+)\K}{" " x (26 - length $1)}er;
            utf8::is_utf8 ($dta) and print join " + " =>
                map { charnames::viacode (ord) } split // => $dta;
            say "";
            } # dp

        $| = 1;
        foreach my $bytes (
                "\xe1\xb8\xaf",
                "\xc3\xaf\xcc\x81",
                "\xc3\xad\xcc\x88",
                "\x69\xcc\x81\xcc\x88",
                "\x69\xcc\x88\xcc\x81",
                ) {
            my $u = decode ("utf-8", $bytes);
            dp ("Bytes", $bytes);
            dp ("UTF-8", $u);
            dp ("NF$_", normalize ($_, $u)) for qw( D C KD KC );
            say "";
            }

    ->

        Bytes : PV("\341\270\257"\0)
        UTF-8 : PV("\341\270\257"\0) [UTF8 "\x{1e2f}"] LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
        NFD   : PV("i\314\210\314\201"\0) [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
        NFC   : PV("\341\270\257"\0) [UTF8 "\x{1e2f}"] LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
        NFKD  : PV("i\314\210\314\201"\0) [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
        NFKC  : PV("\341\270\257"\0) [UTF8 "\x{1e2f}"] LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE

        Bytes : PV("\303\257\314\201"\0)
        UTF-8 : PV("\303\257\314\201"\0) [UTF8 "\x{ef}\x{301}"] LATIN SMALL LETTER I WITH DIAERESIS + COMBINING ACUTE ACCENT
        NFD   : PV("i\314\210\314\201"\0) [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
        NFC   : PV("\341\270\257"\0) [UTF8 "\x{1e2f}"] LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
        NFKD  : PV("i\314\210\314\201"\0) [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
        NFKC  : PV("\341\270\257"\0) [UTF8 "\x{1e2f}"] LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE

        Bytes : PV("\303\255\314\210"\0)
        UTF-8 : PV("\303\255\314\210"\0) [UTF8 "\x{ed}\x{308}"] LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS
        NFD   : PV("i\314\201\314\210"\0) [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
        NFC   : PV("\303\255\314\210"\0) [UTF8 "\x{ed}\x{308}"] LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS
        NFKD  : PV("i\314\201\314\210"\0) [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
        NFKC  : PV("\303\255\314\210"\0) [UTF8 "\x{ed}\x{308}"] LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS

        Bytes : PV("i\314\201\314\210"\0)
        UTF-8 : PV("i\314\201\314\210"\0) [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
        NFD   : PV("i\314\201\314\210"\0) [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
        NFC   : PV("\303\255\314\210"\0) [UTF8 "\x{ed}\x{308}"] LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS
        NFKD  : PV("i\314\201\314\210"\0) [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
        NFKC  : PV("\303\255\314\210"\0) [UTF8 "\x{ed}\x{308}"] LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS

        Bytes : PV("i\314\210\314\201"\0)
        UTF-8 : PV("i\314\210\314\201"\0) [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
        NFD   : PV("i\314\210\314\201"\0) [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
        NFC   : PV("\341\270\257"\0) [UTF8 "\x{1e2f}"] LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
        NFKD  : PV("i\314\210\314\201"\0) [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
        NFKC  : PV("\341\270\257"\0) [UTF8 "\x{1e2f}"] LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE

    Enjoy, Have FUN! H.Merijn
Re: What does utf8::upgrade actually do.
by ikegami (Patriarch) on Feb 19, 2021 at 18:56 UTC

    Perl has three internal storage formats for numbers: signed integer, unsigned integer and floating point number.

    Similarly, Perl has two internal storage formats for strings (described below).

    utf8::is_utf8 identifies the format used, and utf8::upgrade and utf8::downgrade convert how a string is stored internally.

        use feature qw( say );
        use Devel::Peek qw( Dump );

        my $s = chr(0xE9);
        say length($s);                # 1
        say $s eq "\xE9" ?1:0;         # 1
        say utf8::is_utf8($s) ?1:0;    # 0
        Dump($s);                      # PV contains E9

        utf8::upgrade($s);
        say length($s);                # 1  The string hasn't changed
        say $s eq "\xE9" ?1:0;         # 1
        say utf8::is_utf8($s) ?1:0;    # 1  But it's now stored differently.
        Dump($s);                      # PV contains C3 A9

        utf8::downgrade($s);
        say length($s);                # 1
        say $s eq "\xE9" ?1:0;         # 1
        say utf8::is_utf8($s) ?1:0;    # 0
        Dump($s);                      # PV contains E9

    "Downgraded" format

    Identified by the SVf_UTF8 flag (returned by utf8::is_utf8($sv) in Perl and SvUTF8(sv) in C) being clear.

    Each character (string element) is capable of storing an 8-bit value.

    Great for bytes. Not so good for text.

    Each character is stored as a single byte. This allows very efficient access of arbitrary characters and very efficient access of the length of the string (both O(1)).


    "Upgraded" format

    Identified by the SVf_UTF8 flag (returned by utf8::is_utf8($sv) in Perl and SvUTF8(sv) in C) being set.

    Each character (string element) is capable of storing a 72-bit value (in theory), a 64-bit value (on builds with uvsize of 8) or a 32-bit value (on builds with uvsize of 4).

    This is more than enough to store any Unicode Code Point.

    Each character is stored as its utf8 encoding. utf8 is a proprietary extension of UTF-8. As a variable-length encoding, both accessing arbitrary characters and accessing the length of the string are very inefficient (O(N)), though Perl does attach the length of the string to the scalar when it becomes known, and it even attaches some character positions in some situations.


    The Unicode Bug

    Notice how I didn't say format X is used to store Y. That's because Perl imparts no semantics on the choice of storage format. Just like three stored as a signed integer and three stored as a floating point number both refer to the same number, strings consisting of the same characters but stored in different formats are still considered the same string (i.e. eq will return true).

    However, some code (particularly XS modules, but even some builtin operators) intentionally or inadvertently imparts meaning on the choice of internal storage format of strings. Code that does this is said to be suffering from The Unicode Bug. utf8::upgrade and utf8::downgrade are useful when working with such buggy code.
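
A pure-Perl illustration of the same kind of bug, as a sketch: buggy_size is a hypothetical function that looks at the internal buffer (via bytes::length) instead of at the characters, so its answer depends on the storage format.

```perl
use strict;
use warnings;
use bytes ();    # makes bytes::length available without enabling byte semantics

# Hypothetical function showing The Unicode Bug: it measures the internal
# buffer rather than the string.
sub buggy_size   { my ($s) = @_; return bytes::length($s) }

# Storage-agnostic version: counts characters.
sub correct_size { my ($s) = @_; return length($s) }

my $down = "caf\xE9";
my $up   = $down;
utf8::upgrade($up);

print $down eq $up ? "same string\n" : "different strings\n";    # same string
print buggy_size($down),   " ", buggy_size($up),   "\n";         # 4 5
print correct_size($down), " ", correct_size($up), "\n";         # 4 4
```

Running a module's tests on both the upgraded and downgraded form of the same input is one way to expose this class of bug.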

    Rmpz_import is such a function. Without knowing the details, switching to SvPVbyte* is a sensible solution. (This would mean you can't receive strings with characters larger than 255, though.) Other options include upgrading the string (SvPVutf8*) and handling both formats (by checking SvUTF8(sv)).

    Seeking work! You can reach me at ikegami@adaelis.com

      Added to my answer (parent post). In particular, tied it back to Rmpz_import.

      Seeking work! You can reach me at ikegami@adaelis.com

Node Type: perlquestion [id://11128486]
Approved by Athanasius
Front-paged by Corion