in reply to What does utf8::upgrade actually do.

Perl has three internal storage formats for numbers: signed integer, unsigned integer and floating point number.

Similarly, Perl has two internal storage formats for strings (described below).

utf8::is_utf8 identifies the format used, and utf8::upgrade and utf8::downgrade convert how a string is stored internally.

    use feature qw( say );
    use Devel::Peek qw( Dump );

    my $s = chr(0xE9);
    say length($s);               # 1
    say $s eq "\xE9" ?1:0;        # 1
    say utf8::is_utf8($s) ?1:0;   # 0
    Dump($s);                     # PV contains E9

    utf8::upgrade($s);
    say length($s);               # 1  The string hasn't changed
    say $s eq "\xE9" ?1:0;        # 1
    say utf8::is_utf8($s) ?1:0;   # 1  But it's now stored differently.
    Dump($s);                     # PV contains C3 A9

    utf8::downgrade($s);
    say length($s);               # 1
    say $s eq "\xE9" ?1:0;        # 1
    say utf8::is_utf8($s) ?1:0;   # 0
    Dump($s);                     # PV contains E9

"Downgraded" format

Identified by the SVf_UTF8 flag (returned by utf8::is_utf8($sv) in Perl and SvUTF8(sv) in C) being clear.

Each character (string element) is capable of storing an 8-bit value.

Great for bytes. Not so good for text.

Each character is stored as a single byte. This allows very efficient access of arbitrary characters and very efficient access of the length of the string (both O(1)).
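Since each element of a downgraded string holds at most an 8-bit value, a string can only be stored this way if none of its characters exceeds 255. A minimal sketch of checking that, using the optional FAIL_OK argument of utf8::downgrade (the helper name can_downgrade is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string can be stored in the downgraded
# format, i.e. every character is in the range 0..255.
sub can_downgrade {
    my ($s) = @_;                            # Copy, so the caller's scalar is untouched.
    return utf8::downgrade($s, 1) ? 1 : 0;   # FAIL_OK: return false instead of dying.
}

print can_downgrade("caf\xE9"),  "\n";   # 1
print can_downgrade("\x{2660}"), "\n";   # 0
```

Without the second argument, utf8::downgrade dies instead of returning false when the string contains a character above 255.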


"Upgraded" format

Identified by the SVf_UTF8 flag (returned by utf8::is_utf8($sv) in Perl and SvUTF8(sv) in C) being set.

Each character (string element) is capable of storing a 72-bit value (in theory), a 64-bit value (on builds with uvsize of 8) or a 32-bit value (on builds with uvsize of 4).

This is more than enough to store any Unicode Code Point.

Each character is stored as its utf8 encoding. utf8 is a proprietary extension of UTF-8. Because it is a variable-length encoding, both accessing arbitrary characters and accessing the length of the string are very inefficient (O(N)), though Perl does attach the length of the string to the scalar when it becomes known, and it even attaches some character positions in some situations.
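To see the variable-length storage without Devel::Peek, here is a rough sketch that counts the bytes in a string's internal buffer. The helper internal_bytes is made up for illustration; it reproduces the buffer by encoding a copy of an upgraded string, which gives the same bytes for ordinary code points.

```perl
use strict;
use warnings;

# Hypothetical helper: number of bytes the string occupies internally.
sub internal_bytes {
    my ($s) = @_;                             # Copy; the copy keeps the storage format.
    utf8::encode($s) if utf8::is_utf8($s);    # Upgraded: one character may take several bytes.
    return length($s);                        # Downgraded: one byte per character.
}

my $spade = "\x{2660}";                  # One character above 255, so stored upgraded.
print length($spade),         "\n";      # 1
print internal_bytes($spade), "\n";      # 3

my $e = "\xE9";
print internal_bytes($e), "\n";          # 1
utf8::upgrade($e);
print internal_bytes($e), "\n";          # 2
print length($e),         "\n";          # 1
```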


The Unicode Bug

Notice how I didn't say format X is used to store Y. That's because Perl imparts no semantics on the choice of storage format. Just like three stored as a signed integer and three stored as a floating point number both refer to the same number, strings consisting of the same characters but stored in different formats are still considered the same string (i.e. eq will return true).

However, some code (particularly XS modules, but even some builtin operators) intentionally or inadvertently imparts meaning on the choice of internal storage format of strings. Code that does this is said to be suffering from The Unicode Bug. utf8::upgrade and utf8::downgrade are useful when working with such buggy code.
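A small sketch of the bug in builtin operators. It assumes the code is compiled without the unicode_strings feature (so no "use v5.12" or later), without a locale pragma, and without regex modifiers such as /u or /a; the helper names are made up for illustration.

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" here; it would hide the effect.

# Hypothetical helpers; each works on a copy that keeps the storage format.
sub is_word_char { my ($s) = @_; return $s =~ /\w/ ? 1 : 0; }
sub upper_is_C9  { my ($s) = @_; return uc($s) eq "\xC9" ? 1 : 0; }

my $down = "\xE9";          # LATIN SMALL LETTER E WITH ACUTE, stored downgraded
my $up   = $down;
utf8::upgrade($up);         # Same string, stored upgraded

print(($down eq $up ? 1 : 0), "\n");    # 1  Same string...
print is_word_char($down), "\n";        # 0  ...but \w doesn't match this one
print is_word_char($up),   "\n";        # 1
print upper_is_C9($down),  "\n";        # 0  and uc leaves this one alone.
print upper_is_C9($up),    "\n";        # 1
```

With the unicode_strings feature in effect, both strings should behave like the upgraded one here.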

Rmpz_import is such a function. Without knowing the details, switching to SvPVbyte* is a sensible solution. (This would mean you can't receive strings with characters larger than 255, though.) Other options include upgrading the string (SvPVutf8*) and handling both formats (by checking SvUTF8(sv)).
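Until such a function is fixed, the caller can work around it from the Perl side by downgrading the string before passing it in, which is roughly what SvPVbyte* would do on the C side. A sketch under stated assumptions: buggy_byte_count is a made-up stand-in for a function that reads the internal buffer directly (it is not Rmpz_import), and with_downgraded is a hypothetical wrapper.

```perl
use strict;
use warnings;

# Made-up stand-in for code with the bug: it counts the bytes of the
# internal buffer instead of the characters of the string.
sub buggy_byte_count {
    my ($s) = @_;
    utf8::encode($s) if utf8::is_utf8($s);
    return length($s);
}

# Hypothetical wrapper: pass a downgraded copy of the string to $code.
sub with_downgraded {
    my ($code, $bytes) = @_;
    utf8::downgrade($bytes, 1)
        or die "String contains characters above 255\n";
    return $code->($bytes);
}

my $bytes = "\xE9\x01";
utf8::upgrade($bytes);                                        # Stored as C3 A9 01

print buggy_byte_count($bytes), "\n";                         # 3  Wrong.
print with_downgraded(\&buggy_byte_count, $bytes), "\n";      # 2  Right.
```

As with SvPVbyte*, this approach can't accept strings containing characters larger than 255; the wrapper dies on those.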

Seeking work! You can reach me at ikegami@adaelis.com

Re^2: What does utf8::upgrade actually do.
by ikegami (Pope) on Feb 19, 2021 at 19:27 UTC

    Added to my answer (parent post). In particular, tied it back to Rmpz_import.
