syphilis has asked for the wisdom of the Perl Monks concerning the following question:
A few days ago I noticed https://rt.cpan.org/Ticket/Display.html?id=123268 which had been sitting there for over 3 years.
I don't have any experience with utf8, but I fiddled around with utf8::upgrade() and utf8::downgrade() and decided that utf8::upgrade was actually altering the bytes of the string - or, to be more precise, altering those bytes whose value was greater than 0x7f.
But now I'm wondering if I was wrong - if, instead, all of the bytes of the string remain unchanged, and it's just the encoding that changes.
If that's so, then the person who filed the report is quite right to be surprised that the utf8::upgrade changed the returned value - and Rmpz_import() should probably change to reading the string in with SvPVbyte_nolen instead of SvPV_nolen.
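To illustrate what I mean (sketched in Python rather than Perl, only because its str/bytes split makes the two candidate buffers easy to show): the codepoint stays 255 either way; only the buffer that a byte-oriented reader would see differs.

```python
s = "\xff"  # one character, codepoint 255 - like Perl's chr(255)

one_byte_form = s.encode("latin-1")  # the "downgraded" buffer: one byte per codepoint
utf8_form = s.encode("utf-8")        # the "upgraded" buffer: variable-length utf8

print(list(one_byte_form))  # [255]      - what SvPVbyte_nolen would hand over
print(list(utf8_form))      # [195, 191] - what SvPV_nolen exposes after utf8::upgrade
```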
Explanations that might reduce my confusion are most welcome ;-)
Cheers,
Rob
Re: What does utf8::upgrade actually do.
by dave_the_m (Monsignor) on Feb 17, 2021 at 09:04 UTC
How perl internally stores those codepoints is up to perl, and perl-level code mostly needn't care about the difference. XS code, on the other hand, needs to know about it if it's going to start rummaging around accessing the individual bytes making up the string's storage.

Currently perl uses two storage formats: traditional one-byte-per-codepoint, and the variable-length utf8 encoding. The encoding in use is indicated by the SVf_UTF8 flag. You can't guarantee which encoding will be used - that's up to perl.

utf8::upgrade() and utf8::downgrade() are just ways of forcing the encoding format of the internal representation - useful for demonstrating bugs in modules which make assumptions. Note that they don't change the semantics of the string - perl thinks it still has the same length and codepoints.

What an XS module does when it wants to process a list of bytes is of course up to the XS module's author. However, just using the current bytes of the internal representation is probably a poor choice - two strings which are semantically identical at the perl level but have different internal representations will give different results (e.g. a string built directly from single bytes versus the same codepoints left in upgraded form after a chop() has removed the wide character that forced the upgrade).

If there is no sensible interpretation of the meaning of codepoints > 0xff, then I would suggest the XS code should check the SVf_UTF8 flag and, if present, try to downgrade the string, and if that's not possible, croak.

Dave.
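A rough model of the point above, in Python (the two encode() calls stand in for the two internal buffers perl might currently be holding; the names are illustrative only): a routine that trusts the raw buffer gets different answers from one and the same string.

```python
s = "\xff"  # a single-character string, codepoint 255

# Stand-ins for the two internal buffers perl might currently hold for s:
buf_downgraded = s.encode("latin-1")  # b'\xff'
buf_upgraded = s.encode("utf-8")      # b'\xc3\xbf'

# An XS-style routine that rummages in the raw buffer sees two different
# values for what is, at the language level, the same string:
assert int.from_bytes(buf_downgraded, "big") == 0xFF    # 255
assert int.from_bytes(buf_upgraded, "big") == 0xC3BF    # 50111
```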
by syphilis (Archbishop) on Feb 17, 2021 at 12:34 UTC
Ok - but I guess the module author (me) should probably document the procedure that the module takes. (The lack of any such documentation seems to have been a part of ribasushi's objection, and I think that's fair enough.)

For simplicity, let's stick to a single-byte string - say one containing the lone character chr(255) - which Rmpz_import reads in and reports back. But let's say the user instead does a utf8::upgrade of the string before handing it over.

The crux of the issue is: what do I (the module author) conclude regarding the expectation of the user that wrote that second block of code? As I see it, I have only 3 choices:

a) conclude that the user's expected result is to see an output of "255";
b) conclude that the user's expected result is to see an output of "195";
c) conclude that I have insufficient information to know what output the user expects (except that the user will be expecting either "255" or "195").

Which is the correct conclusion for me to reach? I can accommodate 'a)', 'b)', or 'c)', and I think the answer is probably 'c)', but I'd just like an informed opinion on that.

Cheers,
Rob
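Where the two candidate outputs come from, modelled in Python (latin-1 and utf-8 stand in for the downgraded and upgraded internal buffers):

```python
s = "\xff"  # the single-byte string, chr(255)

# a) the user means the codepoint itself:
assert ord(s) == 255

# b) the user means the first byte of the upgraded internal buffer,
#    which is what SvPV_nolen exposes after utf8::upgrade:
assert s.encode("utf-8")[0] == 195  # 0xC3, first byte of the utf8 encoding
```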
by dave_the_m (Monsignor) on Feb 17, 2021 at 13:58 UTC
Your only real decision needs to be what to do for a codepoint > 0xff. Three obvious choices are: croak; treat each codepoint modulo 256; or carry the overflow into the next digit, so that the string "\x40\x{150}\x60" would yield the integer value 0x615040. (I haven't looked at what endianness the function works to, but that should give you the general idea of what I mean.)

Dave.
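The three choices can be sketched in Python (assuming the least-significant codepoint comes first, since the endianness is left open above):

```python
cps = [ord(c) for c in "\x40\u0150\x60"]   # [0x40, 0x150, 0x60]

# choice 1: refuse ("croak") on any codepoint that won't fit in a byte
if any(c > 0xFF for c in cps):
    pass  # a real implementation would raise/croak here

# choice 2: treat each codepoint modulo 256
mod256 = sum((c & 0xFF) * 256 ** i for i, c in enumerate(cps))
assert mod256 == 0x605040

# choice 3: carry the overflow into the next digit
carried = sum(c * 256 ** i for i, c in enumerate(cps))
assert carried == 0x615040
```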
by syphilis (Archbishop) on Feb 18, 2021 at 14:25 UTC
by roboticus (Chancellor) on Feb 17, 2021 at 17:20 UTC
I'd expect to see 255, but I wouldn't object to seeing a warning if the SVf_UTF8 flag was set on the input variable.

The GMP manual gives enough information for an experienced programmer to see that GMP is expecting a binary vector of fixed-length words to process, and perl's internal utf8-encoded codepoints are clearly not that. So handing a utf8-encoded string to that function is at least suspicious.

I think I'd add a chunk to the module's POD to tell users how to handle upgraded strings, and make the module issue a warning if it's presented with one, so they'd be directed to look at that part of the documentation. You might also modify the $order and/or $endian parameters to allow a combination that lets the user indicate that you should do the decode for them.

My reasoning is essentially that the GMP documentation for import clearly indicates that we should be treating the data as a vector of fixed-length words, and utf8 encoding is *not* that. If we see the UTF8 flag on a string, I'd expect that *some* conversion happened somewhere (whether intentional or unintentional), such that the oft-assumed[1] bytes == characters assumption does not necessarily hold true.

I often wish that we had a flag on variables that would let us specify that the buffer holds an exact representation of the bytes that came from the data source, so we could tell when the data was munged. But of course, I have no idea how to define appropriate semantics, as there's no way to get people to agree on the set of cases where we could change the string without turning that flag off (chop, chomp, s///, tr///, ...), and/or how to create a string with the flag set appropriately without too much fuss and bother.

[1] Sure, it's a bad assumption in many contexts, but many perl-mongers (myself included) do much more binary processing than processing involving unicode.

roboticus

When your only tool is a hammer, all problems look like your thumb.
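The "vector of fixed-length words" reading of the GMP manual can be modelled in Python like this (mpz_import_model is a hypothetical toy, not GMP's API; words are pre-split byte chunks):

```python
def mpz_import_model(words, order=1, endian=1):
    """Toy model of GMP mpz_import's view of its input: a plain vector of
    fixed-size words with no character semantics at all. order=1 means the
    most significant word comes first; endian=1 means big-endian within a
    word."""
    n = 0
    seq = words if order == 1 else list(reversed(words))
    for w in seq:
        wb = w if endian == 1 else bytes(reversed(w))
        for b in wb:
            n = n * 256 + b
    return n

# A byte vector is unambiguous...
assert mpz_import_model([b"\xff"]) == 255
# ...but feed it a utf8-encoded buffer and you silently get a different number:
assert mpz_import_model([b"\xc3", b"\xbf"]) == 0xC3BF
```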
Re: What does utf8::upgrade actually do.
by Tux (Canon) on Feb 18, 2021 at 09:21 UTC
Also keep in mind that a perl "string" does not need to be a single encoding for all of its content. Think XML and CSV, where parts can be real binary and parts can be encoded. Upgrading/downgrading the complete string before processing (either in pure perl or in XS) will cause data corruption.

One more thing to keep in mind with codepoints is that Unicode allows a lot. e.g. U+1E2F (LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE) can be encoded in UTF-8 as e1 b8 af, c3 af cc 81, c3 ad cc 88, 69 cc 81 cc 88, or 69 cc 88 cc 81, all representing the same glyph. At the moment of writing, perl does not alter any of that, but when Unicode Normalization rules apply on a semantic level, your world view changes. (update: I meanwhile learned that the order of the diacriticals can be meaningful, which is why two of the examples above do not normalize to U+1E2F)
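The normalization behaviour described above can be checked with Python's unicodedata module (the byte sequences quoted are just the UTF-8 encodings of these codepoint sequences):

```python
import unicodedata

s = "\u1e2f"  # LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE (UTF-8: e1 b8 af)

# Diaeresis-then-acute decomposes and recomposes cleanly:
nfd = unicodedata.normalize("NFD", s)
assert nfd == "i\u0308\u0301"                  # UTF-8: 69 cc 88 cc 81
assert unicodedata.normalize("NFC", nfd) == s

# Acute-then-diaeresis does NOT round-trip to U+1E2F - the order of the
# diacriticals is meaningful: it composes only partially, to U+00ED
# followed by the bare combining diaeresis.
other = "i\u0301\u0308"                        # UTF-8: 69 cc 81 cc 88
assert unicodedata.normalize("NFC", other) == "\u00ed\u0308"
```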
Enjoy, Have FUN! H.Merijn
Re: What does utf8::upgrade actually do.
by ikegami (Patriarch) on Feb 19, 2021 at 18:56 UTC
Perl has three internal storage formats for numbers: signed integer, unsigned integer and floating point number. Similarly, Perl has two internal storage formats for strings (described below). utf8::is_utf8 identifies the format used, and utf8::upgrade and utf8::downgrade convert how a string is stored internally.
"Downgraded" format

Identified by the SVf_UTF8 flag (returned by utf8::is_utf8($sv) in Perl and SvUTF8(sv) in C) being clear. Each character (string element) is capable of storing an 8-bit value. Great for bytes; not so good for text. Each character is stored as a single byte. This allows very efficient access of arbitrary characters and very efficient access of the length of the string (both O(1)).

"Upgraded" format

Identified by the SVf_UTF8 flag being set. Each character (string element) is capable of storing a 72-bit value (in theory), a 64-bit value (on builds with uvsize of 8) or a 32-bit value (on builds with uvsize of 4). This is more than enough to store any Unicode code point. Each character is stored as its utf8 encoding. utf8 is a proprietary extension of UTF-8. As a variable-length encoding, both accessing arbitrary characters and accessing the length of the string are very inefficient (O(N)), though Perl does attach the length of the string to the scalar when it becomes known, and it even attaches some character positions in some situations.

The Unicode Bug

Notice how I didn't say format X is used to store Y. That's because Perl imparts no semantics on the choice of storage format. Just like three stored as a signed integer and three stored as a floating point number both refer to the same number, strings consisting of the same characters but stored in different formats are still considered the same string (i.e. eq will return true). However, some code (particularly XS modules, but even some builtin operators) intentionally or inadvertently imparts meaning on the choice of internal storage format of strings. Code that does this is said to be suffering from The Unicode Bug. utf8::upgrade and utf8::downgrade are useful when working with such buggy code.

Rmpz_import is such a function. Without knowing the details, switching to SvPVbyte* is a sensible solution.
(This would mean you can't receive strings with characters larger than 255, though.) Other options include upgrading the string (SvPVutf8*) and handling both formats (by checking SvUTF8(sv)).

Seeking work! You can reach me at ikegami@adaelis.com
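The SvPVbyte* behaviour - give me the one-byte-per-codepoint buffer or croak - has a close Python analogue (svpvbyte here is an illustrative name, not a real API):

```python
def svpvbyte(s):
    """Return the one-byte-per-codepoint buffer for s, failing loudly when
    no such representation exists (i.e. any codepoint is above 255)."""
    try:
        return s.encode("latin-1")
    except UnicodeEncodeError:
        raise ValueError("Wide character: codepoint > 255 has no byte form")

assert svpvbyte("\xff") == b"\xff"   # fine: fits in one byte

try:
    svpvbyte("\u0100")               # can't receive characters above 255
    raise AssertionError("expected a failure")
except ValueError:
    pass
```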
by ikegami (Patriarch) on Feb 19, 2021 at 19:27 UTC
Added to my answer (parent post). In particular, tied it back to Rmpz_import.

Seeking work! You can reach me at ikegami@adaelis.com