in reply to Unicode vulgar fraction composition

The "C" and "D" transformations are inverses of each other (at least for text that is already in one of those two forms), but there's no inverse to "K".

It's a destructive transformation. For example, both the ANGSTROM SIGN (Å) and the LATIN CAPITAL LETTER A WITH RING ABOVE (Å) are independent symbols with distinct meanings, but they have the same KC form and KD form. There's no way to know how to reverse the transformation to restore the original meaning.
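A minimal sketch of that collision, assuming the core Unicode::Normalize module and a Perl recent enough (5.16+) to resolve `\N{...}` names without an explicit `use charnames`. The variable names are only illustrative. As an aside, ANGSTROM SIGN has a singleton canonical decomposition, so this particular pair is already merged by the plain C and D forms as well.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFKC NFKD );

my $angstrom = "\N{ANGSTROM SIGN}";                             # U+212B
my $a_ring   = "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}";    # U+00C5

# The two inputs are different code points...
print "inputs differ\n" if $angstrom ne $a_ring;

# ...but after normalization nothing records which one we started with.
print "NFKC forms match\n" if NFKC($angstrom) eq NFKC($a_ring);
print "NFKD forms match\n" if NFKD($angstrom) eq NFKD($a_ring);
```

Once the two strings compare equal, no function of the normalized string alone can tell which symbol was originally meant.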

Re^2: Unicode vulgar fraction composition
by tobyink (Canon) on Sep 26, 2020 at 09:55 UTC

One way of thinking about it, in a simplified ASCII world, would be if you lowercased words to do a case comparison:

```
chomp( my $name = lc <$fh> );

if ( $name eq 'bob jones' ) {
    die 'rejecting annoying person';
}

# Now I want to restore $name to its original mixture of upper and lower case
```

Good analogy (though you really want fc instead of lc to perform a case-insensitive comparison).

For ASCII, fc does the same thing as lc though. And I specified ASCII for that reason.
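To make the fc/lc distinction concrete, here is a small sketch (assuming Perl 5.16 or later, where `fc` is available via the `fc` feature). The sample strings and variable names are just illustrations, not anything from the posts above.

```perl
use strict;
use warnings;
use feature 'fc';    # fc is available from Perl 5.16 on

my $sharp_s = "\N{LATIN SMALL LETTER SHARP S}";    # U+00DF

# ASCII input: lc and fc produce the same string.
my $ascii_same = lc('Bob Jones') eq fc('Bob Jones');

# Beyond ASCII: fc folds U+00DF to "ss", whereas lc leaves it unchanged,
# so only the fc comparison treats it as equal to "SS".
my $lc_matches = lc($sharp_s) eq lc('SS');
my $fc_matches = fc($sharp_s) eq fc('SS');

print "ASCII, lc eq fc: ", ( $ascii_same ? 'yes' : 'no' ), "\n";
print "sharp s vs SS with lc: ", ( $lc_matches ? 'equal' : 'different' ), "\n";
print "sharp s vs SS with fc: ", ( $fc_matches ? 'equal' : 'different' ), "\n";
```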

Sure, I think it's intuitive why lc('Boaty McBoat') is conceptually a "lossy" transformation (in terms of being able to restore the original string).

But NFKC("\N{VULGAR FRACTION THREE EIGHTHS}") is conceptually "lossless": there is only one Unicode character the resultant string "3\N{FRACTION SLASH}8" could be "composed" into.

As I wrote, I get now why NFKC is conceptually lossy in general. But, unlike with lc, some specific decompositions are exceptions.
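For reference, a short sketch of what NFKC actually produces for this character, again assuming the core Unicode::Normalize module (variable names are illustrative). It also checks that NFC does not turn the result back into the single fraction character.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC NFKC );

my $fraction = "\N{VULGAR FRACTION THREE EIGHTHS}";    # U+215C
my $nfkc     = NFKC($fraction);

# List the code points of the normalized string.
my @code_points = map { sprintf 'U+%04X', ord } split //, $nfkc;
print "@code_points\n";    # U+0033 U+2044 U+0038

# The normalization forms never recompose the sequence: NFC leaves it alone.
print "not restored by NFC\n" if NFC($nfkc) ne $fraction;
```

So even where a unique "recomposition" exists conceptually, it would have to be done by custom code rather than by one of the standard normalization forms.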

Consider:
• 123\N{FRACTION SLASH}8
• 12\N{VULGAR FRACTION THREE EIGHTHS}
I would read the former as "one hundred twenty-three eighths", but the latter as "twelve (plus) three eighths", so it's not completely a one-to-one relationship.
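The two strings above can be checked against each other directly. This is a sketch assuming the core Unicode::Normalize module, with variable names chosen only for illustration:

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFKC );

my $whole_number = "123\N{FRACTION SLASH}8";
my $mixed_number = "12\N{VULGAR FRACTION THREE EIGHTHS}";

# Different strings with different readings on input...
print "inputs differ\n" if $whole_number ne $mixed_number;

# ...but NFKC maps both to the same sequence: 1, 2, 3, FRACTION SLASH, 8.
print "same NFKC form\n" if NFKC($whole_number) eq NFKC($mixed_number);
```

A routine that recomposed every "3, FRACTION SLASH, 8" it found would therefore rewrite the first string into the second.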

There's no way to know that 3/8 means three-eighths. For example, it could mean March 8th. As such, there are two possible compositions for 3/8: VULGAR FRACTION THREE EIGHTHS and 3/8.