raygun has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, Monks.

Unicode has precomposed fractions, such as VULGAR FRACTION THREE EIGHTHS (U+215C), that Unicode::Normalize's NFKC or NFKD function decomposes into the string "3\N{FRACTION SLASH}8".

But I can't get any function in that module to go the other way, turning the decomposed form back into the vulgar fraction. Clearly the module is aware of the equivalency of these forms. And while it wouldn't be difficult to write my own function to handle these fractions (there are only a dozen or so), rolling my own code to do a translation that already lives inside Unicode::Normalize seems like a wrongheaded approach. Is there a standard mechanism or tool for composing arbitrary fractions into precomposed ones, where such precomposed ones are available?
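
For readers who want to reproduce the behaviour described above, here is a minimal sketch. It assumes only the core Unicode::Normalize module; `codepoints` is a made-up helper name used for display. It shows that both compatibility forms produce the three-character string, and that running NFC or NFKC on that string leaves it unchanged.

```perl
use strict;
use warnings;
use charnames ':full';
use Unicode::Normalize qw(NFC NFKC NFKD);

my $precomposed = "\N{VULGAR FRACTION THREE EIGHTHS}";  # U+215C
my $decomposed  = "3\N{FRACTION SLASH}8";               # U+0033 U+2044 U+0038

# Hypothetical helper: render a string as a list of U+XXXX code points.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf( 'U+%04X', ord $_ ) } split //, $string;
}

print 'NFKD(U+215C):      ', codepoints( NFKD($precomposed) ), "\n";
print 'NFKC(U+215C):      ', codepoints( NFKC($precomposed) ), "\n";
print 'NFC(decomposed):   ', codepoints( NFC($decomposed) ),   "\n";
print 'NFKC(decomposed):  ', codepoints( NFKC($decomposed) ),  "\n";
```

All four lines print `U+0033 U+2044 U+0038`; none of them gets back to U+215C.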

Re: Unicode vulgar fraction composition
by ikegami (Pope) on Sep 24, 2020 at 02:14 UTC

    The "C" and "D" transformations are inverses of each other, but there's no inverse to "K".

    It's a destructive transformation. For example, both the ANGSTROM SIGN (Å) and the LATIN CAPITAL LETTER A WITH RING ABOVE (Å) are independent symbols with distinct meanings, but they have the same KC form and KD form. There's no way to know how to reverse the transformation to restore the original meaning.
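
A small sketch of the collision described above, assuming only Unicode::Normalize. Two different code points go in, and each normalization form returns one and the same string for both, so nothing in the output records which character was the original.

```perl
use strict;
use warnings;
use charnames ':full';
use Unicode::Normalize qw(NFKC NFKD);

my $angstrom = "\N{ANGSTROM SIGN}";                           # U+212B
my $a_ring   = "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}";  # U+00C5

# Two different code points going in...
printf "inputs differ:  %s\n", $angstrom ne $a_ring ? 'yes' : 'no';

# ...but a single result coming out of each normalization form.
printf "NFKC identical: %s\n", NFKC($angstrom) eq NFKC($a_ring) ? 'yes' : 'no';
printf "NFKD identical: %s\n", NFKD($angstrom) eq NFKD($a_ring) ? 'yes' : 'no';

# The shared NFKC result is the letter, not the sign.
printf "NFKC result:    U+%04X\n", ord( NFKC($angstrom) );
```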

      One way of thinking about it, in a simplified ASCII world, would be if you lowercased words to do a case comparison:

      chomp( my $name = lc <$fh> );
      if ( $name eq 'bob jones' ) {
          die 'rejecting annoying person';
      }
      # Now I want to restore $name to its original mixture of upper and lower case

        Good analogy (though you really want fc instead of lc to perform a case-insensitive comparison).
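
To illustrate the fc remark, a minimal sketch assuming Perl 5.16 or later (where the `fc` feature is available). The sample word is my own choice, not from the thread: it contains a sharp s, which lc leaves alone but fc folds to "ss".

```perl
use strict;
use warnings;
use charnames ':full';
use feature qw(fc unicode_strings);    # fc() needs Perl 5.16 or later

my $mixed = "Stra\N{LATIN SMALL LETTER SHARP S}e";  # "Strasse" spelled with a sharp s
my $upper = 'STRASSE';

# lc() keeps the sharp s, so the two spellings still differ.
printf "lc equal: %s\n", lc($mixed) eq lc($upper) ? 'yes' : 'no';

# fc() applies full case folding, which maps the sharp s to "ss".
printf "fc equal: %s\n", fc($mixed) eq fc($upper) ? 'yes' : 'no';
```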

        Sure, I think it's intuitive why lc('Boaty McBoat') is conceptually a "lossy" transformation (in terms of being able to restore the original string).

        But NFKC("\N{VULGAR FRACTION THREE EIGHTHS}") is conceptually "lossless": there is only one Unicode character the resultant string "3\N{FRACTION SLASH}8" could be "composed" into.

        As I wrote, I get now why NFKC is conceptually lossy in general. But—unlike with lc—some specific decompositions are exceptions.

Re: Unicode vulgar fraction composition
by kcott (Bishop) on Sep 24, 2020 at 08:22 UTC

    G'day raygun,

    As ++ikegami has explained, and you have accepted, there is no compatibility composition.

    You have asked about modules. There are some available but I can't say whether they are suitable for your purposes (as you haven't explained that part). Here's a couple. If these aren't suitable, search MetaCPAN using terms reflecting your use case.

    • HTML::Fraction may do what you want if you're working with HTML.
    • Unicode::Fraction will render any vulgar fraction into something that's intended to look like a Unicode fraction: 12345/67890 becomes something like ¹²³⁴⁵⁄₆₇₈₉₀ (that's just a rough approximation).

    "... write my own function to handle these fractions (there are only a dozen or so), ..."

    There's actually 18 in total. Three have the codepoints U+00BC - U+00BE and can be found in the PDF Code Chart "C1 Controls and Latin-1 Supplement". The other 15 have the codepoints U+2150 - U+215E and can be found in the PDF Code Chart "Number Forms".

    Writing your own function is pretty easy. I wrote one just for the fun of it: I've put it in a spoiler so as not to spoil your fun if you wanted to do this, but do feel free to look and take any code or ideas you want.
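
The spoiler's code isn't reproduced here. What follows is a separate minimal sketch of one way such a function could look, not Ken's version. It builds the reverse table by running NFKD over the 18 code points listed above, so the mapping comes from Unicode::Normalize and is not typed in by hand. The name `compose_fractions` and the rule "don't touch a fraction that has a digit on either side" are my assumptions. U+2189 VULGAR FRACTION ZERO THIRDS also exists and could be appended to the list.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

binmode STDOUT, ':encoding(UTF-8)';

# The 18 code points named above: U+00BC..U+00BE and U+2150..U+215E.
my @fraction_codepoints = ( 0x00BC .. 0x00BE, 0x2150 .. 0x215E );

# Key: decomposed form such as "3\x{2044}8"; value: the precomposed character.
my %precomposed_for = map { ( NFKD( chr $_ ) => chr $_ ) } @fraction_codepoints;

# Longest keys first, so "1\x{2044}10" is tried before any shorter candidate.
my $alternation = join '|',
    map  { quotemeta $_ }
    sort { length($b) <=> length($a) } keys %precomposed_for;

# Judgment call (an assumption): refuse to match when an ASCII digit
# touches either side, so a run like "13\x{2044}8" is left alone.
my $fraction_re = qr/(?<![0-9])($alternation)(?![0-9])/;

# Hypothetical helper name, not part of any module.
sub compose_fractions {
    my ($text) = @_;
    $text =~ s/$fraction_re/$precomposed_for{$1}/g;
    return $text;
}

print compose_fractions("3\N{U+2044}8 cup, 1\N{U+2044}10 tsp, 13\N{U+2044}8"), "\n";
```

Only sequences using U+2044 FRACTION SLASH are handled; text written with an ASCII "/" is left untouched.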

    — Ken

      Thanks much, ikegami and Ken, for the additional explanations and ideas. Super helpful!

      Expanding on ikegami's explanation of why a "compatibility composition" might be ambiguous: I also see that, for instance, U+2168 ROMAN NUMERAL NINE has a compatibility decomposition into the capital letters "I" and "X," but even if no other Unicode character has that particular decomposition, that certainly doesn't mean that any "I" followed by an "X" represents the roman numeral and should thus be "compatibility composed" into it.
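
A quick check of the roman numeral example, assuming only Unicode::Normalize. The sample words are arbitrary picks of mine that happen to contain "IX".

```perl
use strict;
use warnings;
use charnames ':full';
use Unicode::Normalize qw(NFKC NFKD);

my $roman_nine = "\N{ROMAN NUMERAL NINE}";    # U+2168

# The compatibility decomposition is just the two ASCII letters.
printf "NFKD is 'IX': %s\n", NFKD($roman_nine) eq 'IX' ? 'yes' : 'no';

# Ordinary words containing "IX" come through NFKC unchanged,
# which is the behaviour one would want to keep.
for my $word (qw(PIXEL SIX)) {
    printf "%-5s unchanged by NFKC: %s\n", $word,
        NFKC($word) eq $word ? 'yes' : 'no';
}
```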

      So yes, I see now why the concept is fraught with peril—in general. But a string like "3\N{FRACTION SLASH}8" seems to have an unambiguous meaning that is always equivalent to U+215C VULGAR FRACTION THREE EIGHTHS. So it seems a compatibility_compose_where_it_makes_sense() function could be written. But it would require judgment calls for every possible "compatibility composition," potentially not all of which would be clear-cut, so I can see why no one's rushing to implement it.
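
One way the "where it makes sense" judgment could be expressed in code is a whitelist of decomposition tags. The sketch below is my assumption about how one might go about it, not an existing function. It uses `charinfo` from the core Unicode::UCD module, keeps only characters whose decomposition carries the `<fraction>` tag, and scans just the two blocks mentioned in this thread. It also needs a per-character judgment call: U+215F FRACTION NUMERATOR ONE has the `<fraction>` tag but decomposes to "1" plus the slash with no denominator, so it is skipped.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Blocks to scan: Latin-1 Supplement and Number Forms.
my @candidates = ( 0x0080 .. 0x00FF, 0x2150 .. 0x218F );

my %compose;    # decomposed string => precomposed character
for my $cp (@candidates) {
    my $info = charinfo($cp) or next;    # unassigned code points yield undef
    my $decomposition = $info->{decomposition} // '';

    # Judgment call: only decompositions tagged <fraction> are reversed.
    next unless $decomposition =~ s/^<fraction>\s+//;

    # The remainder is a space-separated list of hex code points.
    my $decomposed = join '', map { chr( hex $_ ) } split ' ', $decomposition;

    # Skip U+215F, whose decomposition ends in the slash, not a digit.
    next unless $decomposed =~ /[0-9]\z/;

    $compose{$decomposed} = chr $cp;
}

for my $key ( sort keys %compose ) {
    printf "%-12s => U+%04X\n",
        join( ' ', map { sprintf '%04X', ord $_ } split //, $key ),
        ord $compose{$key};
}
```

A table built this way could feed the same kind of substitution as a hand-written list. Every other tag (`<compat>`, `<super>`, and so on) would need its own decision.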

Re: Unicode vulgar fraction composition
by Anonymous Monk on Sep 23, 2020 at 12:34 UTC
    I think you'll have to DIY. The module implements the unicode.org normalization algorithm; see the references in its docs. In the context of Unicode, character composition is the process of replacing a base letter followed by one or more combining characters with a single precomposed character; character decomposition is the opposite process.
      Right, which is why, on the surface, it's curious that both NFKD and NFKC return decomposed forms. Some illumination is provided in the documentation: NFKD performs "compatibility decomposition," while NFKC performs "compatibility decomposition followed by canonical composition." So apparently what I want is "compatibility composition," which it seems nothing in that module performs. Thus my question amounts to: does anything else in Perl do compatibility composition?
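
The documentation's wording can be checked directly. This is a minimal sketch, with an input string of my own choosing that mixes a compatibility character (the fi ligature) with a canonically decomposed e-acute. The ligature is expanded by both K forms and never recomposed, while the accent is recomposed by NFKC.

```perl
use strict;
use warnings;
use charnames ':full';
use Unicode::Normalize qw(NFC NFKC NFKD);

# Ligature (compatibility character) + "e" + combining acute accent.
my $input = "\N{LATIN SMALL LIGATURE FI}e\N{COMBINING ACUTE ACCENT}";

# Hypothetical helper: render a string as a list of U+XXXX code points.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf( 'U+%04X', ord $_ ) } split //, $string;
}

my $nfkd = NFKD($input);    # compatibility decomposition only
my $nfkc = NFKC($input);    # compatibility decomposition, then canonical composition

print 'NFKD: ', codepoints($nfkd), "\n";
print 'NFKC: ', codepoints($nfkc), "\n";
print 'NFKC equals NFC(NFKD): ', ( $nfkc eq NFC($nfkd) ? 'yes' : 'no' ), "\n";
```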
      > I think ...

      And I think you are just restating the obvious from the OP.

      Cheers Rolf
      (addicted to the Perl Programming Language :)