|Perl: the Markov chain saw|
Re: getting Unicode character names from stringby davido (Archbishop)
|on Oct 10, 2012 at 23:28 UTC||Need Help??|
Node display problem: I really wanted to post this earlier, but was having a hard time getting past how PerlMonks obliterate the UTF-8 literal within code tags. In specific, the line starting with my $string = "... should contain a string of the following characters: GREEK SMALL LETTER LAMDA, GREEK SMALL LETTER ALPHA, GREEK SMALL LETTER OXIA, GREEK SMALL LETTER ALPHA WITH VARIA, GREEK SMALL LETTER OMICRON WITH VARIA, GREEK SMALL LETTER FINAL SIGMA. You'll have to paste them into the code yourself (sorry).
In other words, line 21 should look like this: my $string = "λαάὰὸς";, but that can't be displayed within code tags.
I believe the example code below answers both of your questions:
I won't post the output, as the Monastery seems will trash the target graphemes within code tags. For those without the ambition to run it, it will display the grapheme, its code point and name, and then the decomposed base and combining characters graphemes, code points, and names.
The first question you're asking can be accomplished by matching the grapheme cluster with \X, obtaining its code point, and then calling charnames::viacode on it.
The second question you're asking deals with decomposing the grapheme. Unicode::Normalize provides NFD, which is "normalize formed by canonical decomposition". This function decomposes graphemes into their base character, followed by its combining marks. It places them into a reliable order too. substr and length will treat a decomposed string as being of a length equal to all the base characters plus all the combining marks.
If the goal is to just do a comparison of the base code-points, you should probably be using Unicode::Collate, at level 1: "alphabetic ordering". The next higher level provides "diacritic ordering", followed by "case ordering" (which combines the previous levels), and finally "tie-breaking".