Re: getting Unicode character names from string

Node display problem: I really wanted to post this earlier, but was having a hard time getting past how PerlMonks obliterates UTF-8 literals within code tags. Specifically, the line starting with my $string = "... should contain a string of the following characters: GREEK SMALL LETTER LAMDA, GREEK SMALL LETTER ALPHA, GREEK SMALL LETTER ALPHA WITH OXIA, GREEK SMALL LETTER ALPHA WITH VARIA, GREEK SMALL LETTER OMICRON WITH VARIA, GREEK SMALL LETTER FINAL SIGMA. You'll have to paste them into the code yourself (sorry).

In other words, line 21 should look like this: my $string = "λαάὰὸς";, but that can't be displayed within code tags.

I believe the example code below answers both of your questions:

# If using a Perl version prior to v5.16, comment out the "use feature" line,
# and uncomment the BEGIN{...} block.

use feature ':5.16';

#BEGIN {
#  die "Must install Unicode::CaseFold." if ! eval "use Unicode::CaseFold; 1;";
#}

use strict;
use warnings FATAL => 'utf8';
use utf8;
use charnames ':full';

use Unicode::Normalize qw(NFD NFC);


binmode STDOUT, ':encoding(UTF-8)';


my $string = "&#955;&#945;&#8049;&#8048;&#8056;&#962;";

while ( $string =~ m/(?<grapheme>\X)/g ) {
  my $grapheme = $+{grapheme};
  print explain( $grapheme ), "\n";
}

sub explain {
  my $grapheme = shift;
  my %pri = decompose( $grapheme );
  my %base = decompose( $pri{base} );
  my $output = <<"END_OUTPUT";
Grapheme:($grapheme)
    Dec, Hex, Name:           [$pri{cp}], [$pri{hex_str}], '$pri{name}'
    Case: (Fold,Lower,Upper): ($pri{fc}), ($pri{lc}), ($pri{uc})
    Grapheme Base:            ($pri{base}), [$base{hex_str}], '$base{name}'
END_OUTPUT
  foreach my $extend ( @{$pri{comb}} ) {
    my %ext = decompose( $extend );
    my $grapheme = fc $ext{grapheme};
    $output .= <<"END_OUTPUT";
    Combining Mark: ($grapheme )
        Dec, Hex, Name: [$ext{cp}], [$ext{hex_str}], '$ext{name}'
END_OUTPUT
  }
  return $output;
}

sub decompose {
  my $grapheme = shift;
  my $decomp   = NFD( $grapheme );
  my $cp       = ord $grapheme;
  my ( $base ) = substr($decomp, 0, 1 );
  my ( @comb ) = map { substr $decomp, $_, 1 } 1 .. length($decomp)-1;
  return (
    grapheme => $grapheme,
    cp       => $cp,
    hex_str  => sprintf( "%#0.4x", $cp ),
    name     => charnames::viacode( $cp ),
    lc       => lc $grapheme,
    uc       => uc $grapheme,
    fc       => fc $grapheme,
    base     => $base,
    comb     => [ @comb ],
  );
}

I won't post the output, as the Monastery would apparently trash the target graphemes within code tags. For those without the ambition to run it: it displays each grapheme with its code point, name, and case forms, and then the decomposed base character and combining marks with their code points and names.

The first question you're asking can be accomplished by matching the grapheme cluster with \X, obtaining its code point, and then calling charnames::viacode on it.
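Here is that first step on its own, as a minimal sketch. It writes the sample string with \x{...} escapes so no UTF-8 literal has to survive a copy and paste. The grapheme_names helper is a name I made up for illustration, and it only names the first code point of each cluster, so it assumes precomposed input.

```perl
use strict;
use warnings;
use charnames ':full';

binmode STDOUT, ':encoding(UTF-8)';

# The same six characters as the sample string, written as escapes.
my $string = "\x{3BB}\x{3B1}\x{1F71}\x{1F70}\x{1F78}\x{3C2}";

# Hypothetical helper (not part of the original code): returns one name
# per grapheme cluster, taken from the cluster's first code point.
sub grapheme_names {
    my ($text) = @_;
    my @names;
    while ( $text =~ m/(\X)/g ) {
        my $name = charnames::viacode( ord $1 );
        push @names, defined $name ? $name : '(unnamed)';
    }
    return @names;
}

print "$_\n" for grapheme_names($string);
```

If a cluster holds a base character plus separate combining marks, only the base gets named here; the decomposition step covers the rest.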

The second question you're asking deals with decomposing the grapheme. Unicode::Normalize provides NFD, which returns Normalization Form D (formed by canonical decomposition). This function decomposes each grapheme into its base character followed by its combining marks, and it places the marks in a reliable (canonical) order too. substr and length then treat the decomposed string as having a length equal to the number of base characters plus the number of combining marks.
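To make the length point concrete, here is a small sketch using one character from the sample string, U+1F71 (alpha with oxia), which canonically decomposes into GREEK SMALL LETTER ALPHA followed by COMBINING ACUTE ACCENT. The code_point_names helper is hypothetical, added only for this illustration.

```perl
use strict;
use warnings;
use charnames ':full';
use Unicode::Normalize qw(NFD);

# U+1F71 GREEK SMALL LETTER ALPHA WITH OXIA, written as an escape.
my $precomposed = "\x{1F71}";
my $decomposed  = NFD($precomposed);

# Hypothetical helper: the name of every code point in a string.
sub code_point_names {
    my ($text) = @_;
    return map { charnames::viacode( ord $_ ) } split //, $text;
}

# One code point before decomposition, base plus one mark afterwards.
printf "length before NFD: %d\n", length $precomposed;
printf "length after NFD:  %d\n", length $decomposed;
print  "  $_\n" for code_point_names($decomposed);
```

Note that \X still matches the decomposed pair as a single grapheme cluster, which is why the loop in the main example walks clusters first and decomposes afterwards.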

If the goal is to just do a comparison of the base code-points, you should probably be using Unicode::Collate, at level 1: "alphabetic ordering". The next higher level provides "diacritic ordering", followed by "case ordering" (which combines the previous levels), and finally "tie-breaking".

Dave


Replies are listed 'Best First'.
Re^2: getting Unicode character names from string by Tux (Canon) on Oct 11, 2012 at 06:41 UTC
These are the rare instances where you have to revert to using `<pre>` tags instead of `<code>` tags.

Enjoy, Have FUN! H.Merijn
Re^2: getting Unicode character names from string by Jim (Curate) on Oct 11, 2012 at 18:23 UTC
If the goal is to just do a comparison of the base code-points, you should probably be using Unicode::Collate, at level 1: "alphabetic ordering". The next higher level provides "diacritic ordering", followed by "case ordering" (which combines the previous levels), and finally "tie-breaking".

Are there clear examples of how to do this in the documentation? Or elsewhere on the interwebs? Nothing's jumping out at me when I look/search.

Thanks!
Re^3: getting Unicode character names from string by davido (Cardinal) on Oct 11, 2012 at 20:51 UTC
The synopsis for Unicode::Collate does a reasonable job of setting the stage, but there is a nice discussion in chapter 6 of Programming Perl (the camel book), 4th edition as well. You might also look at the Unicode Technical Standard #10: Unicode Collation Algorithm.

Here's a brief example of doing comparisons at a lower (more relaxed) level using Unicode::Collate.

use strict;
use warnings FATAL => 'utf8';
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

my( $x, $y, $z ) = qw( α ά ὰ );

my $c = Unicode::Collate->new;
print "\nStrict collation rules: Level 4 (default)\n";
print "\t cmp('α','ά'): ", $c->cmp( $x, $y ), "\n";
print "\t cmp('ά','ὰ'): ", $c->cmp( $y, $z ), "\n";
print "\t cmp('α','ὰ'): ", $c->cmp( $x, $z ), "\n";

my $rc = Unicode::Collate->new( level => 1 );
print "\nRelaxed collation rules: Level 1\n";
print "\t cmp('α','ά'): ", $rc->cmp( $x, $y ), "\n";
print "\t cmp('ά','ὰ'): ", $rc->cmp( $y, $z ), "\n";
print "\t cmp('α','ὰ'): ", $rc->cmp( $x, $z ), "\n\n";

And the output...

Strict collation rules: Level 4 (default)
         cmp('α','ά'): -1
         cmp('ά','ὰ'): -1
         cmp('α','ὰ'): -1

Relaxed collation rules: Level 1
         cmp('α','ά'): 0
         cmp('ά','ὰ'): 0
         cmp('α','ὰ'): 0

And if the reason for doing comparisons is to handle sorting, Unicode::Collate does that too (you don't need to explicitly use Perl's core sort).

Dave
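As a rough sketch of the sorting mentioned above: the collator object has its own sort method, so the list can be handed to it directly. The sample letters are my own choice for illustration and are written as escapes so they survive copy and paste.

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Greek letters as escapes: beta, alpha with varia, gamma, plain alpha.
my @letters = ( "\x{3B2}", "\x{1F70}", "\x{3B3}", "\x{3B1}" );

# Default collator: all four levels are compared, so the plain alpha
# sorts ahead of the accented alpha, and both sort ahead of beta.
my $collator = Unicode::Collate->new;
my @sorted   = $collator->sort(@letters);

print join( ' ', @sorted ), "\n";
```

With a level 1 collator the two alphas would compare as equal, so I'd stick with the default when the relative order of accented and unaccented forms matters.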

