Re: getting Unicode character names from string

in reply to getting Unicode character names from string

Sometimes a very specific example makes for a more immediately understandable demo:

#!perl

use v5.14;
use strict;
use warnings;
use charnames qw( :full );
use Unicode::Normalize qw( NFKD );

binmode STDOUT, ':encoding(UTF-8)';

my $greek_small_letter_alpha_with_oxia
    = NFKD("\N{GREEK SMALL LETTER ALPHA WITH OXIA}");
my $greek_small_letter_alpha_with_varia
    = NFKD("\N{GREEK SMALL LETTER ALPHA WITH VARIA}");

my $greek_small_letter_alpha_without_oxia
    = $greek_small_letter_alpha_with_oxia;
my $greek_small_letter_alpha_without_varia
    = $greek_small_letter_alpha_with_varia;

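# After NFKD, each accent is a separate combining character (general
# category Mn), so deleting those leaves only the base letter.
$greek_small_letter_alpha_without_oxia
    =~ s/\p{Nonspacing_Mark}//g;
$greek_small_letter_alpha_without_varia
    =~ s/\p{Nonspacing_Mark}//g;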

my $greek_small_letter_alpha_without_oxia_code_point
    = sprintf 'U+%04x', ord $greek_small_letter_alpha_without_oxia;
my $greek_small_letter_alpha_without_varia_code_point
    = sprintf 'U+%04x', ord $greek_small_letter_alpha_without_varia;

my $output = <<END;
\$greek_small_letter_alpha_with_oxia                =
 $greek_small_letter_alpha_with_oxia
\$greek_small_letter_alpha_with_varia               =
 $greek_small_letter_alpha_with_varia
\$greek_small_letter_alpha_without_oxia             =
 $greek_small_letter_alpha_without_oxia
\$greek_small_letter_alpha_without_varia            =
 $greek_small_letter_alpha_without_varia
\$greek_small_letter_alpha_without_oxia_code_point  =
 $greek_small_letter_alpha_without_oxia_code_point
\$greek_small_letter_alpha_without_varia_code_point =
 $greek_small_letter_alpha_without_varia_code_point
END

# Join each "name =" line with the indented value line that follows it.
$output =~ s/(?<==)\n(?= )//g;

print $output;

exit 0;

This script prints…

$greek_small_letter_alpha_with_oxia                = ά
$greek_small_letter_alpha_with_varia               = ὰ
$greek_small_letter_alpha_without_oxia             = α
$greek_small_letter_alpha_without_varia            = α
$greek_small_letter_alpha_without_oxia_code_point  = U+03b1
$greek_small_letter_alpha_without_varia_code_point = U+03b1

The pattern here is to normalize each string to Unicode NFKD, which decomposes every accented letter into a base letter followed by its combining marks, and then strip all nonspacing marks (\p{Nonspacing_Mark}, general category Mn). (But see http://stackoverflow.com/questions/5697171/regex-what-is-incombiningdiacriticalmarks for tchrist's much more detailed information about this pattern.)
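
For anyone who wants the same pattern in a more general form, here is a minimal sketch. The helper names strip_marks and describe are made up for illustration; they are not part of any module. It uses only NFKD from Unicode::Normalize and charnames::viacode, and it prints the code point and character name of whatever is left after the marks are removed, which ties back to the original question about character names.

```perl
#!perl

use strict;
use warnings;
use charnames ();
use Unicode::Normalize qw( NFKD );

# Hypothetical helper: decompose to NFKD, then delete every
# nonspacing mark so that only the base characters remain.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFKD($text);
    $decomposed =~ s/\p{Nonspacing_Mark}//g;
    return $decomposed;
}

# Hypothetical helper: one "U+XXXX NAME" string per character.
sub describe {
    my ($text) = @_;
    my @lines;
    for my $char ( split //, $text ) {
        my $code = ord $char;
        my $name = charnames::viacode($code);
        $name = '(unnamed)' unless defined $name;
        push @lines, sprintf 'U+%04X %s', $code, $name;
    }
    return @lines;
}

binmode STDOUT, ':encoding(UTF-8)';

# U+1F71 is alpha with oxia, U+1F70 is alpha with varia.
print "$_\n" for describe( strip_marks("\x{1F71}\x{1F70}") );
```

With the two accented alphas from the script above as input, this should print "U+03B1 GREEK SMALL LETTER ALPHA" twice. Note that NFKD also applies compatibility decompositions (ligatures, superscripts and so on), so NFD may be the better choice if only the accents should go.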
