Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Re: getting Unicode character names from string

by Jim (Curate)
on Oct 11, 2012 at 01:30 UTC ( #998348=note: print w/ replies, xml ) Need Help??


in reply to getting Unicode character names from string

Sometimes a very ungeneral example makes a more immediately understandable demo:

#!perl use v5.14; use strict; use warnings; use charnames qw( :full ); use Unicode::Normalize qw( NFKD ); binmode STDOUT, ':encoding(UTF-8)'; my $greek_small_letter_alpha_with_oxia = NFKD("\N{GREEK SMALL LETTER ALPHA WITH OXIA}"); my $greek_small_letter_alpha_with_varia = NFKD("\N{GREEK SMALL LETTER ALPHA WITH VARIA}"); my $greek_small_letter_alpha_without_oxia = $greek_small_letter_alpha_with_oxia; my $greek_small_letter_alpha_without_varia = $greek_small_letter_alpha_with_varia; $greek_small_letter_alpha_without_oxia =~ s/\p{Nonspacing_Mark}//g; $greek_small_letter_alpha_without_varia =~ s/\p{Nonspacing_Mark}//g; my $greek_small_letter_alpha_without_oxia_code_point = sprintf 'U+%04x', ord $greek_small_letter_alpha_without_oxia; my $greek_small_letter_alpha_without_varia_code_point = sprintf 'U+%04x', ord $greek_small_letter_alpha_without_varia; my $output = <<END; \$greek_small_letter_alpha_with_oxia = $greek_small_letter_alpha_with_oxia \$greek_small_letter_alpha_with_varia = $greek_small_letter_alpha_with_varia \$greek_small_letter_alpha_without_oxia = $greek_small_letter_alpha_without_oxia \$greek_small_letter_alpha_without_varia = $greek_small_letter_alpha_without_varia \$greek_small_letter_alpha_without_oxia_code_point = $greek_small_letter_alpha_without_oxia_code_point \$greek_small_letter_alpha_without_varia_code_point = $greek_small_letter_alpha_without_varia_code_point END $output =~ s/(?<==)\n(?= )//g; print $output; exit 0;

This script prints…

$greek_small_letter_alpha_with_oxia                = ά
$greek_small_letter_alpha_with_varia               = ὰ
$greek_small_letter_alpha_without_oxia             = α
$greek_small_letter_alpha_without_varia            = α
$greek_small_letter_alpha_without_oxia_code_point  = U+03b1
$greek_small_letter_alpha_without_varia_code_point = U+03b1

The pattern here is to normalize the graphemes to Unicode NFKD and then strip them of all non-spacing characters. (But see http://stackoverflow.com/questions/5697171/regex-what-is-incombiningdiacriticalmarks for tchrist's much more detailed information about this pattern.)


Comment on Re: getting Unicode character names from string
Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://998348]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (3)
As of 2015-07-06 03:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (70 votes), past polls