Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re: getting Unicode character names from string

by Jim (Curate)
on Oct 11, 2012 at 01:30 UTC ( #998348=note: print w/ replies, xml ) Need Help??


in reply to getting Unicode character names from string

Sometimes a very ungeneral example makes a more immediately understandable demo:

#!perl use v5.14; use strict; use warnings; use charnames qw( :full ); use Unicode::Normalize qw( NFKD ); binmode STDOUT, ':encoding(UTF-8)'; my $greek_small_letter_alpha_with_oxia = NFKD("\N{GREEK SMALL LETTER ALPHA WITH OXIA}"); my $greek_small_letter_alpha_with_varia = NFKD("\N{GREEK SMALL LETTER ALPHA WITH VARIA}"); my $greek_small_letter_alpha_without_oxia = $greek_small_letter_alpha_with_oxia; my $greek_small_letter_alpha_without_varia = $greek_small_letter_alpha_with_varia; $greek_small_letter_alpha_without_oxia =~ s/\p{Nonspacing_Mark}//g; $greek_small_letter_alpha_without_varia =~ s/\p{Nonspacing_Mark}//g; my $greek_small_letter_alpha_without_oxia_code_point = sprintf 'U+%04x', ord $greek_small_letter_alpha_without_oxia; my $greek_small_letter_alpha_without_varia_code_point = sprintf 'U+%04x', ord $greek_small_letter_alpha_without_varia; my $output = <<END; \$greek_small_letter_alpha_with_oxia = $greek_small_letter_alpha_with_oxia \$greek_small_letter_alpha_with_varia = $greek_small_letter_alpha_with_varia \$greek_small_letter_alpha_without_oxia = $greek_small_letter_alpha_without_oxia \$greek_small_letter_alpha_without_varia = $greek_small_letter_alpha_without_varia \$greek_small_letter_alpha_without_oxia_code_point = $greek_small_letter_alpha_without_oxia_code_point \$greek_small_letter_alpha_without_varia_code_point = $greek_small_letter_alpha_without_varia_code_point END $output =~ s/(?<==)\n(?= )//g; print $output; exit 0;

This script prints…

$greek_small_letter_alpha_with_oxia                = ά
$greek_small_letter_alpha_with_varia               = ὰ
$greek_small_letter_alpha_without_oxia             = α
$greek_small_letter_alpha_without_varia            = α
$greek_small_letter_alpha_without_oxia_code_point  = U+03b1
$greek_small_letter_alpha_without_varia_code_point = U+03b1

The pattern here is to normalize the graphemes to Unicode NFKD and then strip them of all non-spacing characters. (But see http://stackoverflow.com/questions/5697171/regex-what-is-incombiningdiacriticalmarks for tchrist's much more detailed information about this pattern.)


Comment on Re: getting Unicode character names from string
Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://998348]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (10)
As of 2014-12-29 16:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (193 votes), past polls