PerlMonks  

getting Unicode character names from string

by csthflk (Novice)
on Oct 10, 2012 at 18:46 UTC (#998286, perlquestion)
csthflk has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,
First post, long time Perl user.

I've been working on a project that translates certain material in Greek to Unicode. There are some idiosyncrasies in the data that I'm trying to isolate. I've used charnames and full character names to generate Unicode; now I need to go in reverse: evaluate each character of a string and obtain its full Unicode name.

Secondly, is there a way to see if two characters match the same basic alphabetic character, disregarding any accents or other non-critical marks? For instance, I would like for the program to be able to regard
GREEK SMALL LETTER ALPHA WITH OXIA
GREEK SMALL LETTER ALPHA WITH VARIA
as matching each other in a loose sense, since they are both small letter alphas. I suppose if my first question is answered I could simply strip " WITH.*" and compare the character names, but I wondered if there was another way to check for loose matches.

Thanks for any help.

Jason

Re: getting Unicode character names from string
by McA (Curate) on Oct 10, 2012 at 18:57 UTC

    Hi!

    Look at this: http://perldoc.perl.org/charnames.html

    I'm pretty sure you'll find the answer to your first question there.

    Best regards
    McA

      Thanks for the reply McA, but unfortunately the answer is not there. I actually did read that page before posting, and again more closely after your message. I see that I can get the name if I pass the numeric code of the Unicode character to charnames::viacode(), but I do not know how to obtain the numeric code at runtime.

      So I can match named characters, and write named characters, but I can't yet get the names from characters that are already in Unicode.

      Jason

        Most likely, Unicode::Tussle contains something you can use, or whose source you can borrow from; the uniprops program is the most likely candidate.

Re: getting Unicode character names from string
by csthflk (Novice) on Oct 10, 2012 at 19:53 UTC
    I was able to hit on a solution to the first question with some help from Google.

    This test code works well, although I confess that I don't yet fully understand the codepoint_hex method that gets me the code point I need:

    while ($testStr =~ m/(.)/g) {
        $string = pad(codepoint_hex($1));
        print("$string\n");
        print charnames::viacode($string) . "\n";
    }

    sub codepoint_hex {
        if (my $char = shift) {
            return sprintf '%2.2x', unpack('U0U*', $char);
        }
    }

    sub pad {
        my $str = shift;
        while (length $str < 4) {
            $str = "0$str";
        }
        return "0x$str";
    }

    Result:
    0x03bb
    GREEK SMALL LETTER LAMDA
    0x03b1
    GREEK SMALL LETTER ALPHA
    0x1f78
    GREEK SMALL LETTER OMICRON WITH VARIA
    0x03c2
    GREEK SMALL LETTER FINAL SIGMA


    Jason
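
    A note on the snippet above: as far as I can tell, Perl's built-in ord already returns the code point of a character in a decoded string, and charnames::viacode accepts that number directly, so the round trip through a padded hex string is optional. A minimal sketch, assuming the input has already been decoded to characters; char_names is a hypothetical helper name, not something from this thread:

```perl
use strict;
use warnings;
use charnames ();    # loads charnames::viacode without enabling \N{...} names

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: return a [code point, name] pair for every character.
# viacode returns undef for code points that have no name.
sub char_names {
    my ($str) = @_;
    return map { [ ord($_), charnames::viacode( ord $_ ) // '<unnamed>' ] }
           split //, $str;
}

# \x{...} escapes keep the example independent of the source file's encoding.
my $test = "\x{03bb}\x{03b1}\x{1f78}\x{03c2}";
printf "0x%04x %s\n", @$_ for char_names($test);
```

    Unlike the if (my $char = shift) test in codepoint_hex, this version does not skip the character "0", which Perl treats as false.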
Re: getting Unicode character names from string
by davido (Archbishop) on Oct 10, 2012 at 23:28 UTC

    Node display problem: I really wanted to post this earlier, but was having a hard time getting past how PerlMonks obliterates UTF-8 literals within code tags. Specifically, the line starting with my $string = "... should contain a string of the following characters: GREEK SMALL LETTER LAMDA, GREEK SMALL LETTER ALPHA, GREEK SMALL LETTER ALPHA WITH OXIA, GREEK SMALL LETTER ALPHA WITH VARIA, GREEK SMALL LETTER OMICRON WITH VARIA, GREEK SMALL LETTER FINAL SIGMA. You'll have to paste them into the code yourself (sorry).

    In other words, that line should look like this: my $string = "λαάὰὸς";, but that can't be displayed within code tags.

    I believe the example code below answers both of your questions:

    # If using a Perl version prior to v5.16, comment out the "use feature" line,
    # and uncomment the BEGIN{...} block.

    use feature ':5.16';

    #BEGIN {
    #    die "Must install Unicode::CaseFold." if ! eval "use Unicode::CaseFold; 1;";
    #}

    use strict;
    use warnings FATAL => 'utf8';
    use utf8;
    use charnames ':full';
    use Unicode::Normalize qw(NFD NFC);

    binmode STDOUT, ':encoding(UTF-8)';

    my $string = "&#955;&#945;&#8049;&#8048;&#8056;&#962;";

    while ( $string =~ m/(?<grapheme>\X)/g ) {
        my $grapheme = $+{grapheme};
        print explain( $+{grapheme} ), "\n";
    }

    sub explain {
        my $grapheme = shift;
        my %pri  = decompose( $grapheme );
        my %base = decompose( $pri{base} );
        my $output = <<"END_OUTPUT";
    Grapheme:($grapheme) Dec, Hex, Name: [$pri{cp}], [$pri{hex_str}], '$pri{name}'
        Case: (Fold,Lower,Upper): ($pri{fc}), ($pri{lc}), ($pri{uc})
        Grapheme Base: ($pri{base}), [$base{hex_str}], '$base{name}'
    END_OUTPUT
        foreach my $extend ( @{$pri{comb}} ) {
            my %ext = decompose( $extend );
            my $grapheme = fc $ext{grapheme};
            $output .= <<"END_OUTPUT";
        Combining Mark: ($grapheme ) Dec, Hex, Name: [$ext{cp}], [$ext{hex_str}], '$ext{name}'
    END_OUTPUT
        }
        return $output;
    }

    sub decompose {
        my $grapheme = shift;
        my $decomp   = NFD( $grapheme );
        my $cp       = ord $grapheme;
        my ( $base ) = substr( $decomp, 0, 1 );
        my ( @comb ) = map { substr $decomp, $_, 1 } 1 .. length($decomp) - 1;
        return (
            grapheme => $grapheme,
            cp       => $cp,
            hex_str  => sprintf( "%#0.4x", $cp ),
            name     => charnames::viacode( $cp ),
            lc       => lc $grapheme,
            uc       => uc $grapheme,
            fc       => fc $grapheme,
            base     => $base,
            comb     => [ @comb ],
        );
    }

    I won't post the output, as the Monastery seems to trash the target graphemes within code tags. For those without the ambition to run it: it displays each grapheme, its code point and name, and then the decomposed base and combining characters with their code points and names.

    The first question you're asking can be accomplished by matching the grapheme cluster with \X, obtaining its code point, and then calling charnames::viacode on it.
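
    One caveat worth sketching here: a grapheme cluster matched by \X can hold more than one code point (for example, a base letter followed by a combining mark), and ord reports only the first of them. The following is a minimal illustration, not the code from this post; cluster_names is a hypothetical helper, and the input is assumed to be an already-decoded character string:

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: for each grapheme cluster, list the names of all of
# its code points (ord alone would report only the first one).
sub cluster_names {
    my ($str) = @_;
    my @clusters;
    while ( $str =~ /(\X)/g ) {
        my $cluster = $1;    # copy before running another regex operation
        push @clusters,
            [ map { charnames::viacode( ord $_ ) // '<unnamed>' } split //, $cluster ];
    }
    return @clusters;
}

# Precomposed alpha with varia, then plain alpha plus a combining grave accent.
my $text = "\x{1f70}\x{03b1}\x{0300}";
print join( ' + ', @$_ ), "\n" for cluster_names($text);
```

    Both clusters render the same way on screen, but only the first is a single code point, which is why normalizing the input before comparing names is usually a good idea.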

    The second question you're asking deals with decomposing the grapheme. Unicode::Normalize provides NFD, which returns "Normalization Form D" (formed by canonical decomposition). This function decomposes each grapheme into its base character followed by its combining marks, and it places the marks in a reliable order too. substr and length then treat the decomposed string as having a length equal to the number of base characters plus the number of combining marks.
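
    To make the effect on length concrete, here is a small sketch (separate from the larger script above) that decomposes one precomposed character and names the pieces. decomposed_names is a hypothetical helper introduced only for this illustration:

```perl
use strict;
use warnings;
use charnames ();
use Unicode::Normalize qw(NFD);

# Hypothetical helper: names of the code points in the canonical decomposition.
sub decomposed_names {
    my ($char) = @_;
    return map { charnames::viacode( ord $_ ) } split //, NFD($char);
}

my $oxia = "\x{1f71}";    # GREEK SMALL LETTER ALPHA WITH OXIA

# One character before decomposition, base plus mark afterwards.
print length($oxia), " -> ", length( NFD($oxia) ), "\n";
print "$_\n" for decomposed_names($oxia);
```

    The first element of the decomposition is the base character, so comparing only that element is one way to get the loose match asked about in the original question.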

    If the goal is to just do a comparison of the base code-points, you should probably be using Unicode::Collate, at level 1: "alphabetic ordering". The next higher level provides "diacritic ordering", followed by "case ordering" (which combines the previous levels), and finally "tie-breaking".


    Dave

      These are the rare instances where you have to revert to using <pre> tags instead of <code> tags.


      Enjoy, Have FUN! H.Merijn
      If the goal is to just do a comparison of the base code-points, you should probably be using Unicode::Collate, at level 1: "alphabetic ordering". The next higher level provides "diacritic ordering", followed by "case ordering" (which combines the previous levels), and finally "tie-breaking".

      Are there clear examples of how to do this in the documentation? Or elsewhere on the interwebs? Nothing's jumping out at me when I look/search. Thanks!

        The synopsis for Unicode::Collate does a reasonable job of setting the stage, but there is a nice discussion in chapter 6 of Programming Perl (the camel book), 4th edition as well. You might also look at the Unicode Technical Standard #10: Unicode Collation Algorithm.

        Here's a brief example of doing comparisons at a lower (more relaxed) level using Unicode::Collate.

        use strict;
        use warnings FATAL => 'utf8';
        use utf8;
        
        use Unicode::Collate;
        binmode STDOUT, ':encoding(UTF-8)';
        
        my( $x, $y, $z ) = qw( α ά ὰ );
        
        my $c = Unicode::Collate->new;
        
        print "\nStrict collation rules: Level 4 (default)\n";
        print "\t cmp('α','ά'): ", $c->cmp( $x, $y ), "\n";
        print "\t cmp('ά','ὰ'): ", $c->cmp( $y, $z ), "\n";
        print "\t cmp('α','ὰ'): ", $c->cmp( $x, $z ), "\n";
        
        my $rc = Unicode::Collate->new( level => 1 );
        
        print "\nRelaxed collation rules: Level 1\n";
        print "\t cmp('α','ά'): ", $rc->cmp( $x, $y ), "\n";
        print "\t cmp('ά','ὰ'): ", $rc->cmp( $y, $z ), "\n";
        print "\t cmp('α','ὰ'): ", $rc->cmp( $x, $z ), "\n\n";
        

        And the output...

        
        Strict collation rules: Level 4 (default)
        	 cmp('α','ά'): -1
        	 cmp('ά','ὰ'): -1
        	 cmp('α','ὰ'): -1
        
        Relaxed collation rules: Level 1
        	 cmp('α','ά'): 0
        	 cmp('ά','ὰ'): 0
        	 cmp('α','ὰ'): 0
        
        

        And if the reason for doing comparisons is to handle sorting, Unicode::Collate does that too (you don't need to explicitly use Perl's core sort).
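
        A short sketch of that, assuming the same level 1 collator as in the example above. sort and eq are Unicode::Collate methods; the word list is made up for illustration:

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

my $loose = Unicode::Collate->new( level => 1 );

# alpha with varia, beta, alpha with oxia, plain alpha
my @letters = ( "\x{1f70}", "\x{03b2}", "\x{1f71}", "\x{03b1}" );

# The collator's own sort method; no explicit comparator block is needed.
my @sorted = $loose->sort(@letters);
print join( ' ', @sorted ), "\n";

# eq reports whether two strings match at the collator's level.
print "loose match\n" if $loose->eq( "\x{1f70}", "\x{1f71}" );
```

        At level 1 the three alphas compare as equal, so their relative order in the sorted list is not significant; only beta is guaranteed to come last.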


        Dave

Re: getting Unicode character names from string
by Jim (Curate) on Oct 11, 2012 at 01:30 UTC

    Sometimes a very ungeneral example makes a more immediately understandable demo:

    #!perl

    use v5.14;
    use strict;
    use warnings;
    use charnames qw( :full );
    use Unicode::Normalize qw( NFKD );

    binmode STDOUT, ':encoding(UTF-8)';

    my $greek_small_letter_alpha_with_oxia
        = NFKD("\N{GREEK SMALL LETTER ALPHA WITH OXIA}");
    my $greek_small_letter_alpha_with_varia
        = NFKD("\N{GREEK SMALL LETTER ALPHA WITH VARIA}");

    my $greek_small_letter_alpha_without_oxia
        = $greek_small_letter_alpha_with_oxia;
    my $greek_small_letter_alpha_without_varia
        = $greek_small_letter_alpha_with_varia;

    $greek_small_letter_alpha_without_oxia  =~ s/\p{Nonspacing_Mark}//g;
    $greek_small_letter_alpha_without_varia =~ s/\p{Nonspacing_Mark}//g;

    my $greek_small_letter_alpha_without_oxia_code_point
        = sprintf 'U+%04x', ord $greek_small_letter_alpha_without_oxia;
    my $greek_small_letter_alpha_without_varia_code_point
        = sprintf 'U+%04x', ord $greek_small_letter_alpha_without_varia;

    my $output = <<END;
    \$greek_small_letter_alpha_with_oxia                =
     $greek_small_letter_alpha_with_oxia
    \$greek_small_letter_alpha_with_varia               =
     $greek_small_letter_alpha_with_varia
    \$greek_small_letter_alpha_without_oxia             =
     $greek_small_letter_alpha_without_oxia
    \$greek_small_letter_alpha_without_varia            =
     $greek_small_letter_alpha_without_varia
    \$greek_small_letter_alpha_without_oxia_code_point  =
     $greek_small_letter_alpha_without_oxia_code_point
    \$greek_small_letter_alpha_without_varia_code_point =
     $greek_small_letter_alpha_without_varia_code_point
    END

    $output =~ s/(?<==)\n(?= )//g;

    print $output;

    exit 0;

    This script prints…

    $greek_small_letter_alpha_with_oxia                = ά
    $greek_small_letter_alpha_with_varia               = ὰ
    $greek_small_letter_alpha_without_oxia             = α
    $greek_small_letter_alpha_without_varia            = α
    $greek_small_letter_alpha_without_oxia_code_point  = U+03b1
    $greek_small_letter_alpha_without_varia_code_point = U+03b1
    

    The pattern here is to normalize the graphemes to Unicode NFKD and then strip all nonspacing marks from them. (But see http://stackoverflow.com/questions/5697171/regex-what-is-incombiningdiacriticalmarks for tchrist's much more detailed information about this pattern.)
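
    The same pattern can be folded into a reusable comparison. This is a minimal sketch rather than a complete solution; strip_marks and loosely_equal are hypothetical names, and the comparison stays case-sensitive:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: decompose, then drop every nonspacing mark.
sub strip_marks {
    my ($str) = @_;
    ( my $bare = NFKD($str) ) =~ s/\p{Nonspacing_Mark}//g;
    return $bare;
}

# Hypothetical helper: true when two strings share the same base characters.
sub loosely_equal {
    my ( $x, $y ) = @_;
    return strip_marks($x) eq strip_marks($y);
}

# alpha with oxia versus alpha with varia
print loosely_equal( "\x{1f71}", "\x{1f70}" ) ? "match\n" : "no match\n";
```

    Because this removes every nonspacing mark, it also drops marks such as the iota subscript, which may or may not count as "non-critical" for a given data set.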

Re: getting Unicode character names from string
by Tux (Monsignor) on Oct 11, 2012 at 06:40 UTC

    Feel free to use any code from my uchar-nopro script that I use on a daily basis in order to analyze encodings in undocumented data files:

    $ uchar-nopro -v λαάὰὸς
    λ U003bb \N{GREEK SMALL LETTER LAMDA}
    α U003b1 \N{GREEK SMALL LETTER ALPHA}
    ά U01f71 \N{GREEK SMALL LETTER ALPHA WITH OXIA}
    ὰ U01f70 \N{GREEK SMALL LETTER ALPHA WITH VARIA}
    ὸ U01f78 \N{GREEK SMALL LETTER OMICRON WITH VARIA}
    ς U003c2 \N{GREEK SMALL LETTER FINAL SIGMA}
    

    Enjoy, Have FUN! H.Merijn
Re: getting Unicode character names from string
by csthflk (Novice) on Oct 16, 2012 at 16:20 UTC
    Thanks to everyone who contributed to this thread. Very helpful. :-)
