Re: getting Unicode character names from string

by davido (Archbishop)
on Oct 10, 2012 at 23:28 UTC (#998343)


in reply to getting Unicode character names from string

Node display problem: I really wanted to post this earlier, but had a hard time getting past how PerlMonks obliterates UTF-8 literals within code tags. Specifically, the line starting with my $string = "... should contain a string of the following characters: GREEK SMALL LETTER LAMDA, GREEK SMALL LETTER ALPHA, GREEK SMALL LETTER ALPHA WITH OXIA, GREEK SMALL LETTER ALPHA WITH VARIA, GREEK SMALL LETTER OMICRON WITH VARIA, GREEK SMALL LETTER FINAL SIGMA. If that line shows numeric entities instead, you'll have to paste the characters into the code yourself (sorry).

In other words, that line should read my $string = "λαάὰὸς";, which PerlMonks can't display within code tags.

I believe the example code below answers both of your questions:

# If using a Perl version prior to v5.16, comment out the "use feature" line,
# and uncomment the BEGIN{...} block.

use feature ':5.16';

#BEGIN {
#    die "Must install Unicode::CaseFold." if ! eval "use Unicode::CaseFold; 1;";
#}

use strict;
use warnings FATAL => 'utf8';
use utf8;
use charnames ':full';
use Unicode::Normalize qw(NFD NFC);

binmode STDOUT, ':encoding(UTF-8)';

my $string = "λαάὰὸς";

# \X matches one extended grapheme cluster per iteration.
while ( $string =~ m/(?<grapheme>\X)/g ) {
    my $grapheme = $+{grapheme};
    print explain( $grapheme ), "\n";
}

# Build a report for one grapheme: its own details, its base character,
# and each of its combining marks.
sub explain {
    my $grapheme = shift;
    my %pri  = decompose( $grapheme );
    my %base = decompose( $pri{base} );
    my $output = <<"END_OUTPUT";
Grapheme:($grapheme)  Dec, Hex, Name: [$pri{cp}], [$pri{hex_str}], '$pri{name}'
    Case: (Fold,Lower,Upper): ($pri{fc}), ($pri{lc}), ($pri{uc})
    Grapheme Base: ($pri{base}), [$base{hex_str}], '$base{name}'
END_OUTPUT
    foreach my $extend ( @{$pri{comb}} ) {
        my %ext = decompose( $extend );
        my $grapheme = fc $ext{grapheme};
        $output .= <<"END_OUTPUT";
    Combining Mark: ($grapheme )  Dec, Hex, Name: [$ext{cp}], [$ext{hex_str}], '$ext{name}'
END_OUTPUT
    }
    return $output;
}

# Return the code point, name, and case variants of a grapheme, plus the
# base character and combining marks found in its NFD form.
sub decompose {
    my $grapheme = shift;
    my $decomp   = NFD( $grapheme );
    my $cp       = ord $grapheme;
    my ( $base ) = substr( $decomp, 0, 1 );
    my ( @comb ) = map { substr $decomp, $_, 1 } 1 .. length($decomp) - 1;
    return (
        grapheme => $grapheme,
        cp       => $cp,
        hex_str  => sprintf( "%#0.4x", $cp ),
        name     => charnames::viacode( $cp ),
        lc       => lc $grapheme,
        uc       => uc $grapheme,
        fc       => fc $grapheme,
        base     => $base,
        comb     => [ @comb ],
    );
}

I won't post the output, as the Monastery seems to trash the target graphemes within code tags. For those without the ambition to run it: it displays each grapheme with its code point and name, followed by the code points and names of its decomposed base character and combining marks.

The first question you're asking can be accomplished by matching the grapheme cluster with \X, obtaining its code point, and then calling charnames::viacode on it.
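Stripped down to just that first step, it might look like the sketch below. The helper name grapheme_names is made up for illustration, and it names only the first code point of each grapheme, which is enough for precomposed characters. The string is written with \x{...} escapes so it survives copy-and-paste.

```perl
use strict;
use warnings;
use charnames ':full';

# Hypothetical helper (not part of the code above): return the Unicode
# name of the first code point of every grapheme cluster in $string.
sub grapheme_names {
    my ($string) = @_;
    my @names;
    while ( $string =~ /(\X)/g ) {
        push @names, charnames::viacode( ord $1 );
    }
    return @names;
}

# U+03BB, U+03B1, U+03C2: lamda, alpha, final sigma
print "$_\n" for grapheme_names("\x{3BB}\x{3B1}\x{3C2}");
```

This should print GREEK SMALL LETTER LAMDA, GREEK SMALL LETTER ALPHA, and GREEK SMALL LETTER FINAL SIGMA, one per line.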

The second question you're asking deals with decomposing the grapheme. Unicode::Normalize provides NFD: Normalization Form D, "formed by canonical decomposition". This function decomposes each grapheme into its base character followed by its combining marks, and puts the marks into a reliable (canonical) order. substr and length then count every base character and every combining mark in the decomposed string as a separate character.
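Here is a minimal sketch of the decomposition step on its own, using one precomposed character. The helper name decomposed_names is an assumption for illustration; NFD and charnames::viacode are the only library calls involved.

```perl
use strict;
use warnings;
use charnames ':full';
use Unicode::Normalize qw(NFD);

# Hypothetical helper (not part of the code above): decompose a grapheme
# with NFD and name each resulting code point, base character first.
sub decomposed_names {
    my ($grapheme) = @_;
    return map { charnames::viacode( ord $_ ) } split //, NFD($grapheme);
}

my $alpha_varia = "\x{1F70}";    # GREEK SMALL LETTER ALPHA WITH VARIA

# length counts code points, so the decomposed form is one longer.
printf "length before NFD: %d, after NFD: %d\n",
    length($alpha_varia), length( NFD($alpha_varia) );

print "$_\n" for decomposed_names($alpha_varia);
```

For this character the length goes from 1 to 2, and the two names are GREEK SMALL LETTER ALPHA and COMBINING GRAVE ACCENT.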

If the goal is to just do a comparison of the base code-points, you should probably be using Unicode::Collate, at level 1: "alphabetic ordering". The next higher level provides "diacritic ordering", followed by "case ordering" (which combines the previous levels), and finally "tie-breaking".


Dave


Re^2: getting Unicode character names from string
by Tux (Monsignor) on Oct 11, 2012 at 06:41 UTC

    These are the rare instances where you have to revert to using <pre> tags instead of <code> tags.


    Enjoy, Have FUN! H.Merijn
Re^2: getting Unicode character names from string
by Jim (Curate) on Oct 11, 2012 at 18:23 UTC
    If the goal is to just do a comparison of the base code-points, you should probably be using Unicode::Collate, at level 1: "alphabetic ordering". The next higher level provides "diacritic ordering", followed by "case ordering" (which combines the previous levels), and finally "tie-breaking".

    Are there clear examples of how to do this in the documentation? Or elsewhere on the interwebs? Nothing's jumping out at me when I look/search. Thanks!

      The synopsis for Unicode::Collate does a reasonable job of setting the stage, but there is a nice discussion in chapter 6 of Programming Perl (the camel book), 4th edition as well. You might also look at the Unicode Technical Standard #10: Unicode Collation Algorithm.

      Here's a brief example of doing comparisons at a lower (more relaxed) level using Unicode::Collate.

      use strict;
      use warnings FATAL => 'utf8';
      use utf8;
      
      use Unicode::Collate;
      binmode STDOUT, ':encoding(UTF-8)';
      
      my( $x, $y, $z ) = qw( α ά ὰ );
      
      my $c = Unicode::Collate->new;
      
      print "\nStrict collation rules: Level 4 (default)\n";
      print "\t cmp('α','ά'): ", $c->cmp( $x, $y ), "\n";
      print "\t cmp('ά','ὰ'): ", $c->cmp( $y, $z ), "\n";
      print "\t cmp('α','ὰ'): ", $c->cmp( $x, $z ), "\n";
      
      my $rc = Unicode::Collate->new( level => 1 );
      
      print "\nRelaxed collation rules: Level 1\n";
      print "\t cmp('α','ά'): ", $rc->cmp( $x, $y ), "\n";
      print "\t cmp('ά','ὰ'): ", $rc->cmp( $y, $z ), "\n";
      print "\t cmp('α','ὰ'): ", $rc->cmp( $x, $z ), "\n\n";
      

      And the output...

      
      Strict collation rules: Level 4 (default)
      	 cmp('α','ά'): -1
      	 cmp('ά','ὰ'): -1
      	 cmp('α','ὰ'): -1
      
      Relaxed collation rules: Level 1
      	 cmp('α','ά'): 0
      	 cmp('ά','ὰ'): 0
      	 cmp('α','ὰ'): 0
      
      

      And if the reason for doing comparisons is to handle sorting, Unicode::Collate does that too (you don't need to explicitly use Perl's core sort).
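      A rough sketch of what that looks like, using the collator's own sort method. The helper name collated_sort and the word list are made up for illustration, and the strings use \x{...} escapes so they survive copy-and-paste.

      ```perl
      use strict;
      use warnings;
      use Unicode::Collate;

      # Hypothetical helper (not from the thread): sort strings with the
      # default collator instead of Perl's code-point-based sort.
      sub collated_sort {
          my @strings = @_;
          return Unicode::Collate->new->sort(@strings);
      }

      my @words = (
          "\x{3B2}\x{3B1}",    # beta, alpha
          "\x{3AC}\x{3B2}",    # alpha with tonos, beta
          "\x{3B1}\x{3B2}",    # alpha, beta
      );

      binmode STDOUT, ':encoding(UTF-8)';
      print "$_\n" for collated_sort(@words);
      ```

      The collator should put plain alpha-beta first, then the accented alpha-beta, then beta-alpha. Perl's core sort compares code points, so it would put the accented word (U+03AC) ahead of the plain one (U+03B1).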


      Dave
