Node display problem: I wanted to post this earlier, but had a hard time getting past how PerlMonks obliterates UTF-8 literals within code tags. Specifically, the line starting with my $string = "... should contain a string of the following characters: GREEK SMALL LETTER LAMDA, GREEK SMALL LETTER ALPHA, GREEK SMALL LETTER ALPHA WITH OXIA, GREEK SMALL LETTER ALPHA WITH VARIA, GREEK SMALL LETTER OMICRON WITH VARIA, GREEK SMALL LETTER FINAL SIGMA. If that line shows HTML entities instead of the characters themselves, you'll have to paste them into the code yourself (sorry).

In other words, line 21 of the code should read my $string = "λαάὰὸς"; but that can't be displayed within code tags.

I believe the example code below answers both of your questions:

# If using a Perl version prior to v5.16, comment out the "use feature" line,
# and uncomment the BEGIN{...} block.

use feature ':5.16';

#BEGIN {
#  die "Must install Unicode::CaseFold." if ! eval "use Unicode::CaseFold; 1;";
#}

use strict;
use warnings FATAL => 'utf8';
use utf8;
use charnames ':full';

use Unicode::Normalize qw(NFD NFC);


binmode STDOUT, ':encoding(UTF-8)';


my $string = "&#955;&#945;&#8049;&#8048;&#8056;&#962;";

while ( $string =~ m/(?<grapheme>\X)/g ) {
  my $grapheme  = $+{grapheme};
  print explain( $grapheme ), "\n";
}

sub explain {
  my $grapheme = shift;
  my %pri = decompose( $grapheme );
  my %base = decompose( $pri{base} );
  my $output = <<"END_OUTPUT";
Grapheme:($grapheme)
    Dec, Hex, Name:           [$pri{cp}], [$pri{hex_str}], '$pri{name}'
    Case: (Fold,Lower,Upper): ($pri{fc}), ($pri{lc}), ($pri{uc})
    Grapheme Base:            ($pri{base}), [$base{hex_str}], '$base{name}'
END_OUTPUT
  foreach my $extend ( @{$pri{comb}} ) {
    my %ext = decompose( $extend );
    my $grapheme = fc $ext{grapheme};
    $output .= <<"END_OUTPUT";
    Combining Mark: ($grapheme )
        Dec, Hex, Name: [$ext{cp}], [$ext{hex_str}], '$ext{name}'
END_OUTPUT
  }
  return $output;
}

sub decompose {
  my $grapheme = shift;
  my $decomp   = NFD( $grapheme );
  my $cp       = ord $grapheme;
  my ( $base ) = substr($decomp, 0, 1 );
  my ( @comb ) = map { substr $decomp, $_, 1 } 1 .. length($decomp)-1;
  return (
    grapheme => $grapheme,
    cp       => $cp,
    hex_str  => sprintf( "%#0.4x", $cp ),
    name     => charnames::viacode( $cp ),
    lc       => lc $grapheme,
    uc       => uc $grapheme,
    fc       => fc $grapheme,
    base     => $base,
    comb     => [ @comb ],
  );
}

I won't post the output, as the Monastery would likely trash the target graphemes within code tags. For those without the ambition to run it: for each grapheme it displays the grapheme, its code point and name, and its case forms, followed by the decomposed base character and combining marks with their code points and names.

Your first question can be answered by matching each grapheme cluster with \X, taking its code point with ord, and then calling charnames::viacode on that code point.

Your second question deals with decomposing the grapheme. Unicode::Normalize provides NFD, which is Normalization Form D, "formed by canonical decomposition". This function decomposes each grapheme into its base character followed by its combining marks, and it places the marks in a reliable (canonical) order too. substr and length count code points, so they treat a decomposed string as having a length equal to the number of base characters plus the number of combining marks.
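To make the length point concrete, here is a minimal sketch using a single character from the string above (ALPHA WITH VARIA). The variable names are just for illustration, and the character is written as an escape so the example does not depend on how the source file is encoded.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

binmode STDOUT, ':encoding(UTF-8)';

# U+1F70 GREEK SMALL LETTER ALPHA WITH VARIA, as an escape.
my $precomposed = "\x{1F70}";
my $decomposed  = NFD($precomposed);

# length counts code points, not graphemes.
printf "precomposed: %d code point(s)\n", length $precomposed;   # 1
printf "decomposed:  %d code point(s)\n", length $decomposed;    # 2

# After NFD the base character comes first, then the combining mark:
# U+03B1 (alpha) followed by U+0300 (combining grave accent).
printf "U+%04X\n", ord($_) for split //, $decomposed;

# \X still matches the decomposed sequence as one grapheme cluster.
my $clusters = () = $decomposed =~ /\X/g;
print "grapheme clusters: $clusters\n";                           # 1
```

So if you want to count or index what a reader would call "characters", match with \X rather than relying on length or substr.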

If the goal is just to compare base characters, you should probably be using Unicode::Collate at level 1, "alphabetic ordering". Level 2 adds "diacritic ordering", level 3 adds "case ordering", and level 4 adds "tie-breaking"; each level includes the ones below it.
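A rough sketch of the level 1 comparison, assuming the default collation table that ships with Unicode::Collate is good enough for your data. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Unicode::Collate;

# level => 1 compares alphabetic (base letter) weights only, so
# diacritics and case are ignored.
my $base_only = Unicode::Collate->new( level => 1 );

# The default collator also takes diacritics and case into account.
my $full = Unicode::Collate->new;

my $alpha       = "\x{03B1}";   # GREEK SMALL LETTER ALPHA
my $alpha_varia = "\x{1F70}";   # GREEK SMALL LETTER ALPHA WITH VARIA
my $omicron     = "\x{03BF}";   # GREEK SMALL LETTER OMICRON

print "level 1, alpha vs alpha+varia: ",
  ( $base_only->eq( $alpha, $alpha_varia ) ? "equal" : "different" ), "\n";
print "level 1, alpha vs omicron:     ",
  ( $base_only->eq( $alpha, $omicron ) ? "equal" : "different" ), "\n";
print "default, alpha vs alpha+varia: ",
  ( $full->eq( $alpha, $alpha_varia ) ? "equal" : "different" ), "\n";
```

The collator object also has cmp and sort methods, so the same level 1 object can be used to sort a list while ignoring accents.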

Dave

In reply to Re: getting Unicode character names from string by davido
in thread getting Unicode character names from string by csthflk
