Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Re: Unicode: Perl5 equivalent to Perl6's @string.graphemes?

by TimToady (Parson)
on Nov 12, 2010 at 23:44 UTC ( #871170=note: print w/ replies, xml ) Need Help??


in reply to Unicode: Perl5 equivalent to Perl6's @string.graphemes?

Note that Japanese itself doesn't generally use mark characters at all, so worrying about them is not buying you much. (Some romanizations use U+031a (COMBINING LEFT ANGLE ABOVE) to indicate pitch accent, though.)


Comment on Re: Unicode: Perl5 equivalent to Perl6's @string.graphemes?
Re^2: Unicode: Perl5 equivalent to Perl6's @string.graphemes?
by ikegami (Pope) on Nov 13, 2010 at 00:02 UTC
    I can't speak as to whether the OP will encounter characters an in their decomposed forms or not, but about 40% of both Hiragana and Katakana have multi-code point decomposed forms. "ば" (U+3070, HIRAGANA LETTER BA) can be written as "は" (U+306F, HIRAGANA LETTER HA) plus combining "゛" (U+3099, COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK).

      Contents of a Unicode (UTF-8) text file named DriedMangos.txt:

      dried mangos
      mangues séchées
      芒果幹
      doraido mangōsu
      ドライドマンゴス
      ドライドマンゴス
      ト"ライト"マンコ"ス

      Perl script to demonstrate matching Unicode grapheme clusters using the regular expression backslash sequence \X:

      #!perl use strict; use warnings; use autodie; open my $input_fh, '<:encoding(UTF-8)', 'DriedMangos.txt'; open my $output_fh, '>:encoding(UTF-8)', 'Graphemes.txt'; while (my $line = <$input_fh>) { chomp $line; while ($line =~ m/(\X)/g) { print $output_fh "[$1]"; } print $output_fh "\n"; } close $input_fh; close $output_fh;

      Contents of the output text file named Graphemes.txt:

      [d][r][i][e][d][ ][m][a][n][g][o][s]
      [m][a][n][g][u][e][s][ ][s][é][c][h][é][e][s]
      [芒][果][幹]
      [d][o][r][a][i][d][o][ ][m][a][n][g][ō][s][u]
      [ド][ラ][イ][ド][マ][ン][ゴ][ス]
      [ド][ラ][イ][ド][マ][ン][ゴ][ス]
      [ト]["][ラ][イ][ト]["][マ][ン][コ]["][ス]

      (See http://ameblo.jp/gucciman-ikkob/entry-10317490092.html for an explanation of the peculiar last line of the file named DriedMangos.txt.)

        (See http://ameblo.jp/gucciman-ikkob/entry-10317490092.html for an explanation of the peculiar last line of the file named DriedMangos.txt.)

        Presumably, it says that plain double-quotes were used instead of a combining mark. As such, q{ コ" } is really two graphemes even though it approximates one (q{ ゴ }).

      Perl script to display the contents of the UTF-8 text file named DriedMangos.txt as a list of Unicode code points and character names:

      #!perl use strict; use warnings; use autodie; use Unicode::UCD qw( charinfo ); open my $input_fh, '<:encoding(UTF-8)', 'DriedMangos.txt'; while (my $line = <$input_fh>) { chomp $line; while ($line =~ m/(.)/g) { my $character = $1; my $codepoint = ord $character; my $charinfo = charinfo($codepoint); my $code = "U+$charinfo->{'code'}"; my $name = $charinfo->{'name'}; print "$code $name\n"; } print "\n"; } close $input_fh;

      The output of the script:

      U+0064 LATIN SMALL LETTER D U+0072 LATIN SMALL LETTER R U+0069 LATIN SMALL LETTER I U+0065 LATIN SMALL LETTER E U+0064 LATIN SMALL LETTER D U+0020 SPACE U+006D LATIN SMALL LETTER M U+0061 LATIN SMALL LETTER A U+006E LATIN SMALL LETTER N U+0067 LATIN SMALL LETTER G U+006F LATIN SMALL LETTER O U+0073 LATIN SMALL LETTER S U+006D LATIN SMALL LETTER M U+0061 LATIN SMALL LETTER A U+006E LATIN SMALL LETTER N U+0067 LATIN SMALL LETTER G U+0075 LATIN SMALL LETTER U U+0065 LATIN SMALL LETTER E U+0073 LATIN SMALL LETTER S U+0020 SPACE U+0073 LATIN SMALL LETTER S U+0065 LATIN SMALL LETTER E U+0301 COMBINING ACUTE ACCENT U+0063 LATIN SMALL LETTER C U+0068 LATIN SMALL LETTER H U+0065 LATIN SMALL LETTER E U+0301 COMBINING ACUTE ACCENT U+0065 LATIN SMALL LETTER E U+0073 LATIN SMALL LETTER S U+8292 CJK UNIFIED IDEOGRAPH-8292 U+679C CJK UNIFIED IDEOGRAPH-679C U+5E79 CJK UNIFIED IDEOGRAPH-5E79 U+0064 LATIN SMALL LETTER D U+006F LATIN SMALL LETTER O U+0072 LATIN SMALL LETTER R U+0061 LATIN SMALL LETTER A U+0069 LATIN SMALL LETTER I U+0064 LATIN SMALL LETTER D U+006F LATIN SMALL LETTER O U+0020 SPACE U+006D LATIN SMALL LETTER M U+0061 LATIN SMALL LETTER A U+006E LATIN SMALL LETTER N U+0067 LATIN SMALL LETTER G U+006F LATIN SMALL LETTER O U+0304 COMBINING MACRON U+0073 LATIN SMALL LETTER S U+0075 LATIN SMALL LETTER U U+30C9 KATAKANA LETTER DO U+30E9 KATAKANA LETTER RA U+30A4 KATAKANA LETTER I U+30C9 KATAKANA LETTER DO U+30DE KATAKANA LETTER MA U+30F3 KATAKANA LETTER N U+30B4 KATAKANA LETTER GO U+30B9 KATAKANA LETTER SU U+30C8 KATAKANA LETTER TO U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK U+30E9 KATAKANA LETTER RA U+30A4 KATAKANA LETTER I U+30C8 KATAKANA LETTER TO U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK U+30DE KATAKANA LETTER MA U+30F3 KATAKANA LETTER N U+30B3 KATAKANA LETTER KO U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK U+30B9 KATAKANA LETTER SU U+30C8 KATAKANA LETTER TO U+0022 QUOTATION MARK U+30E9 KATAKANA LETTER RA U+30A4 KATAKANA LETTER I U+30C8 KATAKANA LETTER TO U+0022 QUOTATION MARK U+30DE KATAKANA LETTER MA U+30F3 KATAKANA LETTER N U+30B3 KATAKANA LETTER KO U+0022 QUOTATION MARK U+30B9 KATAKANA LETTER SU

      The Latin characters with diacritics are in Unicode Normalization Form D (NFD). The katakana characters on the fifth line are in Unicode Normalization Form C (NFC). The same katakana characters on the sixth line are in NFD.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://871170]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (18)
As of 2014-09-18 12:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (113 votes), past polls