http://www.perlmonks.org?node_id=871088

jonadab has asked for the wisdom of the Perl Monks concerning the following question:

Okay, so I'm trying to parse a file that contains Japanese text (a dictionary file, in order to populate a spaced repetition system with a trainload of vocabulary words). I believe the encoding of the file is UTF8. The trouble I'm running into is in taking a word, which is written in kanji (possibly with okurigana) and separating it into graphemes. At least, I *think* graphemes is the right word. Given an example like "持って行く", I want to parse it into ("持", "っ", "て", "行", "く").

I've been reading up in perlunicode, and I have the following code. (Don't worry about the database stuff; that's all fine.)

#!/usr/bin/perl
# -*- cperl -*-

use utf8;

require "./db.pl";

my ($pack) = findrecord('pack', 'packname', 'gjiten');
die "You must create the gjiten pack first (and make sure its language is set correctly).\n" if not ref $pack;
my $packid = $$pack{id};
my $langid = $$pack{language};
my $userid = $$pack{user} || 1; # We guess that the first user is probably the sysadmin.
my ($category) = findrecord('category', 'name', 'word');
die "How can there be no word category?\n" if not ref $category;
my $catid = $$category{id};

my $dicfile = '/usr/share/gjiten/dics/edict';
open DIC, #'<:encoding(UTF-8)',
  '<', $dicfile
  or die "Cannot open dictionary: $dicfile";
my ($total, $skipped, $japanese, $already, $inserted);
while (<DIC>) {
  my $line = $_;
  my ($firstchar) = $line =~ /^(.)/;
  ++$total;
  if ($firstchar le '~') {
    ++$skipped;
    print "Skipping (starts with low character '$firstchar'): $line";
  } else {
    ++$japanese;
    die if $japanese > 50 or $total > 100;
    my ($word, @def) = split m{/}, $line;
    my ($spelling, $reading) = $word =~ /([^[ ]+) ?\s*(?:[[](.*?)[]])?/;
    my ($chars, @char) = $spelling;
    #while ($chars) {
    #  my ($c) = $chars =~ m/^(.)/;
    #  $chars =~ s/^(.)//;
    #  push @char, $c;
    #}
    #my @char = $spelling =~ m/(\X)/g;
    my @char = $spelling =~ m/(\P{M}\p{M}*)/g;
    use Data::Dumper;
    print Dumper(+{ word     => $word,
                    spelling => $spelling,
                    reading  => $reading,
                    defs     => \@def,
                    char     => \@char,
                  });
  }
}

print "Of $total lines, skipped $skipped.\nFound $japanese Japanese words. $already already had cards.\n$inserted new cards would be created.\n";

The words as printed out are correct, but the list of characters is a list of bytes. In some circumstances that would be what I want, but here it's not.

(To the best of my knowledge) the really key line there is this one:

my @char = $spelling =~ m/(\P{M}\p{M}*)/g;

I must have misunderstood something in the docs, because the way I read them that should match graphemes, and the commented out version of the pattern match, using \X, should do so as well. (I'm using the Debian build of perl 5.10.0.) The commented out while loop above that is another attempt. All of them produce the same results: a list of bytes. My understanding of the docs leads me to believe these patterns should match graphemes, but clearly I'm confused about something. What am I doing wrong?

The one thing I did that produced different results was to change the open statement to use '<:encoding(UTF-8)' for the mode instead of '<', but that transforms everything (not just the chars, but also the words) into stuff like "\x{ff11}", which does not seem useful to me.

Do I have the file encoding wrong somehow? (The Gjiten docs say the file must be UTF-8 or Gjiten can't read it; Gjiten can read it, so I assume it's UTF-8. Is there a way to find out?)

Re: Unicode: Perl5 equivalent to Perl6's @string.graphemes?
by ikegami (Patriarch) on Nov 12, 2010 at 16:45 UTC

    The one thing I did that produced different results was to change the open statement to use '<:encoding(UTF-8)' for the mode instead of '<', but that transforms everything (not just the chars, but also the words) into stuff like "\x{ff11}", which does not seem useful to me.

    That is the correct fix. Dumper produces Perl code, primarily for debugging purposes. When it comes to characters where encoding is likely to matter, it uses escapes to avoid mixups. As a debugging tool, it would rather produce harder-to-read output than output that looks wrong because the caller didn't properly encode it.

    If you hadn't used Dumper (just printed the string) and if you encoded your output (use open ':std', ':locale';), then you would get the actual characters.

    (Each of the graphemes you posted was represented by a single character, so I didn't bother using \P{M}.)

    use strict;
    use warnings;
    use open ':std', ':locale';
    use Data::Dumper qw( Dumper );

    my $file = do {
        open(my $fh, '<:encoding(UTF-8)', 'jap') or die $!;
        local $/;
        <$fh>
    };

    print(Dumper($file));
    print("[$_]") for $file =~ /(.)/sg;
    print("\n");

    $VAR1 = "\x{6301}\x{3063}\x{3066}\x{884c}\x{304f}
    ";
    [持][っ][て][行][く][
    ]
    
      Dumper ... uses escapes to avoid mixups.

      Ah, that's what I was misunderstanding. I saw that stuff and thought the encoding handling was doing it and that that's what my data were actually looking like, which would be bad. If that's just Dumper's way of escaping non-ASCII characters, I can deal with that. Thanks a million. I thought I was going insane.

      Can you please explain why you used use open ':std', ':locale' instead of, say, use open qw( :encoding(UTF-8) :std )? What do :std and :locale together do?

      (Learning how to process Unicode text using Perl: One step forward, two steps back. Every time I think I've learned something, I haven't.)

        What do :std and :locale together do?

        The same as :std and :encoding together, just without having to specify the encoding.

        Can you please explain why you used use open ':std', ':locale'

        Because I don't know the encoding of the terminals in which your program will run. I don't even know that they all use the same encoding.
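
The point of this subthread (decode the input first, and do not mistake Dumper's escapes for damaged data) can be sketched as follows. This is only an illustration: the sample word is the one from the question, the raw bytes are simulated with encode() rather than read from the edict file, and the UTF-8 layer on STDOUT is an assumption about the terminal.

```perl
use strict;
use warnings;
use Encode qw( decode encode );
use Data::Dumper qw( Dumper );

binmode(STDOUT, ':encoding(UTF-8)');   # assumption: the terminal expects UTF-8

# Simulate what a plain '<' open hands back: the raw UTF-8 bytes of 持って行く.
my $bytes = encode('UTF-8', "\x{6301}\x{3063}\x{3066}\x{884c}\x{304f}");

# Matching against the undecoded bytes yields one piece per byte.
my @byte_pieces = $bytes =~ /(\X)/g;

# Decoding first yields one piece per character.
my $text  = decode('UTF-8', $bytes);
my @chars = $text =~ /(\X)/g;

printf "%d pieces from bytes, %d from decoded text\n",
    scalar(@byte_pieces), scalar(@chars);
print "[$_]" for @chars;
print "\n";

# Dumper shows the same decoded string with \x{...} escapes;
# that is only how it is displayed, the data is unchanged.
local $Data::Dumper::Terse = 1;
my $dumped = Dumper($text);
print $dumped;
```

Reading the file through '<:encoding(UTF-8)' does the decode step at the I/O layer, so the explicit decode() call is only needed when the data arrives as bytes.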

Re: Unicode: Perl5 equivalent to Perl6's @string.graphemes?
by TimToady (Parson) on Nov 12, 2010 at 23:44 UTC
    Note that Japanese itself doesn't generally use mark characters at all, so worrying about them is not buying you much. (Some romanizations use U+031a (COMBINING LEFT ANGLE ABOVE) to indicate pitch accent, though.)
      I can't speak as to whether the OP will encounter characters in their decomposed forms or not, but about 40% of both Hiragana and Katakana have multi-code point decomposed forms. "ば" (U+3070, HIRAGANA LETTER BA) can be written as "は" (U+306F, HIRAGANA LETTER HA) plus combining "゛" (U+3099, COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK).

        Contents of a Unicode (UTF-8) text file named DriedMangos.txt:

        dried mangos
        mangues séchées
        芒果幹
        doraido mangōsu
        ドライドマンゴス
        ドライドマンゴス
        ト"ライト"マンコ"ス

        Perl script to demonstrate matching Unicode grapheme clusters using the regular expression backslash sequence \X:

        #!perl

        use strict;
        use warnings;
        use autodie;

        open my $input_fh,  '<:encoding(UTF-8)', 'DriedMangos.txt';
        open my $output_fh, '>:encoding(UTF-8)', 'Graphemes.txt';

        while (my $line = <$input_fh>) {
            chomp $line;

            while ($line =~ m/(\X)/g) {
                print $output_fh "[$1]";
            }

            print $output_fh "\n";
        }

        close $input_fh;
        close $output_fh;

        Contents of the output text file named Graphemes.txt:

        [d][r][i][e][d][ ][m][a][n][g][o][s]
        [m][a][n][g][u][e][s][ ][s][é][c][h][é][e][s]
        [芒][果][幹]
        [d][o][r][a][i][d][o][ ][m][a][n][g][ō][s][u]
        [ド][ラ][イ][ド][マ][ン][ゴ][ス]
        [ド][ラ][イ][ド][マ][ン][ゴ][ス]
        [ト]["][ラ][イ][ト]["][マ][ン][コ]["][ス]

        (See http://ameblo.jp/gucciman-ikkob/entry-10317490092.html for an explanation of the peculiar last line of the file named DriedMangos.txt.)

        Perl script to display the contents of the UTF-8 text file named DriedMangos.txt as a list of Unicode code points and character names:

        #!perl

        use strict;
        use warnings;
        use autodie;
        use Unicode::UCD qw( charinfo );

        open my $input_fh, '<:encoding(UTF-8)', 'DriedMangos.txt';

        while (my $line = <$input_fh>) {
            chomp $line;

            while ($line =~ m/(.)/g) {
                my $character = $1;
                my $codepoint = ord $character;
                my $charinfo  = charinfo($codepoint);
                my $code      = "U+$charinfo->{'code'}";
                my $name      = $charinfo->{'name'};

                print "$code $name\n";
            }

            print "\n";
        }

        close $input_fh;

        The output of the script:

        U+0064 LATIN SMALL LETTER D
        U+0072 LATIN SMALL LETTER R
        U+0069 LATIN SMALL LETTER I
        U+0065 LATIN SMALL LETTER E
        U+0064 LATIN SMALL LETTER D
        U+0020 SPACE
        U+006D LATIN SMALL LETTER M
        U+0061 LATIN SMALL LETTER A
        U+006E LATIN SMALL LETTER N
        U+0067 LATIN SMALL LETTER G
        U+006F LATIN SMALL LETTER O
        U+0073 LATIN SMALL LETTER S

        U+006D LATIN SMALL LETTER M
        U+0061 LATIN SMALL LETTER A
        U+006E LATIN SMALL LETTER N
        U+0067 LATIN SMALL LETTER G
        U+0075 LATIN SMALL LETTER U
        U+0065 LATIN SMALL LETTER E
        U+0073 LATIN SMALL LETTER S
        U+0020 SPACE
        U+0073 LATIN SMALL LETTER S
        U+0065 LATIN SMALL LETTER E
        U+0301 COMBINING ACUTE ACCENT
        U+0063 LATIN SMALL LETTER C
        U+0068 LATIN SMALL LETTER H
        U+0065 LATIN SMALL LETTER E
        U+0301 COMBINING ACUTE ACCENT
        U+0065 LATIN SMALL LETTER E
        U+0073 LATIN SMALL LETTER S

        U+8292 CJK UNIFIED IDEOGRAPH-8292
        U+679C CJK UNIFIED IDEOGRAPH-679C
        U+5E79 CJK UNIFIED IDEOGRAPH-5E79

        U+0064 LATIN SMALL LETTER D
        U+006F LATIN SMALL LETTER O
        U+0072 LATIN SMALL LETTER R
        U+0061 LATIN SMALL LETTER A
        U+0069 LATIN SMALL LETTER I
        U+0064 LATIN SMALL LETTER D
        U+006F LATIN SMALL LETTER O
        U+0020 SPACE
        U+006D LATIN SMALL LETTER M
        U+0061 LATIN SMALL LETTER A
        U+006E LATIN SMALL LETTER N
        U+0067 LATIN SMALL LETTER G
        U+006F LATIN SMALL LETTER O
        U+0304 COMBINING MACRON
        U+0073 LATIN SMALL LETTER S
        U+0075 LATIN SMALL LETTER U

        U+30C9 KATAKANA LETTER DO
        U+30E9 KATAKANA LETTER RA
        U+30A4 KATAKANA LETTER I
        U+30C9 KATAKANA LETTER DO
        U+30DE KATAKANA LETTER MA
        U+30F3 KATAKANA LETTER N
        U+30B4 KATAKANA LETTER GO
        U+30B9 KATAKANA LETTER SU

        U+30C8 KATAKANA LETTER TO
        U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
        U+30E9 KATAKANA LETTER RA
        U+30A4 KATAKANA LETTER I
        U+30C8 KATAKANA LETTER TO
        U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
        U+30DE KATAKANA LETTER MA
        U+30F3 KATAKANA LETTER N
        U+30B3 KATAKANA LETTER KO
        U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
        U+30B9 KATAKANA LETTER SU

        U+30C8 KATAKANA LETTER TO
        U+0022 QUOTATION MARK
        U+30E9 KATAKANA LETTER RA
        U+30A4 KATAKANA LETTER I
        U+30C8 KATAKANA LETTER TO
        U+0022 QUOTATION MARK
        U+30DE KATAKANA LETTER MA
        U+30F3 KATAKANA LETTER N
        U+30B3 KATAKANA LETTER KO
        U+0022 QUOTATION MARK
        U+30B9 KATAKANA LETTER SU

        The Latin characters with diacritics are in Unicode Normalization Form D (NFD). The katakana characters on the fifth line are in Unicode Normalization Form C (NFC). The same katakana characters on the sixth line are in NFD.
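
The relationship between the two normalization forms and grapheme matching discussed in this subthread can be sketched with the core Unicode::Normalize module. This is an illustrative sketch, not part of the scripts above; it uses KATAKANA LETTER DO as the sample and writes the characters as escapes so that the source encoding does not matter.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC NFD );

# KATAKANA LETTER DO as a single precomposed code point (its NFC form).
my $composed = "\x{30C9}";

# NFD splits it into TO (U+30C8) plus the combining voiced sound mark (U+3099).
my $decomposed = NFD($composed);

# A two-letter sample: decomposed DO followed by RA (U+30E9).
my $sample = $decomposed . "\x{30E9}";

# Both patterns keep the combining mark attached to its base letter.
my @graphemes = $sample =~ /(\X)/g;
my @poor_man  = $sample =~ /(\P{M}\p{M}*)/g;

printf "composed: %d code point(s), decomposed: %d code point(s)\n",
    length($composed), length($decomposed);
printf "sample: %d code points, %d graphemes via \\X, %d via \\P{M}\\p{M}*\n",
    length($sample), scalar(@graphemes), scalar(@poor_man);

# NFC puts the precomposed form back together.
my $recomposed = NFC($decomposed);
printf "round trip equal: %s\n", ($recomposed eq $composed ? 'yes' : 'no');
```

If the dictionary data might mix both forms, normalizing every string with NFC (or NFD) before comparing or storing it is one way to keep lookups consistent.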

Re: Unicode: Perl5 equivalent to Perl6's @string.graphemes?
by Anonymous Monk on Nov 12, 2010 at 21:24 UTC

    Hm. Are you sure about \p{M}? In perldoc perluniprops this is defined as matching "Mark" (whatever it means). You need something like \p{InHiragana}. Better yet, define your own property that would match what you really need. Read "perldoc perlunicode".

    Also make sure that edict is really in UTF-8. The simplest way is to open it in the vim editor and then check the encoding. Normally vim uses UTF-8, so if the Japanese is displayed correctly, then it is UTF-8. If not, then it is something else (I know that the EDICT distributed by WWWJDIC is in EUC-JP).

      Hm. Are you sure about \p{M}?

      Yes. /\P{M}\p{M}*/ is a poor man's version of (only recently available) /\X/. The idea is to match what the reader would consider a character. These are called "graphemes". Graphemes can be formed by more than one Unicode code point. For example, this instance of grapheme "é" is composed using code points U+0065 (LATIN SMALL LETTER E) plus U+0301 (COMBINING ACUTE ACCENT). U+0065 matches /\P{M}/, and U+0301 matches /\p{M}/.

      He simply needs to apply the regex pattern against the decoded text (as his commented out code would do) rather than apply the regexp against the UTF-8 bytes that represent the text.

      Also make sure that edict is really in UTF-8.

      It surely is since he got a U+FF11 (FULLWIDTH DIGIT ONE, a "1" as wide as a Japanese character) when treating the input as UTF-8.
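
On the question of how to find out whether a file is UTF-8: one programmatic check is to try a strict decode and see whether it fails. The sketch below assumes the core Encode module; looks_like_utf8 is a hypothetical helper name, and the sample bytes are generated in place rather than read from the edict file. A clean decode is strong evidence rather than proof, since pure ASCII also passes, but EUC-JP text is very unlikely to pass by accident.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: true if $bytes decodes cleanly as strict UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps it from modifying $bytes.
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# The same three characters (U+6301 U+3063 U+3066) in two encodings.
my $text        = "\x{6301}\x{3063}\x{3066}";
my $utf8_bytes  = encode('UTF-8',  $text);
my $eucjp_bytes = encode('euc-jp', $text);

printf "UTF-8 sample:  %s\n", looks_like_utf8($utf8_bytes)  ? 'valid' : 'invalid';
printf "EUC-JP sample: %s\n", looks_like_utf8($eucjp_bytes) ? 'valid' : 'invalid';
```

To apply this to a real file, read it with a '<:raw' layer so the helper sees the undecoded bytes.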