
Okay, so I'm trying to parse a file that contains Japanese text (a dictionary file, in order to populate a spaced repetition system with a trainload of vocabulary words). I believe the encoding of the file is UTF-8. The trouble I'm running into is in taking a word, which is written in kanji (possibly with okurigana), and separating it into graphemes. At least, I *think* graphemes is the right word. Given an example like "持って行く", I want to parse it into ("持", "っ", "て", "行", "く").

I've been reading up in perlunicode, and I have the following code. (Don't worry about the database stuff; that's all fine.)

#!/usr/bin/perl
# -*- cperl -*-

use utf8;
require "./";

my ($pack) = findrecord('pack', 'packname', 'gjiten');
die "You must create the gjiten pack first (and make sure its language is set correctly).\n" if not ref $pack;
my $packid = $$pack{id};
my $langid = $$pack{language};
my $userid = $$pack{user} || 1; # We guess that the first user is probably the sysadmin.
my ($category) = findrecord('category', 'name', 'word');
die "How can there be no word category?\n" if not ref $category;
my $catid = $$category{id};

my $dicfile = '/usr/share/gjiten/dics/edict';
open DIC, #'<:encoding(UTF-8)',
  '<',
  $dicfile
  or die "Cannot open dictionary: $dicfile";
my ($total, $skipped, $japanese, $already, $inserted);
while (<DIC>) {
  my $line = $_;
  my ($firstchar) = $line =~ /^(.)/;
  ++$total;
  if ($firstchar le '~') {
    ++$skipped;
    print "Skipping (starts with low character '$firstchar'): $line";
  } else {
    ++$japanese; die if $japanese > 50 or $total > 100;
    my ($word, @def) = split m{/}, $line;
    my ($spelling, $reading) = $word =~ /([^[ ]+) ?\s*(?:[[](.*?)[]])?/;
    my ($chars, @char) = $spelling;
    #while ($chars) {
    #  my ($c) = $chars =~ m/^(.)/;
    #  $chars =~ s/^(.)//;
    #  push @char, $c;
    #}
    #my @char = $spelling =~ m/(\X)/g;
    my @char = $spelling =~ m/(\P{M}\p{M}*)/g;
    use Data::Dumper;
    print Dumper(+{ word     => $word,
                    spelling => $spelling,
                    reading  => $reading,
                    defs     => \@def,
                    char     => \@char,
                  });
  }
}
print "Of $total lines, skipped $skipped.\nFound $japanese Japanese words. $already already had cards.\n$inserted new cards would be created.\n";

The words as printed out are correct, but the list of characters is a list of bytes. In some circumstances that would be what I want, but here it's not.

To the best of my knowledge, the key line there is this one:

my @char = $spelling =~ m/(\P{M}\p{M}*)/g;

I must have misunderstood something in the docs, because the way I read them that should match graphemes, and the commented-out version of the pattern match, using \X, should do so as well. (I'm using the Debian build of perl 5.10.0.) The commented-out while loop above that is another attempt. All of them produce the same result: a list of bytes. What am I doing wrong?
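To make the symptom concrete, here is a minimal sketch that compares matching \X against raw bytes with matching it against a decoded character string. It assumes the input really is UTF-8; the byte string below is the UTF-8 encoding of the example word, hard-coded so the snippet does not depend on the dictionary file, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes of the example word, as they would arrive from a
# filehandle opened with plain '<' (no decoding layer).
my $bytes = "\xE6\x8C\x81\xE3\x81\xA3\xE3\x81\xA6\xE8\xA1\x8C\xE3\x81\x8F";

# Matching against the undecoded string: every byte counts as a character.
my @from_bytes = $bytes =~ /(\X)/g;

# Decoding first yields a character string, so \X can see whole characters.
my $chars      = decode('UTF-8', $bytes);
my @from_chars = $chars =~ /(\X)/g;

printf "undecoded: %d pieces, decoded: %d pieces\n",
    scalar @from_bytes, scalar @from_chars;

# Encode again before printing, since STDOUT has no encoding layer here.
print encode('UTF-8', join(' | ', @from_chars)), "\n";
```

If this behaves the same way on the real data, it would suggest the patterns are fine and the difference lies in whether the string was decoded before matching.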

The one thing I did that produced different results was to change the open statement to use '<:encoding(UTF-8)' for the mode instead of '<', but that transforms everything (not just the chars, but also the words) into stuff like "\x{ff11}", which does not seem useful to me.
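For what it's worth, the "\x{...}" notation may just be how Data::Dumper displays characters above 0xFF, not a sign that the data was damaged. The following sketch, which assumes UTF-8 input and uses a hard-coded two-character sample (行く) in place of the dictionary data, dumps a decoded string and then prints the same string through an output encoding layer.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Data::Dumper;

# Decode the UTF-8 bytes for a two-character sample into a character string.
my $decoded = decode('UTF-8', "\xE8\xA1\x8C\xE3\x81\x8F");

# Terse drops the "$VAR1 = " prefix so only the quoted string is shown.
local $Data::Dumper::Terse = 1;
my $dumped = Dumper($decoded);   # wide characters appear as \x{...} escapes

# Give STDOUT an encoding layer so characters are written out as UTF-8 bytes.
binmode STDOUT, ':encoding(UTF-8)';
print "Dumper shows: $dumped";
print "Printed directly: $decoded\n";
```

The escapes describe code points, so each one corresponds to a single character of the original text.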

Do I have the file encoding wrong somehow? (The Gjiten docs say the file must be UTF-8 or Gjiten can't read it; Gjiten can read it, so I assume it's UTF-8. Is there a way to find out?)
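One way to test the UTF-8 assumption is to attempt a strict decode and see whether it fails. The sketch below does that with the core Encode module; looks_like_utf8 and file_looks_like_utf8 are hypothetical helper names made up for illustration. A successful decode is not proof, since plain ASCII is also valid UTF-8, but non-ASCII text in another encoding will usually fail the check.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if the byte string decodes as strict UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying $bytes while it checks.
    my $ok = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# Hypothetical helper: slurp a file as raw bytes and run the same check.
sub file_looks_like_utf8 {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                 # slurp mode: read the whole file at once
    my $bytes = <$fh>;
    close $fh;
    return looks_like_utf8($bytes);
}

# Valid UTF-8 sample, then a byte sequence that is not valid UTF-8.
print looks_like_utf8("\xE8\xA1\x8C\xE3\x81\x8F") ? "valid\n" : "invalid\n";
print looks_like_utf8("\xB9\xD4\xA4\xAF")         ? "valid\n" : "invalid\n";
```

Calling file_looks_like_utf8 with the dictionary path from the script above would apply the same check to the whole file.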

In reply to Unicode: Perl5 equivalent to Perl6's @string.graphemes? by jonadab
