 
PerlMonks  


Okay, so I'm trying to parse a file that contains Japanese text (a dictionary file, in order to populate a spaced repetition system with a trainload of vocabulary words). I believe the encoding of the file is UTF-8. The trouble I'm running into is in taking a word, which is written in kanji (possibly with okurigana), and separating it into graphemes. At least, I *think* graphemes is the right word. Given an example like "持って行く", I want to parse it into ("持", "っ", "て", "行", "く").

I've been reading up in perlunicode, and I have the following code. (Don't worry about the database stuff; that's all fine.)

#!/usr/bin/perl
# -*- cperl -*-

use utf8;
require "./db.pl";

my ($pack) = findrecord('pack', 'packname', 'gjiten');
die "You must create the gjiten pack first (and make sure its language is set correctly).\n" if not ref $pack;
my $packid = $$pack{id};
my $langid = $$pack{language};
my $userid = $$pack{user} || 1; # We guess that the first user is probably the sysadmin.
my ($category) = findrecord('category', 'name', 'word');
die "How can there be no word category?\n" if not ref $category;
my $catid = $$category{id};

my $dicfile = '/usr/share/gjiten/dics/edict';
open DIC, #'<:encoding(UTF-8)',
          '<',
          $dicfile
  or die "Cannot open dictionary: $dicfile";
my ($total, $skipped, $japanese, $already, $inserted);
while (<DIC>) {
  my $line = $_;
  my ($firstchar) = $line =~ /^(.)/;
  ++$total;
  if ($firstchar le '~') {
    ++$skipped;
    print "Skipping (starts with low character '$firstchar'): $line";
  } else {
    ++$japanese;
    die if $japanese > 50 or $total > 100;
    my ($word, @def) = split m{/}, $line;
    my ($spelling, $reading) = $word =~ /([^[ ]+) ?\s*(?:[[](.*?)[]])?/;
    my ($chars, @char) = $spelling;
    #while ($chars) {
    #  my ($c) = $chars =~ m/^(.)/;
    #  $chars =~ s/^(.)//;
    #  push @char, $c;
    #}
    #my @char = $spelling =~ m/(\X)/g;
    my @char = $spelling =~ m/(\P{M}\p{M}*)/g;
    use Data::Dumper;
    print Dumper(+{
                   word     => $word,
                   spelling => $spelling,
                   reading  => $reading,
                   defs     => \@def,
                   char     => \@char,
                  });
  }
}
print "Of $total lines, skipped $skipped.\nFound $japanese Japanese words. $already already had cards.\n$inserted new cards would be created.\n";

The words as printed out are correct, but the list of characters is a list of bytes. In some circumstances that would be what I want, but here it's not.

(To the best of my knowledge) the really key line there is this one:

my @char = $spelling =~ m/(\P{M}\p{M}*)/g;

I must have misunderstood something in the docs, because the way I read them that pattern should match graphemes, and the commented-out version of the pattern match, using \X, should do so as well. (I'm using the Debian build of perl 5.10.0.) The commented-out while loop above that is another attempt. All three produce the same result: a list of bytes. What am I doing wrong?
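To isolate the symptom from the database code, here is a minimal sketch that applies \X to the same word twice: once as raw UTF-8 bytes (which is what a plain '<' open hands back) and once after an explicit decode. The sub name split_graphemes and the hard-coded byte string are made up for this illustration; they are not part of the script above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 bytes of "持って行く", written as escapes so that the
# encoding of this source file does not matter (illustrative input).
my $bytes = "\xE6\x8C\x81\xE3\x81\xA3\xE3\x81\xA6\xE8\xA1\x8C\xE3\x81\x8F";

# Hypothetical helper: split a string into extended grapheme clusters.
sub split_graphemes {
    my ($string) = @_;
    return $string =~ /(\X)/g;
}

# Undecoded: Perl sees 15 separate one-byte "characters".
my @from_bytes = split_graphemes($bytes);

# Decoded: Perl sees 5 characters, one per kanji or kana.
my @from_chars = split_graphemes(decode('UTF-8', $bytes));

printf "undecoded: %d pieces\n", scalar @from_bytes;
printf "decoded:   %d pieces\n", scalar @from_chars;
```

If this reading is right, the pattern itself is not the problem; what matters is whether the string being matched holds decoded characters or undecoded bytes.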

The one thing I did that produced different results was to change the open statement to use '<:encoding(UTF-8)' for the mode instead of '<', but that transforms everything (not just the chars, but also the words) into stuff like "\x{ff11}", which does not seem useful to me.
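One possible reading of that output, sketched under the assumption that the "\x{...}" text comes from Data::Dumper rather than from the data itself: as far as I know, Data::Dumper writes any character above the single-byte range as a \x{...} escape, so a correctly decoded string would look like this when dumped even though it holds real characters. The variable names below are illustrative only.

```perl
use strict;
use warnings;
use Data::Dumper;
use Encode qw(decode encode);

$Data::Dumper::Terse = 1;    # dump the bare value, without "$VAR1 = "

# One decoded character, U+6301 (the kanji from the example word).
my $text = decode('UTF-8', "\xE6\x8C\x81");

# Dumper shows the character as a Perl escape sequence ...
my $dumped = Dumper($text);

# ... while encoding it again yields the original three bytes.
my $encoded = encode('UTF-8', $text);

print $dumped;
print "$encoded\n";
```

On that assumption, the decoded data may be fine and only the debugging output is escaped; printing through an encoding layer (or encoding explicitly, as above) would show the text itself.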

Do I have the file encoding wrong somehow? (The Gjiten docs say the file must be UTF-8 or Gjiten can't read it; Gjiten can read it, so I assume it's UTF-8. Is there a way to find out?)
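One way to at least test the UTF-8 assumption is a strict decode: Encode's decode() can be told to die on malformed input via FB_CROAK. This is only a sketch; the sub name is_valid_utf8 and the sample byte strings are made up here, and passing the check shows only that the bytes are well-formed UTF-8, not that UTF-8 was what the author of the file intended.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if $bytes is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # decode() may modify its input when a CHECK value is given,
    # so work on a copy.
    my $copy = $bytes;
    my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
    return $ok ? 1 : 0;
}

# Illustrative inputs: a valid three-byte sequence, and two high
# bytes that cannot start a UTF-8 sequence.
printf "valid sample:   %d\n", is_valid_utf8("\xE6\x8C\x81");
printf "invalid sample: %d\n", is_valid_utf8("\xBB\xFD");
```

To apply this to the dictionary, the file would need to be read as raw bytes (for example with a '<:raw' open) and each line, or the whole contents, passed to the helper.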


In reply to Unicode: Perl5 equivalent to Perl6's @string.graphemes? by jonadab
