Re: Unicode: Perl5 equivalent to Perl6's @string.graphemes?

by Anonymous Monk
on Nov 12, 2010 at 21:24 UTC


in reply to Unicode: Perl5 equivalent to Perl6's @string.graphemes?

Hm. Are you sure about \p{M}? perldoc perluniprops defines it as matching "Mark" (whatever that means). You probably need something like \p{InHiragana}. Better yet, define your own property that matches exactly what you need; see "perldoc perlunicode".

Also make sure that edict is really in UTF-8. The simplest check is to open it in the vim editor and look at the encoding. Vim normally uses UTF-8, so if the Japanese is displayed correctly, the file is UTF-8. If not, it is something else (I know that the EDICT distributed by WWWJDIC is in EUC-JP).
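
The same check can be done from Perl rather than by eye. The following is only a sketch: the helper name decodes_as and the sample byte strings are made up for illustration, and a clean strict decode is a strong hint, not proof, that the bytes are in that encoding.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if $bytes decodes without error under $encoding.
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps $bytes intact.
sub decodes_as {
    my ($encoding, $bytes) = @_;
    return eval { decode($encoding, $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Sample data (assumed for illustration): the word "日本" in each encoding.
my $utf8_bytes  = "\xE6\x97\xA5\xE6\x9C\xAC";
my $eucjp_bytes = "\xC6\xFC\xCB\xDC";

printf "UTF-8 sample as UTF-8:  %s\n",
    decodes_as('UTF-8', $utf8_bytes)  ? 'ok' : 'invalid';
printf "EUC-JP sample as UTF-8: %s\n",
    decodes_as('UTF-8', $eucjp_bytes) ? 'ok' : 'invalid';
```

On a real file you would read the raw bytes (no encoding layer on the file handle) and pass them to the helper.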


Re^2: Unicode: Perl5 equivalent to Perl6's @string.graphemes?
by ikegami (Pope) on Nov 12, 2010 at 22:30 UTC

    Hm. Are you sure about \p{M}?

    Yes. /\P{M}\p{M}*/ is a poor man's version of the (only recently available) /\X/. The idea is to match what the reader would consider a character. These are called "graphemes". A grapheme can be formed from more than one Unicode code point. For example, this instance of the grapheme "é" is composed of code points U+0065 (LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT). U+0065 matches /\P{M}/, and U+0301 matches /\p{M}/.
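
    To see the two patterns side by side, here is a small sketch. The sample string is made up for illustration: the decomposed "é" described above followed by a plain "a".

```perl
use strict;
use warnings;

# "e" + U+0301 (COMBINING ACUTE ACCENT), then a plain "a": three code points.
my $text = "e\x{301}a";

# Poor man's grapheme match: one non-mark followed by any number of marks.
my @graphemes = $text =~ /(\P{M}\p{M}*)/g;

# \X matches one grapheme cluster.
my @clusters = $text =~ /(\X)/g;

# Prints "2 graphemes, 2 clusters, 3 code points".
printf "%d graphemes, %d clusters, %d code points\n",
    scalar @graphemes, scalar @clusters, length $text;
```

    Both patterns agree on this input. They can differ on text where a grapheme is built from something other than base-plus-marks, which is why \X is preferable when it is available.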

    He simply needs to apply the regex pattern to the decoded text (as his commented-out code would do) rather than to the UTF-8 bytes that represent the text.
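
    A minimal sketch of the difference, assuming the input arrives as raw UTF-8 bytes (the byte string below is made up for illustration and encodes the decomposed "é" from above):

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for "e" + U+0301, as they would come from an undecoded file handle.
my $bytes = "e\xCC\x81";

# Against the raw bytes, each byte is treated as its own character,
# and none of them is a mark, so the accent is never attached to the "e".
my @from_bytes = $bytes =~ /(\P{M}\p{M}*)/g;

# After decoding, the pattern sees two code points forming one grapheme.
my $chars      = decode('UTF-8', $bytes);
my @from_chars = $chars =~ /(\P{M}\p{M}*)/g;

printf "bytes: %d matches, decoded: %d match\n",
    scalar @from_bytes, scalar @from_chars;
```

    Opening the file with an I/O layer such as '<:encoding(UTF-8)' would do the decoding on read instead of calling decode() by hand.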

    Also make sure that edict is really in UTF-8.

    It surely is since he got a U+FF11 (FULLWIDTH DIGIT ONE, a "1" as wide as a Japanese character) when treating the input as UTF-8.
