PerlMonks  

Creating new character classes for foreign languages

by Polyglot (Monk)
on May 16, 2009 at 14:25 UTC (#764420=perlquestion)
Polyglot has asked for the wisdom of the Perl Monks concerning the following question:

For those who like to feel they are helping the world, this may be a tangible opportunity. I have done some research on the Perl packages for handling the Thai and Lao languages (which are similar, but have separate Unicode code points), and found none that properly classifies the characters of these languages. The only classes available determine whether a particular character is in that language, as in:

\p{InThai} | \p{InLao}

However, these languages have, for example, three classes of consonants: high, middle, and low. These classes, in conjunction with tone marks, determine the tone of each syllable, and play an important role in determining the boundaries for words and syllables (these languages do not space-delimit words).

The Unicode documentation I found for some of the packages on CPAN mentioned that the programmers did not know Thai and therefore could not do much that was useful with it.

I'm trying to create, then, a set of classes for these characters, but I have never created a Perl package before, only used them. The Perl book I have was vague about how to define the subroutines for this, and has left me confused.

I'm not asking for anyone to create the package. I only ask for someone to point me in the right direction. If we were to assume that x = high class, y = low class, and z = middle class, could anyone give me an example of how I might make a package that could be used something like this:

use Thai;

$line =~ m/\p{ThaiHighClass}\p{ThaiLowClass}/;

Naturally, I would like to define much more than consonant classes for these languages, but if I could do just this much, the rest would be easy to add.

Help will be much appreciated, and success may mean an addition to CPAN.

Blessings,

Polyglot

Re: Creating new character classes for foreign languages
by JavaFan (Canon) on May 16, 2009 at 15:22 UTC
    It's quite easy to create your own character classes. See the section User-Defined Character Properties in the perlunicode manual page. For instance:
    sub IsVowel {<<"EOT"}
    041
    045
    049
    04F
    055
    061
    065
    069
    06F
    075
    EOT

    "A" =~ /\p{IsVowel}/; # Matches
    "B" =~ /\p{IsVowel}/; # Does not match.
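Applied to the consonant classes from the question, the same technique might look like the sketch below. The property names (InThaiHighClass and so on) are made up for illustration. perlunicode requires user-defined property names to begin with "In" or "Is", so a bare \p{ThaiHighClass} would not be recognized. The code point lists follow the traditional high/middle/low grouping as I understand it and should be checked against a Thai reference before being relied on.

```perl
use strict;
use warnings;

# Hypothetical property names. User-defined properties must begin with
# "In" or "Is". Each line of the returned string is a single hex code
# point, or two hex code points separated by whitespace for a range.

sub InThaiHighClass {
    return join("\n",
        '0E02 0E03',    # kho khai, kho khuat
        '0E09',         # cho ching
        '0E10',         # tho than
        '0E16',         # tho thung
        '0E1C 0E1D',    # pho phung, fo fa
        '0E28 0E2B',    # so sala, so rusi, so sua, ho hip
    ) . "\n";
}

sub InThaiMidClass {
    return join("\n",
        '0E01',         # ko kai
        '0E08',         # cho chan
        '0E0E 0E0F',    # do chada, to patak
        '0E14 0E15',    # do dek, to tao
        '0E1A 0E1B',    # bo baimai, po pla
        '0E2D',         # o ang
    ) . "\n";
}

sub InThaiLowClass {
    # The remaining consonants, written as ranges where they are contiguous.
    return join("\n",
        '0E04 0E07', '0E0A 0E0D', '0E11 0E13', '0E17 0E19',
        '0E1E 0E23', '0E25', '0E27', '0E2C', '0E2E',
    ) . "\n";
}

# kho khai (high class) followed by ngo ngu (low class)
my $line = "\x{0E02}\x{0E07}";
print "high followed by low\n"
    if $line =~ /\p{InThaiHighClass}\p{InThaiLowClass}/;
```

The subroutines here live in the same package as the regex, which is the simplest case.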
Re: Creating new character classes for foreign languages
by graff (Chancellor) on May 16, 2009 at 19:31 UTC
    To expand a bit on what JavaFan said, definitely look up the part he mentioned in perlunicode; and there's a nice quick intro to it in this post by japhy: Re: Re: japhy's regex article for the TPJ.

    I've tried this out myself to get some character classes of interest for Arabic (it's not ready for distro yet because I need to define a few more relevant classes, but I'll try to get it out on CPAN pretty soon). The general layout goes like this:
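As a rough illustration only (this is not graff's Arabic module), a package of user-defined properties might be laid out along the following lines. The package name Thai::CharClass and the property names are hypothetical. According to perlunicode, a property defined in a package other than the one containing the regex has to be written with its package name inside \p{}.

```perl
use strict;
use warnings;

# In a real distribution this part would live in its own file,
# e.g. Thai/CharClass.pm (hypothetical name), ending with "1;".
package Thai::CharClass;

# Tone marks: mai ek, mai tho, mai tri, mai chattawa (one range).
sub InThaiToneMark { return "0E48\t0E4B\n" }

# Thai digits zero through nine (one range).
sub InThaiDigit    { return "0E50\t0E59\n" }

# Calling code: the property is named with its package.
package main;

my $text  = "\x{0E01}\x{0E48}\x{0E51}";   # ko kai, mai ek, Thai digit one
my $tones = () = $text =~ /\p{Thai::CharClass::InThaiToneMark}/g;
print "tone marks: $tones\n";
```

Further classes can be added as more subroutines in the same package.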

      graff,

      Your answer was excellent, and I thank you. It was just what I needed to get me started. I did some reading at the site that JavaFan also recommended and it was good too. I need to absorb more of it, I think, but that will come in time.

      I now have two questions that I have run into and do not see addressed anywhere. The first is a very simple one.

      1) Is it permissible to comment somehow within the subroutine's character set block? For example:

      sub InThaiVowel {
          return <<'END';
      0E30 0E45
      0E4D
      0E22 #Thai consonant yo-yak can also be a vowel (like 'y' in English)
      0E2D #Thai consonant or-ang can also be a vowel
      0E27 #Thai consonant wo-wen is only a vowel following mai han-akat
      END
      }

      2) Is it possible to define a double-character property? For example, the Thai 'r' becomes a vowel if, and only if, there are two of them together, as in 'rr'. It is then pronounced differently, and is no longer strictly an 'r'. How might I handle this? I suppose this would require some look-ahead assertions...but could these be incorporated into the subroutine in some way?

      Thank you so much for your helpfulness!

      Blessings,

      ~Polyglot~

        Not like that, since you must return a specially-formatted string. You can do it like this:

        sub InThaiVowel {
            return join "\n",
                '0E30 0E45',
                '0E4D',
                '0E22', #Thai consonant yo-yak can also be a vowel (like 'y' in English)
                '0E2D', #Thai consonant or-ang can also be a vowel
                '0E27', #Thai consonant wo-wen is only a vowel following mai han-akat
        }
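Another option, if a commented listing reads better than a list of quoted strings, is to strip the comments in Perl before the string is returned. This is only a sketch: property_spec is a hypothetical helper name, and the code points are taken from the question as-is.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a commented, freely indented listing into the
# plain newline-separated form that a property subroutine returns.
sub property_spec {
    my ($listing) = @_;
    my @lines;
    for my $line (split /\n/, $listing) {
        $line =~ s/#.*//;            # drop a trailing comment
        $line =~ s/^\s+|\s+$//g;     # trim surrounding whitespace
        push @lines, $line if length $line;
    }
    return join("\n", @lines) . "\n";
}

sub InThaiVowel {
    return property_spec(q{
        0E30 0E45
        0E4D
        0E22    # yo-yak can also be a vowel (like 'y' in English)
        0E2D    # or-ang can also be a vowel
        0E27    # wo-wen is only a vowel following mai han-akat
    });
}

print "matches\n" if "\x{0E22}" =~ /\p{InThaiVowel}/;
```

The comments never reach Perl's property parser, so this does not depend on how that parser treats "#".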
        Is it possible to define a double-character property?
        No.

        You are, after all, defining a character class. A character class almost always[1] matches exactly one character, never taking context into account.

        How might I handle this?
        Define a rule, not a character class. Out of curiosity, which 'r' in 'rr' is the vowel? First one, second one, or both?

        [1] The only exception I can think of is case-insensitive matching, where Unicode defines the "other case" of a character as a multi-character sequence.
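The "rule, not a character class" suggestion could be sketched as a compiled pattern that sits alongside a single-character property. Everything named here is hypothetical: InThaiVowelSign is an illustrative subset of vowel signs, and the decision to treat exactly two ro rua (U+0E23) in a row as the vowel, and to ignore longer runs, is an assumption made for the example.

```perl
use strict;
use warnings;

# Hypothetical single-character property (illustrative subset of vowel signs).
sub InThaiVowelSign { return "0E30\t0E3A\n0E40\t0E45\n" }

# Rule: exactly two ro rua in a row. Assumption: a run of three or more
# is not treated as the doubled form.
my $double_ro = qr/(?<!\x{0E23})\x{0E23}{2}(?!\x{0E23})/;

# At the rule level, a "vowel" is either the doubled ro rua or one vowel sign.
my $vowel = qr/(?:$double_ro|\p{InThaiVowelSign})/;

sub count_vowels {
    my ($text) = @_;
    my $n = () = $text =~ /$vowel/g;   # count all non-overlapping matches
    return $n;
}

# ko kai, ro rua, ro rua, mo ma: the doubled ro rua counts once.
print count_vowels("\x{0E01}\x{0E23}\x{0E23}\x{0E21}"), "\n";
```

The property still matches one character at a time. The context sensitivity lives entirely in the surrounding pattern.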

        Is it possible to define a double-character property? For example, the Thai 'r' becomes a vowel if, and only if, there are two of them together, as in 'rr'. It is then pronounced differently, and is no longer strictly an 'r'.

        Here you are moving away from strictly orthographic matters into phonetics or phonology, which are essentially context-dependent, and this takes you out of the domain of merely classifying letter symbols into related groups, which is essentially not context-dependent.

        If the goal is to provide a means for doing correct word segmentation of Thai text, the handling of the context-dependent rules (like "rr" becomes "un") should probably be in a separate module. The functions that work on sequences of characters will depend on the functions that define the basic character classes.

        (You probably could put the subroutines for character-classes and context-dependent rules together in one module if you want to, but the two sets of subroutines will have very different usages from the caller's point of view. And the overall problem being addressed is probably complicated enough that you will want to segregate portions of the solution into separate modules anyway.)
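The split between character classes and sequence-level rules might look something like this. It is a sketch under stated assumptions: both package names are hypothetical, the class lists should be verified against a Thai reference, and the only context handled is that the vowel signs U+0E40 to U+0E44 are written before the consonant they belong to.

```perl
use strict;
use warnings;

# --- Module 1 (hypothetical name): context-free character classes ---
package Thai::CharClass;

sub InHighClass {
    return join("\n", '0E02 0E03', '0E09', '0E10', '0E16',
                      '0E1C 0E1D', '0E28 0E2B') . "\n";
}

sub InMidClass {
    return join("\n", '0E01', '0E08', '0E0E 0E0F', '0E14 0E15',
                      '0E1A 0E1B', '0E2D') . "\n";
}

# --- Module 2 (hypothetical name): rules over character sequences ---
package Thai::Rules;

# Class of a syllable's initial consonant, skipping one leading vowel sign.
sub initial_class {
    my ($syllable) = @_;
    my $lead = qr/^[\x{0E40}-\x{0E44}]?/;
    return 'high' if $syllable =~ /$lead\p{Thai::CharClass::InHighClass}/;
    return 'mid'  if $syllable =~ /$lead\p{Thai::CharClass::InMidClass}/;
    # Any other consonant letter is taken to be low class.
    return 'low'  if $syllable =~ /$lead[\x{0E01}-\x{0E23}\x{0E25}\x{0E27}-\x{0E2E}]/;
    return;
}

package main;

# sara ai maimalai followed by po pla: the consonant is middle class.
print Thai::Rules::initial_class("\x{0E44}\x{0E1B}"), "\n";   # prints "mid"
```

Callers who only need \p{...} classes would load the first module alone. The second depends on the first, as described above.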

        Just curious: have you looked at Lingua::TH::Segmentation? I just happened to notice it was there, but I haven't tried it. Have you?
