PerlMonks  

Re^2: Creating new character classes for foreign languages

by Polyglot (Monk)
on May 17, 2009 at 04:00 UTC (#764472)


in reply to Re: Creating new character classes for foreign languages
in thread Creating new character classes for foreign languages

graff,

Your answer was excellent, and I thank you. It was just what I needed to get me started. I did some reading at the site that JavaFan also recommended and it was good too. I need to absorb more of it, I think, but that will come in time.

I now have two questions that I do not see addressed anywhere. The first is a very simple one.

1) Is it permissible to comment somehow within the subroutine's character set block? For example:

sub InThaiVowel {
    return <<'END';
0E30 0E45
0E4D
0E22 #Thai consonant yo-yak can also be a vowel (like 'y' in English)
0E2D #Thai consonant or-ang can also be a vowel
0E27 #Thai consonant wo-wen is only a vowel following mai han-akat
END
}

2) Is it possible to define a double-character property? For example, the Thai 'r' becomes a vowel if, and only if, there are two of them together, as in 'rr'. It is then pronounced differently, and is no longer strictly an 'r'. How might I handle this? I suppose this would require some look-ahead assertions...but could these be incorporated into the subroutine in some way?

Thank you so much for your helpfulness!

Blessings,

~Polyglot~


Re^3: Creating new character classes for foreign languages
by Anonymous Monk on May 17, 2009 at 04:13 UTC
    Not like that, since you must return a specially formatted string. Something like this would work:
    sub InThaiVowel {
        return join "\n",
            '0E30 0E45',
            '0E4D',
            '0E22', #Thai consonant yo-yak can also be a vowel (like 'y' in English)
            '0E2D', #Thai consonant or-ang can also be a vowel
            '0E27', #Thai consonant wo-wen is only a vowel following mai han-akat
    }
Re^3: Creating new character classes for foreign languages
by JavaFan (Canon) on May 17, 2009 at 10:08 UTC
    Is it possible to define a double-character property?
    No.

    You are, after all, defining a character class. A character class almost always [1] matches exactly one character, never taking context into account.

    How might I handle this?
    Define a rule, not a character class. Out of curiosity, which 'r' in 'rr' is the vowel? First one, second one, or both?

    [1] The only exception I can think of is case-insensitive matching, where Unicode defines the "other case" of a character as a multi-character sequence.
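
One way to read "define a rule, not a character class" is an ordinary regex that matches the two-character sequence before it tries the single character. The following is only a minimal sketch: it assumes the Thai 'r' in question is U+0E23 (THAI CHARACTER RO RUA), and the names `$RO`, `$RR_OR_R` and `classify_r` are hypothetical, not from the thread.

```perl
use 5.010;          # named captures need Perl 5.10 or later
use strict;
use warnings;

# Assumption: the Thai 'r' discussed above is U+0E23 (THAI CHARACTER RO RUA).
my $RO = "\x{0E23}";

# Hypothetical rule: a doubled 'r' is one unit, a lone 'r' is a plain consonant.
# The two-character alternative comes first so it wins when both could match.
my $RR_OR_R = qr/(?<double>$RO$RO)|(?<single>$RO)/;

# Return 'double' or 'single' for each occurrence of 'r' in the text.
sub classify_r {
    my ($text) = @_;
    my @kinds;
    while ($text =~ /$RR_OR_R/g) {
        push @kinds, defined $+{double} ? 'double' : 'single';
    }
    return @kinds;
}

# 'rr', then U+0E01, then a lone 'r'
my @kinds = classify_r("$RO$RO\x{0E01}$RO");
print join(',', @kinds), "\n";   # double,single
```

The character class still answers "is this code point an 'r'?"; the context-dependent part lives in the pattern around it.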

      The double 'r' ends up sounding like "un", so I suppose that, technically, the first 'r' becomes the vowel 'u' while the second converts to an 'n'. However, they are considered as a single unit, much like the 'll' or 'ch' have their own places in alphabetical order for Spanish, as if they were single letters.

      Now, I've seen that the subroutines in the package file follow a specific syntax...what does a rule look like in a package file?

      Also, I had a little trouble when putting my new package to use, in that the "shortcut method" in the final routine here failed, and I ended up hard-coding the code points for those characters.

      sub InThaiHCons { #High-class consonants
          return <<'END';
      0E02
      0E03
      0E09
      0E10
      0E16
      0E1C
      0E1D
      0E28
      0E29
      0E2A
      0E2B
      END
      }

      sub InThaiMCons { #Middle-class consonants
          return <<'END';
      0E01
      0E08
      0E0E
      0E0F
      0E14
      0E15
      0E1A
      0E1B
      0E2D
      END
      }

      ################################ Low-class consonants

      =for NON-WORKING EXAMPLE

      sub InThaiLCons { #THIS DIDN'T WORK
          return <<'END';
      +Thai::InThaiCons
      -Thai::InThaiHcons
      -Thai::InThaiMCons
      END
      }

      =cut

      sub InThaiLCons { #THIS DOES WORK
          return <<'END';
      0E04 0E07
      0E0A 0E0D
      0E11 0E13
      0E17 0E19
      0E1E 0E27
      0E2C
      0E2E
      END
      }

      Why?

      Thanks so much for your help!

      Blessings,

      ~Polyglot~

        However, they are considered as a single unit, much like the 'll' or 'ch' have their own places in alphabetical order for Spanish, as if they were single letters

        This was changed fifteen years ago (probably to make programmers happier). Now, officially, "ch" is sorted between "cg" and "ci" even if it is still considered a single letter and the same applies to "ll" (see http://en.wikipedia.org/wiki/Spanish_language#Writing_system).

        Regarding the thing that didn't work, did you leave something out from the code that you posted? The "non-working" definition for "InThaiLCons" makes a reference to a sub called "Thai::InThaiCons", but there is no such subroutine in the code you posted. Clearly, referring to a non-existent subroutine will lead to failure.

        Based on what you've posted, it looks like you can simply define your three subsets of consonants explicitly (including your definition of "InThaiLCons" that does work), and then create an overall "InThaiCons" sub by adding together the three subsets:

        sub InThaiCons {
            return <<END;
        +Thai::InThaiHCons
        +Thai::InThaiMCons
        +Thai::InThaiLCons
        END
        }
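
For comparison, the subtraction approach from the "non-working" example can be made to run once every referenced sub exists and is spelled exactly as defined (names are case-sensitive, so `InThaiHcons` is not `InThaiHCons`). This is only a sketch: the `InThaiCons` range U+0E01 to U+0E2E is an assumption, not something the thread defines, and it also takes in U+0E24 and U+0E26, which the thread does not classify.

```perl
use strict;
use warnings;

package Thai;

# High-class and middle-class consonants, code points as listed in the thread.
sub InThaiHCons {
    return join "\n", qw(0E02 0E03 0E09 0E10 0E16 0E1C 0E1D 0E28 0E29 0E2A 0E2B), '';
}
sub InThaiMCons {
    return join "\n", qw(0E01 0E08 0E0E 0E0F 0E14 0E15 0E1A 0E1B 0E2D), '';
}

# Assumption: every code point from U+0E01 through U+0E2E counts as a consonant.
sub InThaiCons { return "0E01\t0E2E\n" }

# Low class = all consonants minus the high and middle classes.
# References must be fully qualified and match the sub names exactly.
sub InThaiLCons {
    return "+Thai::InThaiCons\n-Thai::InThaiHCons\n-Thai::InThaiMCons\n";
}

package main;

# Outside package Thai, the property name needs its package in \p{...}.
for my $cp (0x0E02, 0x0E01, 0x0E04) {
    printf "U+%04X %s\n", $cp,
        chr($cp) =~ /\p{Thai::InThaiLCons}/ ? 'low' : 'not low';
}
```

With these definitions U+0E02 (high) and U+0E01 (middle) are reported as not low, and U+0E04 as low.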
Re^3: Creating new character classes for foreign languages
by graff (Chancellor) on May 17, 2009 at 16:17 UTC
    Is it possible to define a double-character property? For example, the Thai 'r' becomes a vowel if, and only if, there are two of them together, as in 'rr'. It is then pronounced differently, and is no longer strictly an 'r'.

    Here you are moving away from strictly orthographic matters into phonetics or phonology, which are essentially context-dependent, and this takes you out of the domain of merely classifying letter symbols into related groups, which is essentially not context-dependent.

    If the goal is to provide a means for doing correct word segmentation of Thai text, the handling of the context-dependent rules (like "rr" becomes "un") should probably be in a separate module. The functions that work on sequences of characters will depend on the functions that define the basic character classes.

    (You probably could put the subroutines for character-classes and context-dependent rules together in one module if you want to, but the two sets of subroutines will have very different usages from the caller's point of view. And the overall problem being addressed is probably complicated enough that you will want to segregate portions of the solution into separate modules anyway.)
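
A rough sketch of that split, with the context-free class in one package and a sequence-level function in another that depends on it. The package name `ThaiRules`, the property `InThaiRo` and the function `split_units` are hypothetical, and U+0E23 (THAI CHARACTER RO RUA) is assumed to be the 'r' under discussion.

```perl
use strict;
use warnings;

package Thai;        # context-free character classes

# Assumption: U+0E23 is the Thai 'r' discussed in this thread.
sub InThaiRo { return "0E23\n" }

package ThaiRules;   # hypothetical package for context-dependent rules

# Split text into units, keeping a doubled 'r' together as a single unit.
# Everything else is returned one character at a time.
sub split_units {
    my ($text) = @_;
    return $text =~ /\p{Thai::InThaiRo}{2}|./gs;
}

package main;

# U+0E01, 'r', 'r', U+0E21: the doubled 'r' should come back as one unit.
my @units = ThaiRules::split_units("\x{0E01}\x{0E23}\x{0E23}\x{0E21}");
print scalar(@units), "\n";   # 3
```

A caller that only needs classification uses `\p{Thai::InThaiRo}` directly; a caller that needs segmentation goes through `ThaiRules`, which is the difference in usage described above.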

    Just curious: have you looked at Lingua::TH::Segmentation? I just happened to notice it was there, but I haven't tried it. Have you?

      Yes, I have looked at that Lingua::TH module. It fails to build on my system, and I have a hard enough time troubleshooting my own code, much less someone else's. The .pm file it has is only 2.2k, which amounts to a very slim algorithm for splitting Thai, as Thai is rather a complex problem when it comes to splitting. I'm actually leaning toward a lexical approach, and working on building a word list in Thai.

      In fact, I encountered "wrong number of arguments" errors when running the 'perl Makefile.PL' command, and had to comment out about five lines in Makefile.PL before it would run...only to see a warning that the library file it referred to was not present. So I'm thinking that it was designed to accompany some additional file, possibly a word lexicon.

      This is one of the reasons I'm embarking on this journey now. There is virtually nothing in CPAN for the Thai language, or for Lao either. (And I did some reading on CPAN today, having never submitted anything there before, and learned that a module's NAMESPACE is supposed to be community directed...but I know of no Thai community among Perl monks.)

      My needs go beyond splitting syllables. I plan to create a program which will translate Thai to Lao. There are some specific vowels and consonants that must be transposed in the exchange. Syllable splitting is a beginning, but only a part of the process. These tools I am packaging would be useful for many other purposes as well.

      Blessings,

      ~Polyglot~

        So I'm thinking that it was designed to accompany some additional file, possibly a word lexicon.

        Yes, that module is clearly intended to serve only as a wrapper around a separate compiled software library (not written in perl), provided here: http://thaiwordseg.sourceforge.net/.

        You have to install that library first (which will probably involve a simple sequence like ./configure; make; make install), and then try installing the perl module, which should include some tests that confirm whether the library was found and turns out to work as intended.

        I have nothing to add here, but I want to say that this is a fascinating thread and I want to thank you for starting it.
