Namespace/advice for new CPAN modules for Thai & Lao

by Polyglot (Chaplain)
on Mar 22, 2015 at 18:56 UTC

Polyglot has asked for the wisdom of the Perl Monks concerning the following question:

I just signed up on PAUSE, and am finally willing to submit my first modules. Everyone recommends that newbies get advice on how to do this, especially as it pertains to the naming of the modules, so I present the matter here for your inspection. Having tried to research the matter, I'm a little conflicted about which category these modules would best fit, so your advice is much appreciated.

In short, these are very basic modules for handling the Thai and Lao character sets. In Thai and in Lao, each character/codepoint can have one or more categorizations, like vowel/consonant and uppercase/lowercase in English, but more complex. Perl's built-in Unicode support offers only the /\p{InThai}/ method of identification, so my module expands the set of regexp tokens to include properties such as:

  • \p{InThaiCons} (consonants)
  • \p{InThaiLCons} (low-class consonants)
  • \p{InThaiMCons} (mid-class consonants)
  • \p{InThaiHCons} (high-class consonants)
  • \p{InThaiVowel} (all possible vowels)
  • \p{InThaiPreVowel} (vowels that precede their consonants)
  • etc. (see more in code below)

This is a module that will be useful in any textual manipulation, such as word/syllable identification or splitting (words are not normally split with whitespace as in English). It is a very simple module, whose features may be amended/augmented in the future with some additional capability, but whose present utility is readily apparent.

Now for the code example. I'll present the Thai module; the Lao one is nearly the same, but operates on the Lao character set.

package Regexp::Thai::CharClasses;

use 5.008003;
use strict;
use warnings;

require Exporter;

our $VERSION = '1.01';
our @ISA     = qw(Exporter);
our @EXPORT  = qw(
    InThai
    InThaiAlpha
    InThaiCons
    InThaiHCons
    InThaiMCons
    InThaiLCons
    InThaiVowel
    InThaiPreVowel
    InThaiPostVowel
    InThaiCompVowel
    InThaiDigit
    InThaiTone
    InThaiPunct
);

=head1 NAME

Regexp::Thai::CharClasses - useful character properties for Unicode Thai

=head1 SYNOPSIS

    use Regexp::Thai::CharClasses;
    $char = "...";               # some UTF8 string
    $char =~ /\p{InThaiCons}/;   # match a Thai consonant
    $char =~ /\p{InThaiTone}/;   # match a Thai tone mark
    # see description for full set of terms

=head1 DESCRIPTION

This module supplements the Unicode character-class definitions with
special groups relevant to Thai linguistics. The following classes are
defined:

=over 4

=item InThai

Matches ALL characters in the Thai unicode code-point range.

=item InThaiCons

Matches Thai consonant letters, leaving out vowels, numerics, tone marks,
etc.

=item InThaiVowel

Matches Thai vowels, including compounded and free-standing vowels.
NOTE: Exceptions here include several of the "consonants" which also serve
as vowels: or-ang, yo-yak, double ro-reua, leut and reut, and wo-wen.
These are included as vowels in this grouping to accept the widest possible
definition, but cannot with certainty be determined by this to be in use
as actual vowels in the instance of their identification here.

=item InThaiAlpha

Matches only the Thai alphabetic characters (consonants and vowels),
excluding all digits, tone marks, and punctuation marks.

=item InThaiTone

Matches only the Thai tone marks, leaving out all letters, digits and
punctuation marks.

=item InThaiPunct

Matches Thai punctuation characters, not including tone marks, white
space, digits or alphabetic characters, and not including non-Thai
punctuation marks (such as English [.,'"!?] etc.).

=item InThaiCompVowel

Matches only the Thai vowels which are compounded with a Thai consonant,
and matching only the vowel portion of the compounded character.

=item InThaiPreVowel

Matches only the subset of vowels which appear _before_ the consonant
with which they are associated (though in Thai they are sounded _after_
said consonant); this excludes all consonant-vowels and does not include
any of the compounded vowels.

=item InThaiPostVowel

Matches only the vowels which appear _after_ the consonant with which
they are associated; this excludes all consonant-vowels and does not
include any of the compounded vowels.

=item InThaiHCons

Matches high-class Thai consonants.

=item InThaiMCons

Matches middle-class Thai consonants.

=item InThaiLCons

Matches low-class Thai consonants.

=item InThaiDigit

Matches Thai numerical digits only.

=back

=cut

# Each sub returns a list of code points, one entry per line: a single hex
# number is one code point, two hex numbers on a line are an inclusive range.

sub InThai {
    return <<'END';
0E01 0E5B
END
}

sub InThaiCons {
    return <<'END';
0E01 0E2E
END
}

sub InThaiVowel {
    return join "\n",
        '0E30 0E45',
        '0E47',    # Thai semi-tone mark used above gor-gai in Thai "gor" (or)
        '0E4D',
        '0E22',    # Thai consonant yo-yak can also be a vowel (like 'y' in English)
        '0E2D',    # Thai consonant or-ang can also be a vowel
        '0E27';    # Thai consonant wo-wen is only a vowel following mai han-akat
}

sub InThaiAlpha {
    return <<'END';
0E01 0E2E
0E30 0E45
0E47
0E4D
0E22
0E2D
0E27
END
}

sub InThaiTone {
    return <<'END';
0E48 0E4B
END
}

sub InThaiPunct {
    return <<'END';
0E46
0E4C
0E4E
0E4F
0E5A
0E5B
END
}

sub InThaiCompVowel {
    return join "\n",
        '0E31',    # Thai mai han-akat
        '0E34',    # Thai sara-i
        '0E35',    # Thai sara-ii
        '0E36',    # Thai sara-ue
        '0E37',    # Thai sara-uee
        '0E38',    # Thai sara-u
        '0E39',    # Thai sara-uu
        '0E3A',    # Thai phinthu
        '0E47';    # Thai semi-tone mark used above gor-gai in Thai "gor" (or)
}

sub InThaiPreVowel {
    return <<'END';
0E40 0E44
END
}

sub InThaiPostVowel {
    return <<'END';
0E45
0E30
0E32
0E33
END
}

sub InThaiHCons {
    return <<'END';
0E02
0E03
0E09
0E10
0E16
0E1C
0E1D
0E28
0E29
0E2A
0E2B
END
}

sub InThaiMCons {
    return <<'END';
0E01
0E08
0E0E
0E0F
0E14
0E15
0E1A
0E1B
0E2D
END
}

sub InThaiLCons {
    return <<'END';
0E04 0E07
0E0A 0E0D
0E11 0E13
0E17 0E19
0E1E 0E27
0E2C
0E2E
END
}

sub InThaiDigit {
    return <<'END';
0E50 0E59
END
}

=head1 AUTHOR

Erik Mundall

=head1 COPYRIGHT

Copyright (C) 2015 Erik Mundall. All Rights Reserved.

This is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.

=cut

1;
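For anyone who wants to try the idea without installing anything, here is a minimal sketch (not the module itself). It copies two of the property subs above into the main package, where \p{...} can find them by name, and classifies a few code points. The helper name thai_kind is made up for illustration, and the ranges are written with "\t" between the two hex numbers.

```perl
use strict;
use warnings;

# Two properties copied from the posted module, defined inline so that
# this sketch runs on its own. Each returns "start<TAB>end" as hex.
sub InThaiCons { return "0E01\t0E2E\n" }
sub InThaiTone { return "0E48\t0E4B\n" }

# Hypothetical helper: say which of the two classes a single character is in.
sub thai_kind {
    my ($char) = @_;
    return 'consonant' if $char =~ /\A\p{InThaiCons}\z/;
    return 'tone mark' if $char =~ /\A\p{InThaiTone}\z/;
    return 'other';
}

# U+0E01 (ko kai), U+0E48 (mai ek), U+0E32 (sara aa)
for my $char ("\x{0E01}", "\x{0E48}", "\x{0E32}") {
    printf "U+%04X is classified as: %s\n", ord($char), thai_kind($char);
}
```

With the real module loaded, the two inline subs would simply be replaced by the use line from the SYNOPSIS.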

For names, I've considered Lingua and some others, but this is so directly Regexp related as to make me think it might better live there. I'm fully open to suggestions. As an entirely self-taught coder who is only a hobbyist at it, and a teacher by trade, I'm also open to corrections on the code itself. Regarding the "Export" feature, I know that it is deprecated to export all the functions, but I just cannot visualize the need to separate these out--like, how often would someone want to know only the vowels, and, if so, how much would be gained by specifying only such? The added complexity, versus the matter of namespace, seems to my mind to be a net disadvantage considering the namespace here is very specific as it is and unlikely to present a problem. Yet I will readily listen to those of greater experience.

LATEST UPDATE:

Suggested names so far have included:

  • Unicode::X::Y
  • Lingua::X::Y
  • Regexp::Thai::Properties
  • Regexp::Thai::X
  • Encode::InCharset::Polyglot::Thai
  • Encode::Th::PolyglotProperties
  • Regexp::UTF8::Thai
  • Regexp::Thai::UTF8
  • Regexp::CharProps::Thai

At this point, I've updated the name of the package above to reflect what I am most strongly leaning toward, a slight modification of the suggestions presented in the list above: Regexp::Thai::CharClasses. The floor is still open for suggestions.

Thank you for your help.

Blessings,

~Polyglot~

Replies are listed 'Best First'.
Re: Namespace/advice for new CPAN modules for Thai & Lao
by Laurent_R (Canon) on Mar 22, 2015 at 22:50 UTC
    I am really not a specialist, but just my 2-cents. On the name exporting question, I think there would probably be a number of cases where you would need only some of your regex categories (say, just numbers, InThaiDigit, or perhaps InThaiPunct) without any need for the others.

    You could have an 'all' group of names, to be used with something like this:

    use Regexp::Thai ':all';
    when you want to import the whole shebang.

    I think it is a little bit cleaner to do it this way, and it would probably be a bit easier to manage if you add new features in the future.

    Having said that, this may not be so important. The user can always do something like this:

    use Regexp::Thai ();                                      # import nothing
    use Regexp::Thai qw(the_specific_function_that_I_need);   # import only what is needed
    to prevent unwanted imports.

    In the modules that I wrote, I usually exported automatically only the functions that are absolutely needed for the rest of the module to work properly (for example, an init function that must be called before any other function of the module will work).

    But again, these are just my 2-cents, I am really not an expert on this subject.

    Je suis Charlie.
Re: Namespace/advice for new CPAN modules for Thai & Lao
by Thoughtstream (Novice) on Mar 23, 2015 at 05:02 UTC

    In naming a module like this, which improves the core features of Perl, it can help to think ahead to what else you plan to do with it, and what other modules by other authors you imagine might coexist in the namespace.

    For example, if you choose Regexp::Thai, then you are basically accepting responsibility for everything Thai-related in regexes. When someone else wants to add some other feature for using Thai with regexes, they'll now have to find another, less appropriate name. Or, at least, use a sub-namespace, which may be confusing to users when their module is (in most respects) unrelated to yours.

    Or, when you want to add other Thai-related regex features, you're going to have to expand that same module (because it has already "claimed" the general name). Those extra, perhaps only loosely related features, will complicate the module's interface, and make it "heavier" to load for your existing users too.

    So perhaps Regexp::Thai::Properties would be a better name? That way, you leave the higher-level Regexp::Thai name free...maybe for a later module that loads all the Regexp::Thai::<whatever> modules that you and others have eventually contributed.

    And, at the same time, you provide a good naming pattern for others to follow. Perhaps later there will be a Regexp::Thai::Debug, or a Regexp::Thai::Transliterate, or a Regexp::Thai::Common, etc. By creating the namespace, but not pre-empting it, you may eventually encourage a larger, richer and more consistently named ecosystem.

    Damian

      Unfortunately, there seems to already be some significant namespace pollution with reference to Thai-language routines. The two-digit code for the language is "TH" and this has been picked up and used as an abbreviation by at least one module contributor, who has contributed many modules using the name, as an abbreviation for "type handler" (or so it seems). One module contributor appears even to have used the full word "Thai" to indicate the concept of light weight (is that an allusion to boxing?), and the module has nothing to do with the Thai language as far as I could tell, even having looked at the code. So, though there are hundreds of modules that can be found in searching for "Thai" or "TH" on CPAN, only about five actually have anything to do with Thai. We are so far from having any tools in Thai, that it would be a wonder if we could ever run out of its namespace in my lifetime.

      Most new programmers over here are learning PHP and Java. I wish we did have more who would interest themselves in Perl, and I am trying to interest young people to take it up whenever I have an opportunity. Meanwhile, we have next to nothing.

      One of the features I was thinking to add would be a Romanization subroutine, which transliterates the Thai to a Roman alphabet, as you alluded to. However, that is not strictly a Regexp issue in any case. The module as it stands has only a few more Regexp-related routines which might be included to make it about as complete as is possible for the Thai language, as I see it. Then what? The regexp engine itself is part of Perl, and works just fine with the addition of these "hooks" that we are adding in this module. This module is so "core" in extending that capability, that it can hardly be more basic than it is. Any additional module might be added to it, and I wouldn't mind at all extending coauthorship to someone else who wishes to help, perhaps with such additions as:

      • Regexp::Thai::LongStrings
      • Regexp::Thai::Romanize
      • Regexp::Thai::Assemble (assuming that the current Regexp::Assemble would not accommodate Thai)
      • etc.

      Blessings,

      ~Polyglot~

        The previous discussion has some good advice.

        The CPAN guideline saying Unicode:: is off-limits applies to CPAN, but if you think your module belongs in CORE then you should email the perl5-porters first and find out.

        So the steps are:

        1) look at the existing Unicode:: modules in CORE and decide if your module belongs in CORE (ask p5p if necessary)
        2) if not, pick a namespace and name
        3) decide how to generate the CPAN boilerplate files (I use makemaker); there is a non-trivial amount of work to make a nice CPAN module these days!
        4) add some tests
        5) try your distro on different machines and when you're happy request a CPAN account and upload it.
        6) wait for CPAN testers to score it and fix it
        7) bask in the glory of being a CPAN contributor, along with the other 10,000 members! :)

        Later, James.

Re: Namespace/advice for new CPAN modules for Thai & Lao
by Anonymous Monk on Mar 23, 2015 at 01:32 UTC

    I don't think there's one "correct" answer... If your modules mostly deal with Unicode issues (like properties), then perhaps the Unicode:: namespace might be appropriate. If your modules mostly provide regexes or extend regexes, Regexp:: does seem appropriate. If your modules provide a mix of features, but they're language-specific, then Lingua:: seems like a good place. The example you've shown seems like it might fit into Unicode::, but it also depends on the other modules in the distro.

    As for exporting, if you've got a module that only exports functions named like the example you showed, then automatically exporting all of those is probably not really that bad, since they're probably unlikely to collide with existing functions. Then again, "modern" Perl modules generally don't do that so they don't flood the user's namespace, and Laurent_R is right that adding an :all tag is pretty easy:

    use Exporter 'import';
    our @EXPORT_OK = qw/ ... /;
    our %EXPORT_TAGS = ( all => \@EXPORT_OK );

    (Note the use Exporter 'import'; instead of adding Exporter to @ISA. This prevents your module from inheriting several other Exporter functions and avoids changing your module's @ISA, which may be important for modules that also offer an OO interface.)
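Putting the export advice from this subthread together, a single-file sketch might look like the following. The package name My::Thai::Props is hypothetical and only two properties are included; the package lives in a BEGIN block so that everything runs from one file, whereas a real module would sit in its own .pm file and be loaded with use.

```perl
use strict;
use warnings;

BEGIN {
    package My::Thai::Props;    # hypothetical name, for illustration only

    use Exporter 'import';      # borrow Exporter's import() without touching @ISA

    # Nothing is exported by default; callers ask for what they need.
    our @EXPORT_OK   = qw(InThaiDigit InThaiTone);
    our %EXPORT_TAGS = ( all => \@EXPORT_OK );

    sub InThaiDigit { return "0E50\t0E59\n" }
    sub InThaiTone  { return "0E48\t0E4B\n" }
}

# For a module installed from a file, the equivalent would be:
#   use My::Thai::Props qw(InThaiDigit);     # just one property
#   use My::Thai::Props qw(:all);            # everything in @EXPORT_OK
BEGIN { My::Thai::Props->import('InThaiDigit') }

print "Thai digit five matched\n" if "\x{0E55}" =~ /\p{InThaiDigit}/;
print "InThaiTone was not imported\n" unless defined &main::InThaiTone;
```

The trade-off is the one discussed above: callers type a little more, but only the property names they request end up in their namespace.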

Re: Namespace/advice for new CPAN modules for Thai & Lao
by Anonymous Monk on Mar 23, 2015 at 08:11 UTC

    First thoughts, see Encode::Encoding, Creating (and using) a custom encoding., Re^2: Creating (and using) a custom encoding. (fudge :encoding(rot13))

    Second thoughts, sub In[A-Z]\w+ · CPAN->grep
    Encode-JP-Emoji-0.60/lib/Encode/JP/Emoji/Mapping.pm
    Encode-InCharset-0.03/InCharset/8859_1.pm
    Encode-JP-Mobile-0.30/lib/Encode/JP/Mobile.pm
    Lingua-JA-Moji-0.36/lib/Lingua/JA/Moji.pm
    Sub::CharacterProperties - Support for user-defined character properties
    Encode::InCharset - defines \p{InCharset}

    POD thoughts: the =head1 NAME section ("Thai - useful ch") should match the package name, Regexp::Thai; that is, it should read "Module::Name - module description".

    Name thoughts: say no to "Regexp::Thai", say no to Regexp, stick to Encode or Lingua ... consider vanity naming ... Encode::InCharset::Polyglot::Thai, Encode::Th::PolyglotProperties... whatever makes the most sense with how your contribution improves the situation regarding Thai (is it generically useful or just for your program?)

    Maybe ask on http://prepan.org/

      I'd like to ask at prepan, but they seem to be rather elite, only permitting one to log in via twitter or github, neither of which I have. And I am not about to start up with twitter, so I may have to forgo the prepan experience.

      You may not have grasped the utility of the module I'm proposing. It is not an encoding issue. It has nothing, actually, to do with encoding. It does operate within the auspices of Unicode, being specifically designed for UTF-8. But it does not convert anything, it simply identifies what is already there. Basically, it adds tokens to the regexp engine so that additional characters can be recognized within the Thai/Lao language. For example, in English, you can identify a space in a number of ways:

      • /[ ]/
      • /\s/
      • /\p{IsSpace}/

      If you didn't have those options, you would be unable to find a space for your regexp to work with. English regexes can distinguish between alphabetical (word) characters (\w) and numerical digits (\d), etc. Until now, there has been no way to make distinctions like these within Thai or Lao. My modules provide these tools for Thai and Lao so that the languages can be more readily parsed with regexes. What I'm really doing with this module is adding character classes to the standard Unicode properties, as can be found listed on pp. 167 - 175 in Programming Perl, 3rd Edition.
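To make the comparison with \d concrete, here is a small sketch, assuming the two property subs are visible in the current package (they are copied inline from the posted module here rather than imported). The helper names thai_numbers and count_thai_non_digits are hypothetical. It shows a property used as a token, with a quantifier, and inside a negative lookahead to subtract one class from another.

```perl
use strict;
use warnings;

# Copied from the posted module so the sketch is self-contained.
sub InThai      { return "0E01\t0E5B\n" }
sub InThaiDigit { return "0E50\t0E59\n" }

# Hypothetical helper: find runs of Thai digits and return them as ASCII digits.
sub thai_numbers {
    my ($text) = @_;
    my @numbers;
    while ($text =~ /(\p{InThaiDigit}+)/g) {
        my $run = $1;
        # Thai digits are contiguous: U+0E50 is zero, U+0E59 is nine.
        $run =~ s/(\p{InThaiDigit})/ord($1) - 0x0E50/ge;
        push @numbers, $run;
    }
    return @numbers;
}

# Hypothetical helper: count characters in the Thai range that are not digits.
sub count_thai_non_digits {
    my ($text) = @_;
    my $count = () = $text =~ /(?!\p{InThaiDigit})\p{InThai}/g;
    return $count;
}

# "pi" (year) followed by the Thai digits 2 5 5 8, written with escapes
my $text = "\x{0E1B}\x{0E35} \x{0E52}\x{0E55}\x{0E55}\x{0E58}";
print "numbers found: @{[ thai_numbers($text) ]}\n";
print "non-digit Thai characters: ", count_thai_non_digits($text), "\n";
```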

      I certainly appreciate your input, but I don't see much of a direct relationship between my module and the Encode:: line of tools. With all due respect, this topic has frustrated me. I had expected a little more unanimity among the various responses, but I have discovered that everyone has a different perspective. At this point, it appears that no matter what I might choose, it has a good chance of displeasing the majority. That's not a fun position to see oneself in.

      Blessings,

      ~Polyglot~

        ... or github, neither of which I have.

        Are you using Git or another VCS? Because if not, it's a really good idea, and Git / GitHub makes collaboration much easier.

        I had expected a little more unanimity among the various responses, but I have discovered that everyone has a different perspective. At this point, it appears that no matter what I might choose, it has a good chance of displeasing the majority.

        It's a community, not a centrally governed system with strict rules :-) Releasing useful, tested code publicly already gives you a lot of points, so it's unlikely to upset anyone unless you do so without thought in a top-level namespace; and it seems like you're putting a whole lot of thought into it. If you want to play it safe, start off in an X::Y::Z namespace; I think Thoughtstream gave some good advice in that respect above.

        good chance of displeasing the majority
        Well, there hopefully is a difference between not pleasing them and explicitly displeasing.
        That said, what about Regexp::UTF8::Thai (Update: or Regexp::Thai::UTF8)?
