Polyglot has asked for the wisdom of the Perl Monks concerning the following question:
I just signed up on PAUSE, and am finally willing to submit my first modules. Everyone recommends that newbies get advice on how to do this, especially as pertains to the naming of the modules, so I present the matter here for your inspection. Having tried to research the matter, I'm a little conflicted about which category these modules would best fit, so your advice is much appreciated.
The synopsis is that these are very basic modules for handling the Thai and Lao character sets. In Thai and in Lao, each character/codepoint can have one or more categorizations, like vowel/consonant and uppercase/lowercase in English, but more complex. The current Unicode support in Perl allows only the /\p{InThai}/ method of identification, so my module expands the regexp tokens to include the following:
- \p{InThaiCons} (consonants)
- \p{InThaiLCons} (low-class consonants)
- \p{InThaiMCons} (mid-class consonants)
- \p{InThaiHCons} (high-class consonants)
- \p{InThaiVowel} (all possible vowels)
- \p{InThaiPreVowel} (vowels that precede their consonants)
- etc. (see more in code below)
This is a module that will be useful in any textual manipulation, such as word/syllable identification or splitting (words are not normally split with whitespace as in English). It is a very simple module, whose features may be amended/augmented in the future with some additional capability, but whose present utility is readily apparent.
Now, for the code example. I'll present the Thai one; the Lao module is nearly the same, but works on the Lao charset.
package Regexp::Thai::CharClasses;
use 5.008003;
use strict;
use warnings;
require Exporter;
our $VERSION = '1.01';
our @ISA = qw(Exporter);
our @EXPORT = qw(
InThai InThaiCons InThaiHCons InThaiMCons InThaiLCons InThaiVowel
InThaiPreVowel InThaiPostVowel InThaiCompVowel InThaiDigit InThaiTone
InThaiPunct InThaiAlpha
);
=head1 NAME
Regexp::Thai::CharClasses - useful character properties for Unicode Thai
=head1 SYNOPSIS
use Regexp::Thai::CharClasses;
$char = "..."; # some UTF8 string
$char =~ /\p{InThaiCons}/; # match a Thai consonant
$char =~ /\p{InThaiTone}/; # match a Thai tone mark
# see description for full set of terms
=head1 DESCRIPTION
This module supplements the Unicode character-class definitions with
special groups relevant to Thai linguistics. The following classes
are defined:
=over 4
=item InThai
Matches ALL characters in the Thai unicode code-point range.
=item InThaiCons
Matches Thai consonant letters, leaving out vowels, numerics, tone marks, etc.
=item InThaiVowel
Matches Thai vowels, including compounded and free-standing vowels.
NOTE: Exceptions here include several of the "consonants" which also serve
as vowels: or-ang, yo-yak, double ro-reua, leut and reut, and wo-wen.
These are included as vowels in this grouping to accept the widest possible
definition, but cannot with certainty be determined by this to be in use
as actual vowels in the instance of their identification here.
=item InThaiAlpha
Matches only the Thai alphabetic characters (consonants and vowels),
excluding all digits, tone marks, and punctuation marks.
=item InThaiTone
Matches only the Thai tone marks, leaving out all letters,
digits and punctuation marks.
=item InThaiPunct
Matches Thai punctuation characters, not including tone marks,
white space, digits or alphabetic characters, and not including
non-Thai punctuation marks (such as English [.,'"!?] etc.).
=item InThaiCompVowel
Matches only the Thai vowels which are compounded with a Thai consonant,
and matching only the vowel portion of the compounded character.
=item InThaiPreVowel
Matches only the subset of vowels which appear _before_ the consonant
with which they are associated (though in Thai they are sounded _after_
said consonant); this excludes all consonant-vowels and does not include
any of the compounded vowels.
=item InThaiPostVowel
Matches only the vowels which appear _after_ the consonant with which
they are associated; this excludes all consonant-vowels and does not
include any of the compounded vowels.
=item InThaiHCons
Matches high-class Thai consonants.
=item InThaiMCons
Matches middle-class Thai consonants.
=item InThaiLCons
Matches low-class Thai consonants.
=item InThaiDigit
Matches Thai numerical digits only.
=back
=cut
sub InThai {
return <<'END';
0E01 0E5B
END
}
sub InThaiCons {
return <<'END';
0E01 0E2E
END
}
sub InThaiVowel {
return join "\n",
'0E30 0E45',
'0E47',#Thai semi-tone mark used above gor-gai in Thai "gor" (or)
'0E4D',
'0E22',#Thai consonant yo-yak can also be a vowel (like 'y' in English)
'0E2D',#Thai consonant or-ang can also be a vowel
'0E27',#Thai consonant wo-wen is only a vowel following mai han-akat
}
sub InThaiAlpha {
return <<'END';
0E01 0E2E
0E30 0E45
0E47
0E4D
0E22
0E2D
0E27
END
}
sub InThaiTone {
return <<'END';
0E48 0E4B
END
}
sub InThaiPunct {
return <<'END';
0E46
0E4C
0E4E
0E4F
0E5A
0E5B
END
}
sub InThaiCompVowel {
return join "\n",
'0E31',#Thai mai han-akat
'0E34',#Thai sara-i
'0E35',#Thai sara-ii
'0E36',#Thai sara-ue
'0E37',#Thai sara-uee
'0E38',#Thai sara-u
'0E39',#Thai sara-uu
'0E3A',#Thai phinthu
'0E47',#Thai semi-tone mark used above gor-gai in Thai "gor" (or)
}
sub InThaiPreVowel {
return <<'END';
0E40 0E44
END
}
sub InThaiPostVowel {
return <<'END';
0E45
0E30
0E32
0E33
END
}
sub InThaiHCons {
return <<'END';
0E02
0E03
0E09
0E10
0E16
0E1C
0E1D
0E28
0E29
0E2A
0E2B
END
}
sub InThaiMCons {
return <<'END';
0E01
0E08
0E0E
0E0F
0E14
0E15
0E1A
0E1B
0E2D
END
}
sub InThaiLCons {
return <<'END';
0E04 0E07
0E0A 0E0D
0E11 0E13
0E17 0E19
0E1E 0E27
0E2C
0E2E
END
}
sub InThaiDigit {
return <<'END';
0E50 0E59
END
}
=head1 AUTHOR
Erik Mundall
=head1 COPYRIGHT
Copyright (C) 2015 Erik Mundall. All Rights Reserved.
This is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.
=cut
1;
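For readers who want to try the idea without installing anything, here is a minimal, self-contained sketch. It is not the module itself: it copies two of the ranges above into the calling script, and the sample word is an illustrative choice that does not come from the post.

```perl
use strict;
use warnings;

# Assumption: two property subs copied from the listing above and defined
# directly in package main, so the script runs without the module installed.
# Each line of the returned string is a hex code point or a
# tab-separated range of code points.
sub InThaiCons { return "0E01\t0E2E\n" }
sub InThaiTone { return "0E48\t0E4B\n" }

# Illustrative sample word: KO KAI, MAI EK, O ANG, NO NU
my $word = "\x{0E01}\x{0E48}\x{0E2D}\x{0E19}";

# Collect every consonant and count the tone marks
my @consonants = $word =~ /(\p{InThaiCons})/g;
my $tone_count = () = $word =~ /\p{InThaiTone}/g;

printf "consonants: %d, tone marks: %d\n", scalar @consonants, $tone_count;
```

With the real module, the two sub definitions would be replaced by a single use line, and the properties would be looked up under the same names in the calling package.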
For names, I've considered Lingua and some others, but this is so directly Regexp related as to make me think it might better live there. I'm fully open to suggestions. As an entirely self-taught coder who is only a hobbyist at it, and a teacher by trade, I'm also open to corrections on the code itself. Regarding the "Export" feature, I know that exporting all the functions by default is discouraged, but I just cannot visualize the need to separate these out--like, how often would someone want to know only the vowels, and, if so, how much would be gained by specifying only such? The added complexity, weighed against the namespace concern, seems to my mind to be a net disadvantage, considering the names here are very specific as they are and unlikely to present a problem. Yet I will readily listen to those of greater experience.
LATEST UPDATE:
Suggested names so far have included:
- Unicode::X::Y
- Lingua::X::Y
- Regexp::Thai::Properties
- Regexp::Thai::X
- Encode::InCharset::Polyglot::Thai
- Encode::Th::PolyglotProperties
- Regexp::UTF8::Thai
- Regexp::Thai::UTF8
- Regexp::CharProps::Thai
At this point, I've updated the name of the package above to reflect what I am most strongly leaning toward, a slight modification of the suggestions presented in the list above: Regexp::Thai::CharClasses. The floor is still open for suggestions.
Thank you for your help.
Re: Namespace/advice for new CPAN modules for Thai & Lao
by Laurent_R (Canon) on Mar 22, 2015 at 22:50 UTC
I am really not a specialist, but just my 2-cents. On the name exporting question, I think there would probably be a number of cases where you would need only some of your regex categories (say, just numbers, InThaiDigit, or perhaps InThaiPunct) without any need for others.
You could have an 'all' group of names, to be used with something like this:
use Regexp::Thai ':all';
when you want to import the whole shebang.
I think it is a little bit cleaner to do it this way, and it would probably be a bit easier to manage if you add new features in the future.
Having said that, this may not be so important. The user can always do something like this:
use Regexp::Thai ();
use Regexp::Thai qw(the_specific_function_that_I_need);
to prevent unwanted imports.
In the modules that I wrote, I usually exported automatically only the functions that are absolutely needed for the rest of the module to work properly (for example, an init function that must be called before any other function of the module can be used).
But again, these are just my 2-cents, I am really not an expert on this subject.
Re: Namespace/advice for new CPAN modules for Thai & Lao
by Thoughtstream (Novice) on Mar 23, 2015 at 05:02 UTC
In naming a module like this, which improves the core features of Perl, it can help to think ahead to what else you plan to do with it, and what other modules by other authors you imagine might coexist in the namespace.
For example, if you choose Regexp::Thai, then you are basically accepting responsibility for everything Thai-related in regexes. When someone else wants to add some other feature for using Thai with regexes, they'll now have to find another, less appropriate name. Or, at least, use a sub-namespace, which may be confusing to users when their module is (in most respects) unrelated to yours.
Or, when you want to add other Thai-related regex features, you're going to have to expand that same module (because it has already "claimed" the general name). Those extra, perhaps only loosely related features, will complicate the module's interface, and make it "heavier" to load for your existing users too.
So perhaps Regexp::Thai::Properties would be a better name? That way, you leave the higher-level Regexp::Thai name free...maybe for a later module that loads all the Regexp::Thai::<whatever> modules that you and others have eventually contributed.
And, at the same time, you provide a good naming pattern for others to follow. Perhaps later there will be a Regexp::Thai::Debug, or a Regexp::Thai::Transliterate, or a Regexp::Thai::Common, etc. By creating the namespace, but not pre-empting it, you may eventually encourage a larger, richer and more consistently named ecosystem.
Damian
Unfortunately, there seems to already be some significant namespace pollution with reference to Thai-language routines. The two-digit code for the language is "TH" and this has been picked up and used as an abbreviation by at least one module contributor, who has contributed many modules using the name, as an abbreviation for "type handler" (or so it seems). One module contributor appears even to have used the full word "Thai" to indicate the concept of light weight (is that an allusion to boxing?), and the module has nothing to do with the Thai language as far as I could tell, even having looked at the code. So, though there are hundreds of modules that can be found in searching for "Thai" or "TH" on CPAN, only about five actually have anything to do with Thai. We are so far from having any tools in Thai, that it would be a wonder if we could ever run out of its namespace in my lifetime.
Most new programmers over here are learning PHP and Java. I wish we did have more who would interest themselves in Perl, and I am trying to interest young people to take it up whenever I have an opportunity. Meanwhile, we have next to nothing.
One of the features I was thinking to add would be a Romanization subroutine, which transliterates the Thai to a Roman alphabet, as you alluded to. However, that is not strictly a Regexp issue in any case. The module as it stands has only a few more Regexp-related routines which might be included to make it about as complete as is possible for the Thai language, as I see it. Then what? The regexp engine itself is part of Perl, and works just fine with the addition of these "hooks" that we are adding in this module. This module is so "core" in extending that capability, that it can hardly be more basic than it is. Any additional module might be added to it, and I wouldn't mind at all extending coauthorship to someone else who wishes to help, perhaps with such additions as:
- Regexp::Thai::LongStrings
- Regexp::Thai::Romanize
- Regexp::Thai::Assemble (assuming that the current Regexp::Assemble would not accommodate Thai)
- etc.
The previous discussion has some good advice.
The CPAN guideline saying Unicode:: is off-limits applies to CPAN, but if you think your module belongs in CORE then you should email the perl5-porters first and find out.
So the steps are:
1) look at the existing Unicode:: modules in CORE and decide if your module belongs in CORE (ask p5p if necessary)
2) if not, pick a namespace and name
3) decide how to generate the CPAN boilerplate files (I use MakeMaker); there is a non-trivial amount of work to make a nice CPAN module these days!
4) add some tests
5) try your distro on different machines and when you're happy request a CPAN account and upload it.
6) wait for CPAN testers to score it and fix it
7) bask in the glory of being a CPAN contributor, along with the other 10,000 members! :)
Later, James.
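As a rough illustration of the "add some tests" step, a first test file might look like the sketch below. It uses Test::More from the Perl core. The inline InThaiDigit sub is a stand-in copied from the listing so that the sketch runs on its own; a real t/ file would load the module instead, and the three test cases are my own choice.

```perl
use strict;
use warnings;
use Test::More tests => 3;

# Stand-in for the module under test (assumption: a real test file would
# say "use Regexp::Thai::CharClasses;" here instead of defining the sub).
sub InThaiDigit { return "0E50\t0E59\n" }

# Both ends of the digit range should match; a consonant should not.
like(   "\x{0E50}", qr/^\p{InThaiDigit}$/, 'THAI DIGIT ZERO matches' );
like(   "\x{0E59}", qr/^\p{InThaiDigit}$/, 'THAI DIGIT NINE matches' );
unlike( "\x{0E01}", qr/^\p{InThaiDigit}$/, 'KO KAI is not a digit' );
```

Checking the boundaries of each range, plus one code point just outside it, is a cheap way to catch off-by-one mistakes in the hex tables.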
Re: Namespace/advice for new CPAN modules for Thai & Lao
by Anonymous Monk on Mar 23, 2015 at 01:32 UTC
I don't think there's one "correct" answer... If your modules mostly deal with Unicode issues (like properties), then perhaps the Unicode:: namespace might be appropriate. If your modules mostly provide regexes or extend regexes, Regexp:: does seem appropriate. If your modules provide a mix of features, but they're language-specific, then Lingua:: seems like a good place. The example you've shown seems like it might fit into Unicode::, but it also depends on the other modules in the distro.
As for exporting, if you've got a module that only exports functions named like the example you showed, then automatically exporting all of those is probably not really that bad, since they're probably unlikely to collide with existing functions. Then again, "modern" Perl modules generally don't do that so they don't flood the user's namespace, and Laurent_R is right that adding an :all tag is pretty easy:
use Exporter 'import';
our @EXPORT_OK = qw/ ... /;
our %EXPORT_TAGS = ( all => \@EXPORT_OK );
(Note the use Exporter 'import'; instead of adding Exporter to @ISA: this prevents your module from inheriting several other Exporter functions and from having its @ISA changed, which may be important for modules that also offer an OO interface.)
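To make the suggestion concrete, here is a small single-file sketch of the opt-in export style. The package name My::Thai::Props is hypothetical and only two properties are included; the BEGIN blocks stand in for what a normal use statement would do if the package lived in its own .pm file.

```perl
use strict;
use warnings;

BEGIN {
    package My::Thai::Props;    # hypothetical name, for illustration only
    use Exporter 'import';      # borrow import() without touching @ISA

    our @EXPORT_OK   = qw(InThaiDigit InThaiTone);   # nothing exported by default
    our %EXPORT_TAGS = ( all => \@EXPORT_OK );       # enables the ':all' tag

    sub InThaiDigit { return "0E50\t0E59\n" }
    sub InThaiTone  { return "0E48\t0E4B\n" }
}

# Equivalent to: use My::Thai::Props qw(InThaiDigit);
BEGIN { My::Thai::Props->import('InThaiDigit') }

# The imported property is visible to regexes compiled in this package...
my $digit_ok = "\x{0E55}" =~ /\p{InThaiDigit}/ ? 1 : 0;

# ...while the property that was not requested stays out of the namespace.
my $tone_imported = main->can('InThaiTone') ? 1 : 0;

print "digit matched: $digit_ok, InThaiTone imported: $tone_imported\n";
```

A caller who wants everything would ask for ':all' instead of a single name, so the convenience of the original blanket export is kept as an explicit choice.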
Re: Namespace/advice for new CPAN modules for Thai & Lao
by Anonymous Monk on Mar 23, 2015 at 08:11 UTC
First thoughts, see Encode::Encoding, Creating (and using) a custom encoding., Re^2: Creating (and using) a custom encoding. (fudge :encoding(rot13))
Second thoughts, sub In[A-Z]\w+ · CPAN->grep
Encode-JP-Emoji-0.60/lib/Encode/JP/Emoji/Mapping.pm
Encode-InCharset-0.03/InCharset/8859_1.pm
Encode-JP-Mobile-0.30/lib/Encode/JP/Mobile.pm
Lingua-JA-Moji-0.36/lib/Lingua/JA/Moji.pm
Sub::CharacterProperties - Support for user-defined character properties
Encode::InCharset - defines \p{InCharset}
pod thoughts, the =head1 NAME section ("Thai - useful ch...") should match the package name (Regexp::Thai); that is, it should read Module::Name - module description
name thoughts, say no to "Regexp::Thai", say no to Regexp, stick to Encode or Lingua ... consider vanity naming ... Encode::InCharset::Polyglot::Thai, Encode::Th::PolyglotProperties... whatever makes the most sense with how your contribution improves the situation regarding Thai (is it generically useful or just for your program?)
Maybe ask on http://prepan.org/
I'd like to ask at prepan, but they seem to be rather elite, only permitting one to login via twitter or github, neither of which I have. And I am not about to start up with twitter, so I may have to forego the prepan experience.
You may not have grasped the utility of the module I'm proposing. It is not an encoding issue. It has nothing, actually, to do with encoding. It does operate within the auspices of Unicode, being specifically designed for UTF-8. But it does not convert anything, it simply identifies what is already there. Basically, it adds tokens to the regexp engine so that additional characters can be recognized within the Thai/Lao language. For example, in English, you can identify a space in a number of ways:
If you didn't have those options, you would be unable to find a space for your regexp to work with. English regexes can distinguish between alphabetical (word) characters (\w) and numerical digits (\d), etc. Until now, there is no way to do this in Thai or Lao. My modules are providing these tools for Thai and Lao so that the language can be more readily parsed via Regexp. What I'm really doing with this module is adding character classes to the standard Unicode properties, as can be found listed on pp. 167 - 175 in Programming Perl, 3rd Edition.
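Since the module is about adding character classes, it may be worth noting that Perl lets one user-defined property be built from others: a line beginning with "+" followed by a fully qualified property name includes that property. The sketch below is an illustration under that assumption. The combined property InThaiHighOrMidCons is a hypothetical name that is not part of the module, and the ranges are copied from the listing into package main so the script runs on its own.

```perl
use strict;
use warnings;

# Code points copied from the module's InThaiHCons and InThaiMCons tables.
sub InThaiHCons {
    return join '', map { "$_\n" }
        qw(0E02 0E03 0E09 0E10 0E16 0E1C 0E1D 0E28 0E29 0E2A 0E2B);
}
sub InThaiMCons {
    return join '', map { "$_\n" }
        qw(0E01 0E08 0E0E 0E0F 0E14 0E15 0E1A 0E1B 0E2D);
}

# Hypothetical combined property: the union of the two classes above.
# Included properties must be named with their package ("main::" here).
sub InThaiHighOrMidCons {
    return "+main::InThaiHCons\n+main::InThaiMCons\n";
}

# Probe one mid-class, one high-class and one low-class consonant.
my %is_match = map {
    sprintf('%04X', $_) => ( chr($_) =~ /\p{InThaiHighOrMidCons}/ ? 1 : 0 )
} ( 0x0E01, 0x0E02, 0x0E04 );

print "$_ => $is_match{$_}\n" for sort keys %is_match;
```

The same approach could let a class such as InThaiAlpha be expressed in terms of InThaiCons and InThaiVowel rather than repeating their ranges, so that a correction to one table carries over automatically.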
I certainly appreciate your input, but I don't see much of a direct relationship between my module and the Encode:: line of tools. With all due respect, this topic has frustrated me. I had expected a little more unanimity among the various responses, but I have discovered that everyone has a different perspective. At this point, it appears that no matter what I might choose, it has a good chance of displeasing the majority. That's not a fun position to see oneself in.
... or github, neither of which I have.
Are you using Git or another VCS? Because if not, it's a really good idea, and Git / GitHub makes collaboration much easier.
I had expected a little more unanimity among the various responses, but I have discovered that everyone has a different perspective. At this point, it appears that no matter what I might choose, it has a good chance of displeasing the majority.
It's a community, not a centrally governed system with strict rules :-) Releasing useful, tested code publicly already gives you a lot of points, so it's unlikely to upset anyone unless you do so without thought in a top-level namespace; and it seems like you're putting a whole lot of thought into it. If you want to play it safe, start off in an X::Y::Z namespace; I think Thoughtstream gave some good advice in that respect above.
good chance of displeasing the majority
Well, there hopefully is a difference between not pleasing them and explicitly displeasing.
That said, what about Regexp::UTF8::Thai (Update: or Regexp::Thai::UTF8)?