dmorgo has asked for the wisdom of the Perl Monks concerning the following question:
I slipped into a deep meditation for the last several months, and neglected to stay informed about the most recent best practices for pattern matching of alphabetic characters in multiple languages. Thus I have returned and am seeking any enlightenment you may be able to provide.
In English, one can say:
if (/^(\w+)$/) { print "found [$1]\n"; }
or if you don't want underscores:
if (/^([A-Za-z]+)$/) { print "found [$1]\n"; }
Then I seem to recall this is supported, but maybe not on older Perls (not a problem; I have a newish Perl):
if (/^([[:alpha:]]+)$/) { print "found [$1]\n"; }
I'm sure it's a FAQ, but I'm looking for the current best practice. The question is: will the above work for any alphabetic language? Or does it only work for the language of my current locale setting (whatever that is; I've been fuzzy on that ever since learning that one legal locale setting is 'C', which seems odd)?
Or should I hand-construct regular expressions for each language using a list of characters from that language?
The data I'm working with will all be UTF-8, if that makes a difference.
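To make the question concrete, here is a minimal sketch of the locale-independent approach I have seen suggested: decode the UTF-8 bytes into a character string first, then match against the Unicode "Letter" property \p{L} instead of relying on the locale. The helper name all_letters and the sample inputs are hypothetical, chosen only for illustration, and I am assuming the input arrives as raw UTF-8 bytes rather than already-decoded text.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes a UTF-8 encoded byte string and returns the
# decoded word if it consists only of Unicode letters, otherwise undef.
sub all_letters {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);    # bytes -> characters
    # \p{L} matches a letter in any script; it excludes digits and underscore.
    return $text =~ /^(\p{L}+)$/ ? $1 : undef;
}

binmode STDOUT, ':encoding(UTF-8)';

# Sample inputs written as raw UTF-8 bytes (assumed data):
# "hello", French "ete" with acute accents, Russian "mir", and two non-matches.
for my $bytes ("hello", "\xC3\xA9t\xC3\xA9", "\xD0\xBC\xD0\xB8\xD1\x80",
               "foo_bar", "abc123") {
    my $word = all_letters($bytes);
    print defined $word ? "found [$word]\n" : "no match\n";
}
```

Note that \p{L} does not match combining marks, so text stored in decomposed form (a base letter followed by a separate accent character) may need \p{M} added to the class or normalization beforehand.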
Replies are listed 'Best First'.
Re: Modern best practices for multilingual regexp alphabetical character matching?
by ikegami (Patriarch) on Jan 12, 2009 at 21:40 UTC
Re: Modern best practices for multilingual regexp alphabetical character matching?
by moritz (Cardinal) on Jan 12, 2009 at 21:56 UTC
by mea (Initiate) on Jan 26, 2009 at 00:22 UTC
by moritz (Cardinal) on Jan 26, 2009 at 07:27 UTC
by mea (Initiate) on Jan 26, 2009 at 09:59 UTC
Re: Modern best practices for multilingual regexp alphabetical character matching?
by JavaFan (Canon) on Jan 12, 2009 at 21:03 UTC