Modern best practices for multilingual regexp alphabetical character matching?
by dmorgo (Pilgrim) on Jan 12, 2009 at 20:55 UTC
dmorgo has asked for the wisdom of the Perl Monks concerning the following question:
I slipped into a deep meditation for the last several months, and neglected to stay informed about the most recent best practices for pattern matching of alphabetic characters in multiple languages. Thus I have returned and am seeking any enlightenment you may be able to provide.
In English, one can say:
or if you don't want underscores:
Then I seem to recall this is supported, but maybe not on older Perls (not a problem; I have a newish Perl):
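For illustration, here is a minimal sketch of the three kinds of pattern described above. The specific patterns (`\w`, `[A-Za-z]`, and the Unicode property `\p{L}`) and the sample string are assumptions chosen for the example, not necessarily the ones from the original post.

```perl
use strict;
use warnings;

# Sample text built from escapes so the source file stays pure ASCII:
# "na<i-diaeresis>ve_caf<e-acute>", a Greek word, and a number.
my $text = "na\x{EF}ve_caf\x{E9} \x{395}\x{3BB}\x{3BB}\x{3AC}\x{3B4}\x{3B1} 42";

my @word_chars = $text =~ /(\w+)/g;        # letters, digits, and underscore
my @ascii_only = $text =~ /([A-Za-z]+)/g;  # unaccented ASCII letters only
my @any_letter = $text =~ /(\p{L}+)/g;     # any Unicode letter, no digits or underscore

binmode STDOUT, ':encoding(UTF-8)';
print "\\w:       ", join('|', @word_chars), "\n";
print "[A-Za-z]: ", join('|', @ascii_only), "\n";
print "\\p{L}:    ", join('|', @any_letter), "\n";
```

On this sample, the ASCII class splits the accented words into fragments (`na|ve|caf`) and misses the Greek word entirely, while `\p{L}` keeps each word whole without consulting the locale.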
I'm sure it's a FAQ, but I'm looking for the current best practices. The question is: will the above work for any language written with an alphabet? Or does it only work for the language of my current locale setting (whatever that is; I've been fuzzy on that ever since learning that one legal setting for locale is 'C', which seems odd)?
Or should I hand-construct regular expressions for each language using a list of characters from that language?
The data I'm working with will all be UTF-8, if that makes a difference.
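Whether the data is UTF-8 may well matter, because a pattern matches against whatever Perl thinks the characters are. A minimal sketch, assuming the input arrives as raw bytes and is decoded with the core Encode module; the byte string below is a made-up stand-in for real input:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: the UTF-8 bytes for "cafe" with an acute accent
# on the final e (that letter is the two bytes 0xC3 0xA9).
my $bytes = "caf\xC3\xA9";

# Undecoded, each byte is treated as its own character, so the string is
# five characters long and the last one (0xA9) is not a letter.
my $raw_ok = $bytes =~ /^\p{L}+$/ ? 1 : 0;

# Decoded, the two bytes become one character, U+00E9, which is a letter.
my $chars      = decode('UTF-8', $bytes);
my $decoded_ok = $chars =~ /^\p{L}+$/ ? 1 : 0;

print "raw bytes:     length ", length($bytes), ", all letters? $raw_ok\n";
print "decoded chars: length ", length($chars), ", all letters? $decoded_ok\n";
```

When reading from a file, opening the handle with an `:encoding(UTF-8)` layer would do the same decoding as the data is read.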