http://www.perlmonks.org?node_id=735804

dmorgo has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks,

I slipped into a deep meditation for the last several months, and neglected to stay informed about the most recent best practices for pattern matching of alphabetic characters in multiple languages. Thus I have returned and am seeking any enlightenment you may be able to provide.

In English, one can say:

if (/^(\w+)$/) { print "found [$1]\n"; }

or if you don't want underscores:

if (/^([A-Za-z]+)$/) { print "found [$1]\n"; }

Then I seem to recall this is supported, but maybe not on older Perls (not a problem; I have a newish Perl):

if (/^([[:alpha:]]+)$/) { print "found [$1]\n"; }

I'm sure it's a FAQ, but I'm looking for the latest up-to-date best practices on this FAQ. The question is: will the above work for any alphabetical language? Or does it only work for the language of my current locale setting (whatever that is - I've been fuzzy on that ever since learning that one legal setting for locale is 'C' -- odd).

Or should I hand-construct regular expressions for each language using a list of characters from that language?

The data I'm working with will all be UTF-8, if that makes a difference.

Re: Modern best practices for multilingual regexp alphabetical character matching?
by ikegami (Patriarch) on Jan 12, 2009 at 21:40 UTC

    The data I'm working with will all be UTF-8, if that makes a difference.

    Make sure it's decoded using one or more of the following

    use utf8;                                # Treat the source code as UTF-8
    use open ':std', ':locale';              # Treat STD* as per locale
    use open ':std', ':encoding(UTF-8)';     # Treat STD* as UTF-8
    use open IO => ':encoding(UTF-8)';       # Treat files as UTF-8 by default
    open(my $fh, '<:encoding(UTF-8)', $qfn)  # Treat a file as UTF-8
        or die;
    utf8::decode(my $text = $encoded_text)   # Treat a string as UTF-8
        or die;

    And make sure the string is stored internally as UTF-8.

    utf8::upgrade($s); # Use UNICODE semantics

    (No need to do use utf8; to use utf8:: functions. use utf8; means the source is in UTF-8.)

    If you do those two things, regexp will use UNICODE semantics, so \w and character classes will match accented letters, etc.
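    A minimal sketch of the difference decoding makes, assuming a hard-coded byte string in place of real input (the variable names are illustrative, not from the post above). It runs \w+ against the raw bytes and then against the decoded text:

```perl
use strict;
use warnings;

# Raw UTF-8 bytes for "café", as they would arrive from an undecoded file.
my $bytes = "caf\xC3\xA9";

# Undecoded: the two bytes of "é" are two separate units. Under plain byte
# semantics (no locale, no 'unicode_strings' feature) bytes above 0x7F are
# not word characters, so \w+ stops after "caf".
my ($from_bytes) = $bytes =~ /^(\w+)/;

# Decoded: copy the bytes, then decode the copy in place.
# utf8::decode returns false if the input is not valid UTF-8.
utf8::decode(my $text = $bytes) or die "input is not valid UTF-8";
my ($from_text) = $text =~ /^(\w+)$/;

printf "bytes: matched %d of %d units\n",      length($from_bytes), length($bytes);
printf "text:  matched %d of %d characters\n", length($from_text),  length($text);
```

    Only the decoded string lets \w+ cover the whole word, which is the behaviour described above.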

Re: Modern best practices for multilingual regexp alphabetical character matching?
by moritz (Cardinal) on Jan 12, 2009 at 21:56 UTC
    There are a few best practices, which might or might not answer your question:
    • Don't match character ranges. You will forget some. For example, is there a good reason to match [a-zA-Z], but not all those other Latin characters out there? Unicode contains more than 100k characters. Enumerating a subset of them is bound to fail, unless you have very narrow ideas about your subset.
    • Don't match Unicode blocks. They are just organizational units, nothing that the user or programmer should ever care about.
    • If you want to check for letters, digits, etc., use the appropriate Unicode property (a list can be found in perlunicode), like \p{LowercaseLetter} or the short form \p{Ll} (though the longer form is probably more readable).
    • If you want to check for a script, use constructs like \p{Hiragana}.
    • Remember that there might be diacritic markings that belong conceptually to a different script, so instead of \p{YourScript}+ you might want to check for \p{YourScript}(?:\p{Mark}|\p{YourScript})*.
    • When counting characters, use \X rather than . in regexes.

    (Disclaimer: I assume you deal with human language. For file formats or other artificial stuff it may very well be appropriate to do things that I recommended against above).
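    The points above can be sketched as follows. This is an illustration only: the sample strings are made up, and the script test is spelled \p{Script=Latin} rather than the short \p{Latin} on the assumption that newer Perls may resolve the short form through Script_Extensions, which could change how combining marks are classified.

```perl
use strict;
use warnings;

# Build decoded strings from code points so nothing depends on source encoding.
my $decomposed = "e\x{301}";                # "e" + COMBINING ACUTE ACCENT
my $greek      = "\x{3B1}\x{3B2}\x{3B3}";   # Greek small alpha, beta, gamma

# General category: lowercase letters in any script, not just a-z.
my $all_lower = $greek =~ /^\p{Ll}+$/ ? 1 : 0;

# Script alone rejects the combining accent (its Script value is Inherited);
# allowing \p{Mark} after the first letter accepts the whole string.
my $script_only  = $decomposed =~ /^\p{Script=Latin}+$/ ? 1 : 0;
my $script_marks =
    $decomposed =~ /^\p{Script=Latin}(?:\p{Mark}|\p{Script=Latin})*$/ ? 1 : 0;

# Counting: "." counts code points, \X counts user-perceived characters.
my $code_points = () = $decomposed =~ /./gs;
my $graphemes   = () = $decomposed =~ /\X/g;

printf "Ll: %d, script only: %d, script plus marks: %d\n",
    $all_lower, $script_only, $script_marks;
printf "code points: %d, graphemes: %d\n", $code_points, $graphemes;
```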

      Dear Monks,

      Sorry to introduce myself by hijacking an old thread, but I have some related questions. I am a complete beginner and this topic confuses me the most. I didn't realize the problem until I used some automatic match variables ($` $& $') and parentheses. The output encoding, which was fine until then, broke. Following your advice, and with some trial and error, I found that adding:

      while (<>) { $_ = Encode::decode_utf8( $_ ); binmode STDOUT, ":utf8";
      to the input corrects the encoding. It is strange that without these lines on the input, everything "looks" fine unless I use parentheses or automatic match variables. Is the encoding wrong all the way through and somehow corrected on output? Or is it correct, and the automatic variables and parentheses break it? Considering that I work only with UTF-8 files, should I make a habit of adding these lines every time I read input?

      Best regards,

      Martin

        Everything "looks" fine until you try to extract substrings in some way. That's because without decoding your data on input the strings are handled as sequences of bytes, so a character like ä translates to two bytes.

        Now if you extract some part of a string without decoding it first, you can accidentally rip apart these two bytes, leaving behind encoding garbage - usually not a good idea.

        So I recommend properly decoding UTF-8 (and other character encodings) on input, and encoding the strings on output. And use utf8; if you have string constants in your source code.
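        A small sketch of that failure mode and the fix, assuming a hard-coded byte string standing in for a line of input and the core Encode module (the variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes    = "\xC3\xA4bc";          # UTF-8 bytes for "äbc"
my $byte_len = length $bytes;         # counts bytes, because nothing was decoded

# Without decoding, substr also counts bytes, so taking "one character"
# keeps only the first half of the two-byte sequence for "ä".
my $broken = substr($bytes, 0, 1);    # "\xC3", an incomplete UTF-8 sequence

# Decode on input, work with characters, encode again on output.
my $text  = decode('UTF-8', $bytes);
my $first = substr($text, 0, 1);      # the whole character "ä" (U+00E4)
my $out   = encode('UTF-8', $first);  # back to the two bytes "\xC3\xA4"

printf "undecoded length: %d, decoded length: %d\n", $byte_len, length($text);
printf "re-encoded first character: %d bytes\n", length($out);
```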

Re: Modern best practices for multilingual regexp alphabetical character matching?
by JavaFan (Canon) on Jan 12, 2009 at 21:03 UTC
    The data I'm working with will all be UTF-8, if that makes a difference.
    That makes a huge difference. \w will match just the 26 letters (both cases), 10 digits and the underscore if your strings aren't in UTF-8 format (unless you have a locale). Otherwise, it will match anything that's a Unicode letter, Unicode digit or underscore.

    However, since there's this dependency on whether the string you match against is in UTF-8 format or not, I'd shy away from \w. Instead, use \p{L}, which will match any character the Unicode standard says is a letter. So, you'd get:

    if (/^(\p{L}+)$/) {say "found [$1]"}
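    A rough way to see that dependency, assuming a Perl where the unicode_strings feature is not enabled and no locale is in effect. The same character is tested in both internal representations; the variable names are illustrative.

```perl
use strict;
use warnings;

# The same character, e-acute (U+00E9), in both internal representations.
my $native   = "\xE9";        # not upgraded: stored as a single byte
my $upgraded = "\xE9";
utf8::upgrade($upgraded);     # same character, stored internally as UTF-8

# \w depends on the internal format of the string being matched ...
my $w_native   = $native   =~ /^\w+$/ ? 1 : 0;
my $w_upgraded = $upgraded =~ /^\w+$/ ? 1 : 0;

# ... while a Unicode property gives the same answer for both.
my $l_native   = $native   =~ /^\p{L}+$/ ? 1 : 0;
my $l_upgraded = $upgraded =~ /^\p{L}+$/ ? 1 : 0;

printf "native:   \\w=%d \\p{L}=%d\n", $w_native,   $l_native;
printf "upgraded: \\w=%d \\p{L}=%d\n", $w_upgraded, $l_upgraded;
```

    The two strings compare equal with eq, so the differing \w result is easy to miss in practice.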