
Re: Modern best practices for multilingual regexp alphabetical character matching?

by moritz (Cardinal)
on Jan 12, 2009 at 21:56 UTC

in reply to Modern best practices for multilingual regexp alphabetical character matching?

There are a few best practices, which might or might not answer your question:
  • Don't match character ranges. You will forget some. For example, is there a good reason to match [a-zA-Z], but not all those other Latin characters out there? Unicode contains more than 100k characters. Enumerating a subset of them is bound to fail, unless you have very narrow ideas about your subset.
  • Don't match Unicode blocks. They are just organizational units, nothing that the user or programmer should ever care about.
  • If you want to check for letters, digits, etc., use the appropriate Unicode property (a list can be found in perlunicode), like \p{LowercaseLetter} or the short form \p{Ll} (though the longer form is probably more readable).
  • If you want to check for a script, use constructs like \p{Hiragana}.
  • Remember that there might be diacritic markings that belong conceptually to a different script, so instead of \p{YourScript}+ you might want to check for \p{YourScript}(?:\p{Mark}|\p{YourScript})*.
  • When counting characters, use \X rather than . in regexes, so that a base character together with its combining marks counts once.

(Disclaimer: I assume you deal with human language. For file formats or other artificial stuff it may very well be appropriate to do things that I recommended against above).
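The points above can be tried out with a small sketch. The sample strings are hypothetical and written with escapes so that the source stays plain ASCII; Latin and Hiragana stand in for \p{YourScript}. Whether a bare \p{Latin}+ would also accept the combining accents may depend on the Perl and Unicode version, so only the mark-aware pattern is shown.

```perl
use strict;
use warnings;

# Hypothetical sample strings, written with escapes so the source stays ASCII.
my $word  = "\x{C4}rger";                        # A-with-diaeresis + "rger"
my $kana  = "\x{3072}\x{3089}\x{304C}\x{306A}";  # four Hiragana letters
my $combo = "e\x{301}te\x{301}";                 # e + combining acute, t, e + combining acute

# General category properties instead of character ranges.
my $starts_upper = $word =~ /\A\p{Lu}/       ? 1 : 0;
my $all_letters  = $word =~ /\A\p{L}+\z/     ? 1 : 0;
my $ascii_only   = $word =~ /\A[a-zA-Z]+\z/  ? 1 : 0;   # the range misses U+00C4

# Script property.
my $is_hiragana = $kana =~ /\A\p{Hiragana}+\z/ ? 1 : 0;

# Letters of one script, interleaved with combining marks.
my $latin_with_marks =
    $combo =~ /\A\p{Latin}(?:\p{Mark}|\p{Latin})*\z/ ? 1 : 0;

# Counting: . sees code points, \X sees grapheme clusters.
my $code_points = () = $combo =~ /./gs;   # 5
my $graphemes   = () = $combo =~ /\X/g;   # 3

printf "upper=%d letters=%d ascii=%d hiragana=%d latin+marks=%d\n",
    $starts_upper, $all_letters, $ascii_only, $is_hiragana, $latin_with_marks;
printf "code points=%d graphemes=%d\n", $code_points, $graphemes;
```

The last two lines are the reason for the \X advice: the string holds five code points, but a reader would count three characters.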

Re^2: Modern best practices for multilingual regexp alphabetical character matching?
by mea (Initiate) on Jan 26, 2009 at 00:22 UTC

    Dear Monks,

    Sorry to introduce myself by hijacking an old thread, but I have some related questions. I am a complete beginner and this topic confuses me the most. I didn't notice the problem until I used some automatic match variables ($` $& $') and parentheses. The output encoding, which had been fine until then, broke. Following your advice, and with some trial and error, I found that adding

    while (<>) { $_ = Encode::decode_utf8( $_ ); binmode STDOUT, ":utf8";
    to the input loop corrects the encoding. It is strange that without these lines on the input, everything "looks" fine unless I use parentheses or automatic match variables. Is the encoding wrong all the way through and somehow gets corrected on output? Or is it correct, and the automatic variables and parentheses break it? Considering that I work only with utf-8 files, should I make a habit of adding these lines every time I read input?

    Best regards,


      Everything "looks" fine until you try to extract substrings in some way. That's because without decoding your data on input, the strings are handled as sequences of bytes, so a non-ASCII character (an accented letter, for example) translates to two bytes.

      Now if you extract part of a string without decoding it first, you can accidentally rip apart these two bytes, leaving behind encoding garbage, which is usually not a good idea.
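A minimal sketch of that effect, assuming a hypothetical four-byte sample (the UTF-8 encoding of a three-letter word with an a-umlaut in the middle). It uses decode and encode from the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# 0xC3 0xA4 is the UTF-8 encoding of U+00E4 (a with diaeresis).
# An undecoded read from a UTF-8 file hands you exactly these bytes.
my $bytes = "B\xC3\xA4r";                 # hypothetical sample, 4 bytes

my $chars = decode('UTF-8', $bytes);      # the same text as 3 characters

my $byte_len = length $bytes;             # 4
my $char_len = length $chars;             # 3

# "The first two characters" of the undecoded string cuts the
# two-byte sequence in half and leaves a dangling 0xC3.
my $broken = substr($bytes, 0, 2);

# The same operation on the decoded string keeps the letter whole;
# encoding it again yields a complete, valid UTF-8 sequence.
my $good = encode('UTF-8', substr($chars, 0, 2));

printf "bytes=%d chars=%d broken=%vX good=%vX\n",
    $byte_len, $char_len, $broken, $good;
```

The undecoded string only "looks" fine as long as it is passed through untouched; any operation that counts or slices characters works on the wrong units.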

      So I recommend properly decoding UTF-8 (and other character encodings) on input, and encoding the strings on output. Also add use utf8; if you have string constants in your source code.
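One way to follow that recommendation is to let PerlIO layers do the decoding and encoding. The sketch below is only an illustration: it reads from and writes to in-memory scalars so that it runs without any external file, and the sample bytes are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input: the UTF-8 bytes of a three-letter word containing
# u-with-diaeresis (0xC3 0xBC), followed by a newline.
my $input_bytes = "f\xC3\xBCr\n";

# Decode on input: the :encoding(UTF-8) layer turns bytes into characters.
open my $in, '<:encoding(UTF-8)', \$input_bytes or die "open: $!";
my $line = <$in>;
close $in;
chomp $line;

# Work on characters: capture the first two of them.
my ($first_two) = $line =~ /\A(..)/;

# Encode on output: the same layer turns characters back into bytes.
my $output_bytes = '';
open my $out, '>:encoding(UTF-8)', \$output_bytes or die "open: $!";
print {$out} $first_two;
close $out;

printf "chars read=%d bytes written=%d\n",
    length $line, length $output_bytes;
```

In a real script the same layer can be placed on the standard handles once, near the top, for example with binmode(STDOUT, ':encoding(UTF-8)'), instead of decoding each line by hand.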

        Thanks for the answer.

        So basically I am safe as long as I use these two lines, and "use utf8;" at the top of my script every time. This has really been the most confusing thing so far. The otherwise excellent "Learning Perl" doesn't mention these problems at all, and some of its examples don't work correctly with utf-8 characters, which is fine for English-speaking beginners, but people working with other languages have to deal with this issue right from the start. It could have saved me a lot of time if it had simply mentioned "for non-English languages or utf-8 add this to your script". Well, at least now I know and can go back to learning the "proper" stuff... Thanks again,

        Best Regards,

