hotshot has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys!

I'm checking the overhead of supporting Unicode in my Perl project. As far as I have seen so far, without using any Unicode module (utf8), Perl just "gives what she gets". For example, when I use opendir to list the directories under a given directory, and some of them have Korean or German names (in UTF-8), Perl receives the names and displays them properly.

The problem starts when I try to manipulate a directory name with a regular expression. Does this mean I'll have to change all my regexps (endless regexps) to support Unicode (using IsAlnum and '-' for \w, for example)? Will the regexps become much more complicated (longer) and lose some of the power of the old ones?


Edited: ~Wed Oct 30 16:38:08 2002 (GMT) by footpad: Retitled (was Unicode), added <P> tags, and fixed minor spelling errors - per Consideration

Re: Unicode and regexes
by dakkar (Hermit) on Oct 30, 2002 at 17:51 UTC

    The regexps, per se, don't need any change (I'm assuming Perl 5.8.0, since 5.6.x had some problems). You need to ensure two things:

    1. that your strings are correctly encoded
    2. that Perl knows it

    The first is a problem in itself, but a bit off-topic.

    The second can be done in two ways:

    1. if the strings come from a filehandle, you can use something like open(FH, "<:utf8", "file") to tell Perl to treat the data as UTF-8 (or use the :encoding layer, see perldoc -f open)
    2. otherwise (as in your example, where the names come from a dirhandle), use Encode; and $string=Encode::decode("utf-8",$string);
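
    A minimal sketch of both approaches, assuming Perl 5.8 or later. The directory name "Müller" and the variable names are made up for illustration; the byte string stands in for what readdir would hand back on a UTF-8 filesystem, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes as readdir() might return them for a directory named "Müller"
# on a UTF-8 filesystem (hypothetical name, for illustration only).
my $raw = "M\xC3\xBCller";

# Undecoded, Perl sees seven separate bytes.
my $raw_len = length $raw;

# Way 2: decode explicitly, so Perl knows the string is six characters.
my $name     = decode("UTF-8", $raw);
my $name_len = length $name;

# The regexp itself is unchanged: plain \w+ covers the whole decoded name.
my $decoded_ok = ($name =~ /^\w+$/) ? 1 : 0;

# On the undecoded bytes the same regexp would normally not match,
# because the two bytes of "ü" are not treated as word characters.
my $raw_ok = ($raw =~ /^\w+$/) ? 1 : 0;

# Way 1: let an I/O layer do the decoding while reading.
# An in-memory filehandle is used here instead of a real file.
open(my $fh, "<:encoding(UTF-8)", \$raw) or die "cannot open: $!";
my $line = <$fh>;
close($fh);
my $line_len = length $line;

print "raw length:     $raw_len\n";
print "decoded length: $name_len\n";
print "layer length:   $line_len\n";
print "decoded matches \\w+: $decoded_ok\n";
```

    The point of the sketch is that only the input handling changes: once the string is decoded, the existing \w-based regexps can stay as they are.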
      And what if I'm still using Perl 5.6.1?