http://www.perlmonks.org?node_id=376768

december has asked for the wisdom of the Perl Monks concerning the following question:

Hello, fellow seekers of enlightenment,

I'm trying to construct a simple regex that checks if a variable contains only characters valid in a unix path. The regex works as it should when there are no umlauts in the string, but when testing different inputs, I noticed it refuses to match any umlauts. What bugs me is that it does match the exact same string when I use a variable, but not when handed down by $ENV{'PATH_TRANSLATED'} - which probably is a non-encoded 8bit string. A shortened example:

$testString = "/usr/home/december/public_html/experiments/html/files/blëh.txt";
$fileAsked = $ENV{'PATH_TRANSLATED'};
print "Trying with: $testString\n";
print "Trying with: $fileAsked\n";
print "VALID1\n" if ($testString =~ /^([\w\s\/.]+)$/);
print "VALID2\n" if ($fileAsked =~ /^([\w\s\/.]+)$/);
print "SUCCEEDED1\n" if (utf8::upgrade($testString));
print "SUCCEEDED2\n" if (utf8::upgrade($fileAsked));
print "VALID3\n" if ($testString =~ /^([\w\s\/.]+)$/);
print "VALID4\n" if ($fileAsked =~ /^([\w\s\/.]+)$/);

prints:

Trying with: /usr/home/december/public_html/experiments/html/files/blëh.txt
Trying with: /usr/home/december/public_html/experiments/html/files/blëh.txt
SUCCEEDED1
SUCCEEDED2
VALID3

Note that both the strings and the regexes are exactly the same, but after conversion, one matches and the other doesn't. I suspect some utf8 problem, or a wrong charset being used for \w. Perl version is 5.8.3.

How do I make \w match umlauts consistently? Do I need to set a locale even for utf8? This behavior doesn't seem logical to me.

Re: problems matching umlauts in env vars
by borisz (Canon) on Jul 23, 2004 at 01:43 UTC
    Your problem is, IMHO, that your locale is already in utf8. This means that your env var is in utf8 but your $testString is in latin1. If this is the case, you need to use Encode; and convert the bytes from your environment to utf8.
    $fileAsked = Encode::decode(utf8 => $ENV{'aa'});
    Boris

      It's not in utf. When I encode it like you told me, it can't find the file anymore (the ë has now changed to a utf sequence when I print the variable, and the filesystem doesn't like utf filenames). Both strings seem to be iso-8859-1, probably just plain 8bit. When I update it to utf, though, it's the DOT it doesn't want to match on - not the umlaut.

      /^([\w\s.]+)$/   # unescaped
      /^([\w\s\.]+)$/  # or escaped

      Is there something wrong with the dot in the regex?

        No, inside the [] a . is the same as \.
        Boris
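
A quick way to check this is to run both patterns from the question above against the same inputs. This is only a sketch; the sample file names are made up for illustration.

```perl
use strict;
use warnings;

# Inside a bracketed character class, '.' is a literal dot, so escaping it
# changes nothing.
my $unescaped = qr/^([\w\s.]+)$/;
my $escaped   = qr/^([\w\s\.]+)$/;

# Illustrative inputs: a dot, a space (covered by \s), and a comma
# (not in the class, so neither pattern should match it).
for my $name ('bleh.txt', 'bleh txt', 'bleh,txt') {
    my $u = $name =~ $unescaped ? 'match' : 'no match';
    my $e = $name =~ $escaped   ? 'match' : 'no match';
    print "$name: unescaped=$u escaped=$e\n";
}
```

Both columns should agree for every input, which suggests that a failure to match lies elsewhere in the string rather than with the dot.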
Re: problems matching umlauts in env vars
by allolex (Curate) on Jul 23, 2004 at 07:20 UTC

    You need to define a locale that contains ä/ö/ü for \w to include them. You need to do this even for UTF-8. UTF-8 is just a standard way of representing characters, not the set of characters that can make up words in a particular language.

    use locale;
    use POSIX 'locale_h';
    # German locale, for example. Run 'locale -a' to get the exact locale name
    my $loc = 'de_DE.utf8';
    setlocale(LC_CTYPE, $loc) or die "Invalid locale $loc";

    Either that, or use this little trick off of my home node: [A-Za-zÀ-ÿœŒ] instead of \w :)

    I probably should add that the German locale will likely not match 'ë', since it does not exist in German. Maybe Dutch or French...

    --
    Damon Allen Davison
    http://www.allolex.net

      Thanks for your reply. I have set the locale now, and that solves at least this problem.

      German locale should be using the iso-8859-1 (or rather iso-8859-15) charset, which does contain an e with umlauts. Standard French language doesn't have umlauts, but Dutch (my native language) does. Either way, all Western European countries use the same charset, which should be iso-8859-15 (that's latin1 plus euro).

      The problem now is that I don't know which charset will be given to me in the request... Could be pretty much anything.

Re: problems matching umlauts in env vars
by beable (Friar) on Jul 23, 2004 at 01:25 UTC
    Here are the results I got:
    Trying with: /usr/home/december/public_html/experiments/html/files/blëh.txt
    Trying with: /usr/home/december/public_html/experiments/html/files/blëh.txt
    SUCCEEDED1
    SUCCEEDED2
    VALID3
    VALID4

    I'd suggest you add a line like this as the third line of your program, to check that the strings are the same:

    die "strings are different!" if ($testString ne $fileAsked);
      They are equal (tested with 'eq'). Yet one matches, and the other doesn't - it's the same regex. How bizarre. >:-|
Re: problems matching umlauts in env vars
by graff (Chancellor) on Jul 24, 2004 at 08:29 UTC
    You said:
    it does match the exact same string when I use a variable, but not when handed down by $ENV{'PATH_TRANSLATED'} - which probably is a non-encoded 8bit string.
    (emphasis added). Meanwhile, the docs for "utf8" have this to say about the "upgrade" call:
    Note that this should not be used to convert a legacy byte encoding to Unicode: use Encode for that.
    So, if your environment variable's value is actually set via some single-byte European character encoding ("Latin1"), then just passing it to utf8::upgrade amounts to just calling it utf8 when it really is not. The upgrade call returns the number of octets in the "converted" string (which doesn't really get converted -- it just gets its utf8 flag turned on, I think). So you'll get a non-zero return unless the string is completely empty.

    (I confess I'm a bit confused by the docs for "utf8::upgrade" -- especially its behavior wrt "characters in the range 0x80-0xFF". There are odd things about this range and its treatment in perl 5.8 that I still need to understand better.)
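
To see what the call does to a character in that range, a small experiment along these lines may help. It is a sketch, assuming a hard-coded Latin-1 byte string stands in for the environment variable; the variable names are made up.

```perl
use strict;
use warnings;

# "bl\xEBh" is the Latin-1 byte string for "blëh" (0xEB is e with diaeresis).
my $bytes    = "bl\xEBh";
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# The two strings still compare equal; only the internal storage differs.
print "equal\n" if $bytes eq $upgraded;
print "flag before: ", (utf8::is_utf8($bytes)    ? 1 : 0), "\n";
print "flag after:  ", (utf8::is_utf8($upgraded) ? 1 : 0), "\n";

# Whether \w matches 0xEB in the non-upgraded copy depends on the perl
# version, locale and pragmas in effect; the upgraded copy is matched
# with Unicode rules.
print "bytes match\n"    if $bytes    =~ /^\w+$/;
print "upgraded match\n" if $upgraded =~ /^\w+$/;
```

Two strings that test equal with 'eq' can therefore still differ in their internal flag, and that flag can influence how \w treats characters in the 0x80-0xFF range.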

    Anyway, try this:

    use Encode;
    # ...
    $fileAsked = decode( "iso8859-1", $fileAsked);
    and then see whether "VALID4" shows up. Check the Encode man page for more options (e.g. trapping character conversion failures using eval).
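
As a rough sketch of trapping conversion failures with eval: try a strict UTF-8 decode first and fall back to ISO-8859-1 when that fails. The helper name decode_path is hypothetical, and the assumption that the input is one of exactly these two encodings is mine; real requests may use others.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a byte string of unknown encoding.
# Assumption: the input is either UTF-8 or ISO-8859-1; anything that is
# not valid UTF-8 is treated as ISO-8859-1.
sub decode_path {
    my ($bytes) = @_;
    my $text = eval {
        # FB_CROAK dies on malformed input; LEAVE_SRC keeps $bytes intact.
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text if defined $text;
    return decode('iso-8859-1', $bytes);
}

# The same file name as Latin-1 bytes and as UTF-8 bytes.
for my $raw ("bl\xEBh.txt", "bl\xC3\xABh.txt") {
    my $name = decode_path($raw);
    print "valid\n" if $name =~ /^[\w\s\/.]+$/;
}
```

The decoded copy is only meant for validation here. Keeping the original bytes around for the actual file access avoids the situation where the file can no longer be found after conversion.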

      Yeah, it does, thanks. I don't understand how perl succeeds in converting the 8bit string in PATH_TRANSLATED to the one given in decode though... I've tested several browsers; some send utf-8, and some seem to send iso-8859-1 or iso-8859-15. The $fileAsked variable could be in any charset, really.

      Oh well... I hope those internationalisation issues will sort themselves out in the next couple of years, not just for Perl, but for all software really.