utf weirdness in regex

december has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Could someone please explain the following output?

#!/usr/bin/perl -w

use strict;
use Encode;
use Data::Dumper;

my $string1 = "blëh";
my $string2 = "blëhh";
my $string3 = "blëh.txt";

$string1 = Encode::decode(utf8 => $string1);
$string2 = Encode::decode(utf8 => $string2);
$string3 = Encode::decode(utf8 => $string3);

$Data::Dumper::Useqq = 1;
print Dumper $string1, $string2, $string3;

print "matches1\n" if ($string1 =~ /^[\w\s.]+$/);
print "matches2\n" if ($string2 =~ /^[\p{Word}]+$/);
print "matches3\n" if ($string3 =~ /^[\p{L}\p{M}\p{N}.]+$/);

##### output #####
$VAR1 = "bl";
$VAR2 = "bl\x{fffd}hh";
$VAR3 = "bl\x{fffd}h.txt";
matches1

(Perl version is 5.8.4)

As far as I can see, string1 should not match but does (look at the weird Dumper output), while string2 and string3 don't match, but should.


Re: utf weirdness in regex by borisz (Canon) on Jul 23, 2004 at 08:24 UTC
Using `decode` here is very wrong. `decode` is for when you have a byte sequence that is in UTF-8 but Perl does not know it. Yours is in Latin-1, and it does not convert to valid UTF-8. Retry it with:
`$string1 = Encode::decode(utf8 => $string1, Encode::FB_CROAK);`
To convert everything to valid Unicode, try:
`$string1 = Encode::decode(latin1 => $string1, Encode::FB_CROAK);`
`$string2 = Encode::decode(latin1 => $string2, Encode::FB_CROAK);`
`$string3 = Encode::decode(latin1 => $string3, Encode::FB_CROAK);`
Boris
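To make the difference concrete, here is a minimal sketch. It assumes the script file was saved as Latin-1, so that "ë" is the single byte 0xEB; the variable names are made up for illustration. The byte string decodes cleanly as Latin-1 and then passes the Unicode-property regex from the question, while a strict decode of the same bytes as UTF-8 dies.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the file was saved as Latin-1, so "ë" is the single byte 0xEB.
my $octets = "bl\xEBh.txt";

# Decoding as Latin-1 maps byte 0xEB to the character U+00EB.
my $text = decode('latin1', $octets);

my $has_e_diaeresis = $text =~ /\x{EB}/ ? 1 : 0;
my $regex_matches   = $text =~ /^[\p{L}\p{M}\p{N}.]+$/ ? 1 : 0;

# The same bytes are not valid UTF-8, so a decode with FB_CROAK dies.
# decode() may modify its input when CHECK is set, so work on a copy.
my $copy    = $octets;
my $utf8_ok = eval { decode('utf8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;

printf "U+00EB present: %d, regex matches: %d, valid UTF-8: %d\n",
    $has_e_diaeresis, $regex_matches, $utf8_ok;
```

If your data really does arrive as UTF-8 bytes (say, from a file or a socket), then `decode(utf8 => ...)` is the right call; the point is only that the encoding name has to match the bytes you actually have.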
Re^2: utf weirdness in regex by december (Pilgrim) on Jul 24, 2004 at 04:35 UTC
Thanks, that looks a lot more like what I expected! In the future, I will use the CHECK argument to see if something went wrong in the conversion. I hope that will lift some of my initial confusion as to what is in which charset...
Re^3: utf weirdness in regex by graff (Chancellor) on Jul 24, 2004 at 08:43 UTC
Of course, when using FB_CROAK as the CHECK argument, you normally want to wrap it in an eval:
`my $encoding = "whatever";`
`my $octets = "characters in whatever encoding...";`
`eval { $_ = decode( $encoding, $octets, Encode::FB_CROAK ) };`
`if( $@ ) { report_an_error(); ... }`
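A runnable version of that pattern might look like the sketch below. The helper name `try_decode` is hypothetical (it is not part of Encode), and the sample byte strings are made up for illustration. It uses a block eval that returns a true value on success, which avoids relying on `$@` alone.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, not part of Encode: returns the decoded text,
# or undef if the octets are not valid in the named encoding.
sub try_decode {
    my ($encoding, $octets) = @_;   # $octets is a copy; decode may modify it when CHECK is set
    my $text;
    my $ok = eval { $text = decode($encoding, $octets, Encode::FB_CROAK); 1 };
    return $ok ? $text : undef;
}

my $good = try_decode('UTF-8', "bl\xC3\xABh");   # valid UTF-8 bytes for "blëh"
my $bad  = try_decode('UTF-8', "bl\xEBh.txt");   # Latin-1 bytes, not valid UTF-8

print defined $good ? "first string decoded\n"  : "first string failed\n";
print defined $bad  ? "second string decoded\n" : "second string failed\n";
```

Returning undef keeps the caller simple; if you need the error text, capture `$@` inside the helper right after the eval and pass it back as well.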
Re: utf weirdness in regex by hbo (Monk) on Jul 23, 2004 at 06:06 UTC
`/^[\w\s.]+$/` is equivalent to `/^.+$/`, right? I suspect the trailing period in the class is not what you intended. No idea about the Unicode fun with $string1, though. `"Even if you are on the right track, you'll get run over if you just sit there." - Will Rogers`
Re^2: utf weirdness in regex by december (Pilgrim) on Jul 24, 2004 at 04:28 UTC
No, it's only any_letter and spaces (I hope). It's supposed to check that a filename only consists of letters (as opposed to control characters, which I'm filtering for). The trailing period is supposed to be there for the dot in the filename. Unicode makes things hard. :)
Re^3: utf weirdness in regex by hbo (Monk) on Jul 24, 2004 at 05:52 UTC
But the period matches any character if it isn't escaped.
`/[\w\s\.]+/ # one or more word, space or period characters`
`/[\w\s.]+/ # one or more word, space or any characters`
`/.+/ # same as above`
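A quick check suggests otherwise: inside a bracketed character class, most regex metacharacters lose their special meaning, so an unescaped `.` in `[\w\s.]` is a literal period and the class is not the same as a bare `.`. A minimal sketch (the sample filenames are made up for illustration):

```perl
use strict;
use warnings;

my $class_re = qr/^[\w\s.]+$/;   # "." inside [...] is a literal period
my $any_re   = qr/^.+$/;         # "." outside a class matches any character except newline

my $with_hyphen = "report-2004.txt";   # "-" is not a word char, whitespace, or period
my $plain       = "bleh.txt";

my $class_hit_hyphen = $with_hyphen =~ $class_re ? 1 : 0;
my $any_hit_hyphen   = $with_hyphen =~ $any_re   ? 1 : 0;
my $class_hit_plain  = $plain       =~ $class_re ? 1 : 0;

print "class vs hyphen name: $class_hit_hyphen\n";
print "dot   vs hyphen name: $any_hit_hyphen\n";
print "class vs plain name:  $class_hit_plain\n";
```

If the two patterns were equivalent, the hyphenated name would pass both; it only passes the bare `.+` pattern, so the class in the original question does restrict the filename as intended.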
Re^4: utf weirdness in regex by kelan (Deacon) on Jul 26, 2004 at 21:12 UTC
