note
graff
[tye] has covered most of the important points. I'll just add that for your first code snippet to DWYM (do what you mean), it would have to look something like this (note the addition of "use Encode", the I/O layer set on STDOUT, and "decode" applied to the literals being assigned to @words):
<c>
#!perl
use strict;
use warnings;
use Encode;
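# Encode everything printed to STDOUT as cp1252, to suit the terminal.
binmode STDOUT, ":encoding(cp1252)";
my $pattern = qr/\A\w+\z/;
# The literals are cp1252 bytes (this assumes the script file itself is
# saved as cp1252), so decode them into perl character strings.
my @words = map { decode( "cp1252", $_ ) } qw( Tšekissä Žena Œdipus Rex );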
for my $word (@words) {
    my $result = $word =~ $pattern ? "matches" : "doesn't match";
    printf qq/The word "%s" %s the pattern %s\n/, $word, $result, $pattern;
}
</c>
When I run that in a terminal that is using cp1252 (aka "Windows Latin1"), the resulting output is:
<c>
The word "Tšekissä" matches the pattern (?-xism:\A\w+\z)
The word "Žena" matches the pattern (?-xism:\A\w+\z)
The word "Œdipus" matches the pattern (?-xism:\A\w+\z)
The word "Rex" matches the pattern (?-xism:\A\w+\z)
</c>
UPDATE: To clarify, the point here is that when it comes to matching characters outside the ASCII range, regex constructs like '\w' will only employ Unicode semantics, not cp1252 or any other encoding's semantics. They therefore need to operate on strings that have perl's internal utf8 flag set (i.e. strings that have been decoded from their "external" form, whether by reading through the appropriate I/O layer or by explicit decoding).
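To make that difference concrete, here is a minimal sketch that runs the same pattern against the raw cp1252 bytes and against the decoded string. The helper name is_word is my own invention for illustration, and the bytes are written as hex escapes so the result doesn't depend on how the script file is saved. It also assumes perl's default regex semantics (no "use feature 'unicode_strings'", no "use locale", no /u modifier):
<c>
#!perl
use strict;
use warnings;
use Encode;

# Hypothetical helper (not part of the original snippet): true if the
# string consists entirely of \w characters.
sub is_word {
    my ($str) = @_;
    return $str =~ /\A\w+\z/ ? 1 : 0;
}

# The cp1252 bytes for "Tšekissä": 0x9A is "š" and 0xE4 is "ä".
my $bytes = "T\x9Aekiss\xE4";

# Decoding yields a perl character string (utf8 flag on).
my $chars = decode( "cp1252", $bytes );

printf "raw bytes: %s\n", is_word($bytes) ? "matches" : "doesn't match";
printf "decoded:   %s\n", is_word($chars) ? "matches" : "doesn't match";
</c>
Under those assumptions I'd expect the raw bytes not to match and the decoded string to match, since only the decoded string gets Unicode rules applied to its non-ASCII characters.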
965817