http://www.perlmonks.org?node_id=965830


in reply to Windows-1252 characters from \x{0080} thru \x{009f}

tye has covered most of the important stuff. I'll just add that in order for your first code snippet to DWYM, it would have to go something like this (note the addition of "use Encode", setting the io layer on STDOUT, and applying "decode" to the literals being assigned to @words):
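    #!perl
    use strict;
    use warnings;
    use Encode;

    # Encode character strings back to cp1252 bytes on output.
    binmode STDOUT, ":encoding(cp1252)";

    my $pattern = qr/\A\w+\z/;

    # Assumes this source file is saved as cp1252, so the literals are
    # raw cp1252 bytes that must be decoded before matching.
    my @words = map { decode( "cp1252", $_ ) } qw( Tšekissä Žena Œdipus Rex );

    for my $word (@words) {
        my $result = $word =~ $pattern ? "matches" : "doesn't match";
        printf qq/The word "%s" %s the pattern %s\n/, $word, $result, $pattern;
    }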
When I run that in a terminal that is using cp1252 (aka "Windows Latin1"), the resulting output is:
    The word "Tšekissä" matches the pattern (?-xism:\A\w+\z)
    The word "Žena" matches the pattern (?-xism:\A\w+\z)
    The word "Œdipus" matches the pattern (?-xism:\A\w+\z)
    The word "Rex" matches the pattern (?-xism:\A\w+\z)
UPDATE: To clarify, the point here is that when it comes to matching characters outside the ASCII range, regex constructs like '\w' will only employ Unicode semantics, not cp1252 or any other encoding's semantics, so they need to operate on strings that have perl's internal utf8 flag set (i.e. strings that have been decoded from their "external" forms, whether by reading through the appropriate I/O layer or by explicit decoding).
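A minimal sketch of that point, for illustration only. It assumes perl's default regex behaviour (no "use locale", no "unicode_strings" feature, no /u or /a modifier), writes the cp1252 bytes as hex escapes so the result does not depend on how the file is saved, and uses an in-memory filehandle as a stand-in for a real cp1252 file. The names @raw, @decoded, @layered and verdict() are made up for this example:

```perl
#!perl
use strict;
use warnings;
use Encode qw(decode);

# Raw cp1252 bytes for "Tšekissä", "Žena" and "Œdipus", written as
# escapes so the example does not depend on the source file's encoding.
my @raw = ( "T\x9Aekiss\xE4", "\x8Eena", "\x8Cdipus" );

# Route 1: explicit decoding of the byte strings.
my @decoded = map { decode( "cp1252", $_ ) } @raw;

# Route 2: read the same bytes through an :encoding layer.
# (An in-memory filehandle stands in for a real cp1252 file here.)
my $bytes = join( "\n", @raw ) . "\n";
open( my $fh, "<:encoding(cp1252)", \$bytes ) or die "open failed: $!";
chomp( my @layered = <$fh> );
close $fh;

# Hypothetical helper: report whether a string is all "word" characters.
sub verdict {
    my ($string) = @_;
    return $string =~ /\A\w+\z/ ? "matches" : "doesn't match";
}

# Print as UTF-8 here; use whatever encoding your terminal expects.
binmode STDOUT, ":encoding(UTF-8)";

for my $i ( 0 .. $#raw ) {
    printf "%s: raw bytes %s, decoded %s, read via layer %s\n",
        $decoded[$i],
        verdict( $raw[$i] ),
        verdict( $decoded[$i] ),
        verdict( $layered[$i] );
}
```

Under those assumptions the undecoded byte strings should fail the pattern, while both decoded forms should pass and compare equal to each other, since either route produces the same character string.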

Re^2: Windows-1252 characters from \x{0080} thru \x{009f}
by Jim (Curate) on Apr 19, 2012 at 05:34 UTC

    Thank you very much, graff. Your reply filled in the all-important How-do-you-do-it? gap.

    Jim