in reply to Windows-1252 characters from \x{0080} thru \x{009f}
tye has covered most of the important stuff. I'll just add that in order for your first code snippet to DWYM, it would have to go something like this (note the addition of "use Encode", setting the I/O layer on STDOUT, and applying "decode" to the literals being assigned to @words):

    #!perl
    use strict;
    use warnings;
    use Encode;

    # Encode everything printed to STDOUT as cp1252 for the terminal
    binmode STDOUT, ":encoding(cp1252)";

    my $pattern = qr/\A\w+\z/;

    # Decode the cp1252 bytes of each literal into Perl characters
    my @words = map { decode( "cp1252", $_ ) } qw( Tšekissä Žena Œdipus Rex );

    for my $word (@words) {
        my $result = $word =~ $pattern ? "matches" : "doesn't match";
        printf qq/The word "%s" %s the pattern %s\n/, $word, $result, $pattern;
    }

When I run that in a terminal that is using cp1252 (aka "Windows Latin1"), the resulting output is:

    The word "Tšekissä" matches the pattern (?-xism:\A\w+\z)
    The word "Žena" matches the pattern (?-xism:\A\w+\z)
    The word "Œdipus" matches the pattern (?-xism:\A\w+\z)
    The word "Rex" matches the pattern (?-xism:\A\w+\z)

UPDATE: To clarify, the point here is that when it comes to matching things outside the ASCII range, regex constructs like '\w' will only employ Unicode semantics, not cp1252 or any other semantics, so they need to operate on strings that have their perl-internal-utf8 flag set to true (i.e. have been decoded from "external" forms, whether by reading through the appropriate I/O layer, or by explicit decoding).
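To make the difference between raw bytes and decoded characters concrete, here is a minimal sketch. The helper name is_word and the use of an in-memory filehandle are my own choices for illustration, not anything from the original question. The bytes are written with \x escapes so the result does not depend on how the script file itself is saved.

```perl
#!perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the whole string consists of \w characters.
sub is_word { return $_[0] =~ /\A\w+\z/ ? 1 : 0 }

# "Tšekissä" as raw cp1252 bytes: \x9A is "š", \xE4 is "ä".
my $bytes = "T\x9Aekiss\xE4";

# Route 1: explicit decoding into Perl characters.
my $decoded = decode( "cp1252", $bytes );

# Route 2: reading through an I/O layer (an in-memory file stands in
# for a real cp1252-encoded input file here).
open( my $fh, "<:encoding(cp1252)", \$bytes ) or die "open failed: $!";
my $via_layer = <$fh>;
close $fh;

for my $case ( [ "raw bytes", $bytes ],
               [ "decode()",  $decoded ],
               [ "I/O layer", $via_layer ] ) {
    my ( $label, $string ) = @$case;
    printf "%-9s %s\n", $label,
        is_word($string) ? "matches" : "doesn't match";
}
```

The raw byte string should fail to match, because byte \x9A is not a word character until it has been interpreted as cp1252; both decoded strings should match.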
Replies are listed 'Best First'.
Re^2: Windows-1252 characters from \x{0080} thru \x{009f}
by Jim (Curate) on Apr 19, 2012 at 05:34 UTC
by Anonymous Monk on Apr 19, 2012 at 06:21 UTC