PerlMonks
Re^9: UTF8 versus \w in pattern matching
by haj (Vicar) on Jul 06, 2021 at 18:21 UTC ( [id://11134715]=note )
It doesn't look like UTF-8 because it isn't supposed to look like UTF-8. It has nothing to do with legacy 8-bit. Data::Dumper shows Unicode codepoints, not encodings.

If you open the file in Emacs, it will use your preferred coding system to interpret the data; for current Emacsen this is UTF-8. However, Emacs will fall back to ISO-8859-1 if the file doesn't contain valid UTF-8. Look at the Emacs mode line: if the first character is U, the file is UTF-8; if it is 1, it is ISO-8859-1. You can enforce the encoding in Emacs with C-x RET f ISO-8859-1 RET.

If you save the file in this encoding and run it, Perl will croak because you said use utf8; and your source code isn't valid UTF-8. If you then omit use utf8; with ISO-8859-1 encoding and run the file, you'll get $VAR1 = 't�n'; because now it is your terminal which expects UTF-8 and gets an 8-bit character. If you then add use Encode; and change the last line to print encode('UTF-8',Dumper($a)); (as you should when using a UTF-8 terminal), then you'll get $VAR1 = 'tón';

I don't recommend Data::Dumper for such diagnostics because it might or might not use \x{} notation, as you just saw. It isn't easy, but it is rather straightforward if you keep track of the different places where encoding might occur.
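
As an alternative to Data::Dumper for this kind of diagnosis, here is a minimal sketch that prints the codepoints of a string and the bytes it turns into under two encodings. The variable names are my own, and the string is written with a \x{} escape (an assumption on my part, so that the result doesn't depend on how the source file itself is saved). All output is plain ASCII, so the terminal's encoding doesn't get in the way.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Built from an escape, so the encoding of this source file is irrelevant.
my $str = "t\x{f3}n";    # "tón": U+00F3 is LATIN SMALL LETTER O WITH ACUTE

# Codepoints of the character string (what Perl holds internally, logically).
my $codepoints = join ' ', map { sprintf('U+%04X', ord($_)) } split //, $str;

# Bytes that the same string becomes when encoded for output.
my $utf8_hex   = join ' ', map { sprintf('%02X', ord($_)) }
                 split //, encode('UTF-8', $str);
my $latin1_hex = join ' ', map { sprintf('%02X', ord($_)) }
                 split //, encode('ISO-8859-1', $str);

print "characters:       ", length($str), "\n";
print "codepoints:       $codepoints\n";
print "UTF-8 bytes:      $utf8_hex\n";
print "ISO-8859-1 bytes: $latin1_hex\n";
```

The codepoint line stays the same whatever the output encoding is; only the byte lines differ (C3 B3 versus F3 for the ó). That is the distinction between characters and encodings made above.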