
Re^9: UTF8 versus \w in pattern matching

by haj (Vicar)
on Jul 06, 2021 at 18:21 UTC


in reply to Re^8: UTF8 versus \w in pattern matching
in thread UTF8 versus \w in pattern matching

It doesn't look like UTF-8 because it isn't supposed to look like UTF-8.

It has nothing to do with legacy 8-bit.

Data::Dumper shows Unicode codepoints and not encodings.
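To make the distinction between code points and encodings concrete, here is a minimal sketch. It assumes the string from this thread is 'tón' and builds it from code points so the result does not depend on how the source file is saved; hex_dump is a hypothetical helper, not part of any module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Character string: "t", U+00F3 (o with acute), "n".
my $chars = "t\x{f3}n";

# The same text encoded as UTF-8 bytes.
my $bytes = encode('UTF-8', $chars);

# Hypothetical helper: list each element of a string as a hex value.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, shift }

print length($chars), ' characters: ', hex_dump($chars), "\n";
print length($bytes), ' bytes:      ', hex_dump($bytes), "\n";
```

The character string has one element for U+00F3, while the encoded string carries the two bytes C3 B3 in its place. A dump of the first form is not supposed to look like UTF-8.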

If you open the file in Emacs, it will use your preferred coding system to interpret the data; for current Emacsen this is UTF-8. However, Emacs will fall back to ISO-8859-1 if the file doesn't contain valid UTF-8. Look at the Emacs mode line: if the first character is U, the buffer is UTF-8; if it is 1, it is ISO-8859-1.
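The question "is this valid UTF-8?" can also be asked from Perl. The following is only a sketch of that check, not a description of how Emacs detects encodings; is_valid_utf8 is a hypothetical helper built on the core utf8::decode function, which returns false when its argument is not well-formed UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;    # copy, because utf8::decode modifies its argument
    return utf8::decode($bytes) ? 1 : 0;
}

# The same text, 'tón', as it would be stored on disk in each encoding.
my @samples = (
    [ 'ISO-8859-1 bytes' => "t\xF3n" ],
    [ 'UTF-8 bytes'      => "t\xC3\xB3n" ],
);

for my $sample (@samples) {
    my ($label, $bytes) = @$sample;
    printf "%-17s %s\n", "$label:",
        is_valid_utf8($bytes) ? 'valid UTF-8' : 'not valid UTF-8';
}
```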

You can force the encoding in Emacs with C-x RET f ISO-8859-1 RET. If you save the file in this encoding and run it, Perl will croak, because you said use utf8; and your source code is no longer valid UTF-8.

If you then remove use utf8;, keep the ISO-8859-1 encoding, and run the file, you'll get $VAR1 = 't�n'; because now it is your terminal that expects UTF-8 and instead gets a single 8-bit character.

If you then add use Encode; and change the last line to print encode('UTF-8',Dumper($a)); (as you should when using a UTF-8 terminal), then you'll get $VAR1 = 'tón';

I don't recommend Data::Dumper for such diagnostics because it might or might not use \x{} notation, as you just saw. Handling encodings isn't easy, but it is rather straightforward if you keep track of the different places where encoding might occur.

Replies are listed 'Best First'.
Re^10: UTF8 versus \w in pattern matching
by pryrt (Abbot) on Jul 06, 2021 at 18:49 UTC
    If you then add use Encode; and change the last line to print encode('UTF-8',Dumper($a)); (as you should when using a UTF-8 terminal), then you'll get $VAR1 = 'tón';

    Assuming the real code is going to use more than one print statement, this suggestion requires calling encode() for every print, which is not DRY programming. Alternative: call binmode STDOUT, ':encoding(UTF-8)'; once, before any print statement, and then use normal print statements (like print Dumper($a);) throughout. This lets the I/O layer handle the translation from Perl's internal representation to UTF-8-encoded output.
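A small sketch comparing the two approaches. It assumes in-memory filehandles as stand-ins for STDOUT so the resulting bytes can be compared directly; on a real terminal the call would be binmode STDOUT, ':encoding(UTF-8)'; as described above. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "t\x{f3}n\n";    # character string: t, U+00F3, n, newline

# Variant 1: encode by hand at every print.
open my $manual, '>', \my $manual_out or die "open failed: $!";
print {$manual} encode('UTF-8', $text);
close $manual;

# Variant 2: set the encoding layer once, then print normally.
open my $layered, '>', \my $layered_out or die "open failed: $!";
binmode $layered, ':encoding(UTF-8)';
print {$layered} $text;
close $layered;

# Compare what each handle actually received.
print $manual_out eq $layered_out ? "same bytes\n" : "different bytes\n";
```

With the layer in place, every later print on that handle is encoded without any further calls to encode().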

      Too many moving parts!!! One should be using the following here:

      local $Data::Dumper::Useqq = 1; print(Dumper($a));

      Fix the problems until you get the correct string (one that contains "\x{e9}" or "\351" for "é"). Then worry about the output to the terminal.
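A minimal sketch of that check, assuming the string under test is 'tón' (U+00F3) as elsewhere in this thread. It dumps both a correctly decoded string and its UTF-8-encoded bytes so the two can be told apart without involving the terminal's encoding.

```perl
use strict;
use warnings;
use Data::Dumper;
use Encode qw(encode);

# Escape every non-ASCII element so the dump itself is pure ASCII.
local $Data::Dumper::Useqq = 1;

my $decoded = "t\x{f3}n";                    # 3 characters
my $encoded = encode('UTF-8', $decoded);     # 4 bytes

my $dump_decoded = Dumper($decoded);
my $dump_encoded = Dumper($encoded);

print $dump_decoded;    # a single escape for U+00F3
print $dump_encoded;    # two escapes, one per UTF-8 byte
```

The decoded string should show one escape for the accented letter, written either as \363 or as \x{f3} depending on how Perl stores the string internally. Two escapes in that position mean the data is still encoded.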

