http://www.perlmonks.org?node_id=1194998

karlgoethebier has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

After reading Encoding horridness, I played around a bit:

#!/usr/bin/env perl
# $Id: weird.pl,v 1.3 2017/07/13 09:06:54 karl Exp karl $
use strict;
use warnings;
use feature qw(say);

my $file = q(weird.txt);
open( my $fh, '>', $file );
binmode $fh, ':encoding(UTF-8)';
say $fh qq(nase\ngöre);
close $fh;

say qx (file -I $file);
say qx(echo \$LANG);
say qx(cat $file);

open( $fh, '<', $file );
binmode $fh, ':encoding(UTF-8)';
say <$fh>;
close $fh;

__END__

This is leading to:

karls-mac-mini:monks karl$ ./weird.pl
weird.txt: text/plain; charset=utf-8
de_DE.UTF-8
nase
göre
nase
göre

And if I say use utf8; I get:

karls-mac-mini:monks karl$ ./weird.pl
weird.txt: text/plain; charset=utf-8
de_DE.UTF-8
nase
göre
nase
g?re

What am I missing?

Thanks for any hint and best regards, Karl

Update: Two working solutions:

Update 2: Sorry, I got the credits wrong.

1nickt:

#!/usr/bin/env perl
# $Id: weird_1nickt.pl,v 1.2 2017/07/13 17:10:29 karl Exp karl $
use strict;
use warnings;
use feature qw(say);
use utf8;

my $file = q(weird.txt);
open( my $fh, '>', $file );
binmode $fh, ':encoding(UTF-8)';
say $fh qq(nase\ngöre);
close $fh;

say qx (file -I $file);
say qx(echo \$LANG);
say qx(cat $file);

open( $fh, '<', $file );
binmode $fh, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
say <$fh>;
close $fh;

__END__

karls-mac-mini:monks karl$ ./weird_1nickt.pl
weird.txt: text/plain; charset=utf-8
de_DE.UTF-8
nase
göre
nase
göre

choroba:

#!/usr/bin/env perl
# $Id: weird_choroba.pl,v 1.2 2017/07/13 20:47:38 karl Exp karl $
use strict;
use warnings;
use feature qw(say);
use utf8;
use open IO => ':encoding(utf-8)', ':std';

my $file = q(weird.txt);
open( my $fh, '>', $file );
say $fh qq(nase\ngöre);
close $fh;

say qx (file -I $file);
say qx(echo \$LANG);
say qx(cat $file);

open( $fh, '<', $file );
say <$fh>;
close $fh;

__END__

karls-mac-mini:monks karl$ ./weird_choroba.pl
weird.txt: text/plain; charset=utf-8
de_DE.UTF-8
nase
göre
nase
göre

«The Crux of the Biscuit is the Apostrophe»

perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help

Re: Encoding horridness revisited: What's going on here?
by 1nickt (Canon) on Jul 13, 2017 at 13:14 UTC

    Hi Karl,

    What am I missing?

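    As I understand it, you missed telling Perl to encode your STDOUT output as UTF-8 before printing what you read back from the file the second time. Reading through the :encoding(UTF-8) layer decodes the data into characters, so it has to be encoded again on the way out. You don't need to (and should not) do this when printing the output of cat: qx returns the bytes as they are on disk, which are already UTF-8, and your terminal handles those correctly.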

    use strict;
    use warnings;
    use feature qw(say);
    use utf8;    # <-- needed, since you have high characters in your source

    my $file = q(weird.txt);
    open( my $fh, '>', $file );
    binmode $fh, ':encoding(UTF-8)';
    say $fh qq(nase\ngöre);
    close $fh;

    say qx (file -I $file);
    say qx(echo \$LANG);
    say qx(cat $file);

    open( $fh, '<', $file );
    binmode $fh, ':encoding(UTF-8)';
    binmode STDOUT, ':encoding(UTF-8)';    # <-- here
    say <$fh>;
    close $fh;

    __END__
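A minimal sketch of why the layer on the output handle matters. It prints the same four-character string to two in-memory handles, one without an encoding layer (standing in for a default STDOUT) and one with :encoding(UTF-8), and then shows the bytes each received. The in-memory handles and the variable names are assumptions made for illustration; they are not part of the scripts in this thread.

```perl
use strict;
use warnings;

# Four characters; \x{f6} is LATIN SMALL LETTER O WITH DIAERESIS.
my $text = "g\x{f6}re";

# Handle with no encoding layer, standing in for a default STDOUT.
my $plain = '';
open( my $no_layer, '>:raw', \$plain ) or die "Cannot open in-memory handle: $!";
print {$no_layer} $text;
close $no_layer;

# Handle with an explicit UTF-8 layer, as after
# binmode STDOUT, ':encoding(UTF-8)'.
my $encoded = '';
open( my $with_layer, '>:encoding(UTF-8)', \$encoded )
    or die "Cannot open in-memory handle: $!";
print {$with_layer} $text;
close $with_layer;

printf "no layer:    %s\n", unpack 'H*', $plain;      # 67f67265
printf "UTF-8 layer: %s\n", unpack 'H*', $encoded;    # 67c3b67265
```

Without a layer the character goes out as the single byte F6, which is not valid UTF-8 on its own, so a UTF-8 terminal will typically show a replacement character such as "?". With the layer it becomes the two-byte sequence C3 B6.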


    The way forward always starts with a minimal test.
      Or just add
      use utf8;
      use open IO => ':encoding(UTF-8)', ':std';

      and remove all binmode calls and the :encoding layers in the open calls.
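A small sketch of what the open pragma does for file handles, under a few assumptions: it uses File::Temp only to get a scratch path, writes the string with an escape instead of a literal umlaut (so use utf8 is not needed here), and leaves out ':std' because nothing non-ASCII is printed to the terminal. The variable names are made up for the example.

```perl
use strict;
use warnings;
use open IO => ':encoding(UTF-8)';    # default layers for open() in this scope
use File::Temp qw(tempfile);

my $text = "g\N{U+00F6}re";           # four characters

# tempfile() is used only to obtain a path; its own handle is closed at once.
my ( $tmp_fh, $path ) = tempfile( UNLINK => 1 );
close $tmp_fh;

# No binmode and no :encoding in the mode: the pragma supplies the layer.
open( my $out, '>', $path ) or die "Cannot write $path: $!";
print {$out} $text;
close $out;

open( my $in, '<', $path ) or die "Cannot read $path: $!";
my $chars = do { local $/; <$in> };
close $in;

# An explicit layer in the mode is used instead of the pragma's default,
# so this reads the bytes exactly as they are on disk.
open( my $raw, '<:raw', $path ) or die "Cannot read $path: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

printf "characters read back: %d, bytes on disk: %d\n",
    length $chars, length $bytes;
```

The string round-trips as four characters while the file holds five bytes, which is what the explicit binmode calls achieved before.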

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

        This works! I'll try the solution from 1nickt and update the OP. Very nice!

        Thank you very much, Karl


      This works as well! Please see my update above.

      Thank you very much, Karl


Re: Encoding horridness revisited: What's going on here?
by choroba (Cardinal) on Jul 13, 2017 at 09:47 UTC
    In what encoding did you save the source file? What encoding does your terminal use?

      emacs:

      M-x describe-current-coding-system
      Coding system for saving this buffer:
        u -- mule-utf-8-unix
      Default coding system (for new files):
        u -- mule-utf-8 (alias: utf-8)
      Coding system for keyboard input:
        u -- utf-8 (alias of mule-utf-8)
      Coding system for terminal output:
        u -- utf-8 (alias of mule-utf-8)
      Defaults for subprocess I/O:
        decoding: u -- mule-utf-8 (alias: utf-8)
        encoding: u -- mule-utf-8 (alias: utf-8)

      bash:

      karls-mac-mini:monks karl$ locale
      LANG="de_DE.UTF-8"
      LC_COLLATE="de_DE.UTF-8"
      LC_CTYPE="de_DE.UTF-8"
      LC_MESSAGES="de_DE.UTF-8"
      LC_MONETARY="de_DE.UTF-8"
      LC_NUMERIC="de_DE.UTF-8"
      LC_TIME="de_DE.UTF-8"
      LC_ALL=


        Can you trust your terminal emulator to properly handle the output?

        To me, encoding issues are always a wild goose chase, so I like to eliminate as many things from the encoding dance as quickly as possible. Usually that means that instead of including umlauts (or whatever) in my source code, I use the character names instead:

        # instead of
        use utf8;
        my $s = "göre";

        # I prefer to use
        use charnames ':full';    # loaded automatically for \N{...} since Perl 5.16
        my $s = "g\N{LATIN SMALL LETTER O WITH DIAERESIS}re";

        This eliminates the issue that my text editor is lying to me.

        When inspecting the output, I either pipe it through hexdump or through Data::Dumper with $Data::Dumper::Useqq = 1; so the console only sees 7-bit ASCII.

        This eliminates my terminal emulator lying to me.
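A short sketch combining both habits, assuming the string from this thread: the source stays pure ASCII thanks to the character name, and the result is inspected as an ASCII-only dump plus a hex string of its UTF-8 bytes. The use of Encode and unpack here is a stand-in for piping through hexdump; the variable names are illustrative.

```perl
use strict;
use warnings;
use charnames ':full';    # needed for \N{NAME} on perls older than 5.16
use Data::Dumper;
use Encode qw(encode);

# Pure ASCII source: the editor's encoding no longer matters.
my $s = "g\N{LATIN SMALL LETTER O WITH DIAERESIS}re";

# Useqq makes Dumper escape everything outside printable ASCII.
$Data::Dumper::Useqq = 1;
my $dump = Dumper($s);

# Hex view of the bytes the string has once encoded as UTF-8.
my $hex = unpack 'H*', encode( 'UTF-8', $s );

print $dump;
print "$hex\n";           # 67c3b67265
```

The exact escape Dumper picks for the umlaut can differ between Data::Dumper versions, but the dump contains only ASCII either way, so the terminal cannot reinterpret it.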

        Of course, that does not help with reading data from files that I don't control, but every little step helps.

Re: Encoding horridness revisited: What's going on here?
by gandolf989 (Scribe) on Jul 13, 2017 at 15:12 UTC
    I had a similar issue. I was trying to send a plain text email from Linux. I tried various Perl modules, mail, sendmail, etc. I then realized that Outlook was formatting my email. If you use Outlook 2016, open one of your test emails, go to File, Properties and look at the header information. You should see something like this: Content-Type: text/html; charset="ISO-8859-1". If you find that the encoding is correct, then you need to check what Outlook is doing with your email. On your Outlook screen under File, Options, Mail you will see a button for Stationery and Fonts. Then select the font button for composing and reading plain text messages and make sure that it is set to Courier New and whatever font size you like. Then try opening one of your plain text emails and see if your white space formatting is OK.