toohoo has asked for the wisdom of the Perl Monks concerning the following question:

Dear wise Monks,

I may have misunderstood how the Encode package works.

    use v5.10;    # enables say
    use Encode;
    use Data::Dumper;

    my $temp = encode( "iso-8859-1", 'Köln' );
    say Dumper "========== encode ==========";
    say $temp, "(", length($temp), ")";

I get

Köln(5)

.. which tells me that this is the internal Perl representation of the string in UTF-8 encoding. What am I doing wrong?

Thanks in advance

Replies are listed 'Best First'.
Re: possible missunderstanding of package Encode
by duelafn (Vicar) on Oct 20, 2015 at 11:09 UTC

    Your understanding of Encode is correct; the issue is how the original string gets into your program. Without further instructions, the expression 'Köln' produces bytes in whatever encoding your script file is saved in, not a decoded string. There are several ways to fix this:

    1. Tell perl that all of your hard-coded strings are INPUT as utf8 (Note: I'm certain that your editor is set up for UTF-8 input from the output you received, but you should convince yourself of that too and then learn how to configure it)

        use utf8; # All hard-coded strings will be assumed to be UTF-8
        my $temp = encode( "iso-8859-1", 'Köln' );
        ...

    2. Tell perl that this one string was input as utf8 (again, it is UTF-8 because that is what your editor produces)

        my $temp = encode( "iso-8859-1", decode("UTF-8", 'Köln') );
        ...

    The second case most closely resembles what happens when you process a file or command-line arguments:

        # Files (change input encoding to match file encoding):
        open my $F, "<:encoding(UTF-8)", "myfile" or die "Error reading myfile: $!";
        my $line = <$F>; # $line contains a decoded string
        say encode( "iso-8859-1", $line );

        # Command-Line args:
        my $arg = decode("UTF-8", $ARGV[0]);

        # Or, command-line args is an appropriate use of Encode::Locale
        use Encode::Locale;
        my $arg = decode("locale", $ARGV[0]);

    Your output of "Köln(5)" tells us that your editor and your terminal are in UTF-8 encoding and $temp is double-encoded mojibake (just much less spectacularly obvious than usual mojibake).
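    To make the double-encoding visible, here is a minimal sketch that writes the source bytes out explicitly with \x escapes, so it does not depend on how any editor saves the file. The helper name hex_bytes is made up for illustration and is not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: show a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join " ", map { sprintf "%02X", ord($_) } split //, $bytes;
}

# The five bytes a UTF-8 editor stores for 'Köln' when "use utf8" is absent.
my $source_bytes = "K\xC3\xB6ln";

# Without decoding, Perl sees five Latin-1 characters, so encoding to
# iso-8859-1 hands back the same five bytes.
my $undecoded = encode( "iso-8859-1", $source_bytes );

# Decoding first yields four characters; encode then emits the single
# byte 0xF6 for the o-umlaut.
my $decoded_first = encode( "iso-8859-1", decode( "UTF-8", $source_bytes ) );

printf "undecoded:     %d bytes: %s\n", length($undecoded),     hex_bytes($undecoded);
printf "decoded first: %d bytes: %s\n", length($decoded_first), hex_bytes($decoded_first);
```

    The first line reports 5 bytes (4B C3 B6 6C 6E), which a UTF-8 terminal happens to render as "Köln"; the second reports the 4 bytes (4B F6 6C 6E) that real iso-8859-1 output should contain.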

    Just keep in mind that once you decide to care about encoding: All input must be first decoded somehow (including strings input directly into program), then it must be encoded before output. If you find odd issues with encoding, ask where it was decoded and where it was encoded (and then ask yourself whether it was decoded or encoded twice).
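    As a rough sketch of that decode-on-input, encode-on-output rule, the following uses in-memory filehandles as stand-ins for real files (an assumption made only so the example is self-contained). The variable names are illustrative.

```perl
use strict;
use warnings;

# Stand-in for a UTF-8 encoded file on disk: raw bytes behind an
# in-memory filehandle.
my $utf8_file = "K\xC3\xB6ln\n";

# Input: the :encoding(UTF-8) layer decodes while reading.
open my $in, "<:encoding(UTF-8)", \$utf8_file or die "Cannot open input: $!";
my $line = <$in>;
close $in;
chomp $line;

# Inside the program we work with characters, so length counts 4.
my $char_count = length($line);

# Output: the :encoding(iso-8859-1) layer encodes while writing.
my $latin1_file = "";
open my $out, ">:encoding(iso-8859-1)", \$latin1_file or die "Cannot open output: $!";
print {$out} $line;
close $out;

printf "characters read: %d, bytes written: %d\n", $char_count, length($latin1_file);
```

    With the layers doing the work at the edges of the program, no explicit encode or decode call is needed in between.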

    Good Day,
        Dean

      Note: I'm certain that your editor is set up for UTF-8 input from the output you received, but you should convince yourself of that too and then learn how to configure it

      I do confirm that:

      In Notepad++ on Windows, when the editor is set to utf8:

          use v5.10;
          use Data::Dumper;
          use Devel::Peek;
          say join " ", unpack "C*", 'Köln';
          Dump 'Köln';

          75 195 182 108 110
          # note that ö has been mapped to two bytes, extended ASCII values 195 and 182
          SV = PV(0x27c7b54) at 0x8eef84
            REFCNT = 1
            FLAGS = (PADTMP,POK,READONLY,pPOK)
            PV = 0x8f4724 "K\303\266ln"\0
            CUR = 5
            LEN = 8
      When the editor is set to iso-8859-1:

          use v5.10;
          use Data::Dumper;
          use Devel::Peek;
          say join " ", unpack "C*", 'Köln';
          Dump 'Köln';

          75 246 108 110
          # note that ö is represented with a single byte, extended ASCII decimal 246
          SV = PV(0x2813394) at 0x20cef84
            REFCNT = 1
            FLAGS = (PADTMP,POK,READONLY,pPOK)
            PV = 0x20d4724 "K\366ln"\0
            CUR = 4
            LEN = 8
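      If you would rather let Perl check which of the two cases you are in, a strict decode can serve as a rough test. This is only a sketch: looks_like_utf8 is a made-up helper name, and plain ASCII (or the rare Latin-1 text that happens to be well-formed UTF-8) will also pass.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 when the byte string is well-formed UTF-8,
# 0 otherwise.
sub looks_like_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC stops
    # decode() from modifying $bytes in place.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $ok = eval { decode( "UTF-8", $bytes, $check ); 1 };
    return $ok ? 1 : 0;
}

# Bytes as saved by an editor set to UTF-8 and to iso-8859-1 respectively.
my %samples = (
    "utf8 editor"   => "K\xC3\xB6ln",
    "latin1 editor" => "K\xF6ln",
);

for my $name ( sort keys %samples ) {
    printf "%-13s well-formed UTF-8: %s\n", $name,
        looks_like_utf8( $samples{$name} ) ? "yes" : "no";
}
```

      The lone byte 0xF6 is not a valid UTF-8 sequence, so the iso-8859-1 sample fails the strict decode while the UTF-8 sample passes.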

      Dear Dean,

      you are my hero of the day! This was just what I needed to handle the variable/input. I put this in my test script and the output opened my eyes:

          #!/usr/bin/perl
          use v5.10;
          use Encode;
          use Data::Dumper;

          my $temp = encode( "iso-8859-1", 'Köln' );
          say Dumper "========== encode string ==========";
          say $temp, "(", length($temp), ")";

          my $VUOrt0 = 'Köln';
          $temp = encode( "iso-8859-1", $VUOrt0 );
          say Dumper "========== encode scalar variable ==========";
          say $temp, "(", length($temp), ")";

          $temp = encode( "iso-8859-1", decode("UTF-8", $VUOrt0) );
          say Dumper "========== decode encode scalar variable ==========";
          say $temp, "(", length($temp), ")";

          if ( $temp =~ /ö/ ) {
              say "found 'ö'";
          } else {
              say "did NOT find 'ö'";
          }
          if ( $temp =~ /\xF6/ ) {
              say "found '\xF6'";
          } else {
              say "did NOT find '\xF6'";
          }
          for ( my $i = 0; $i < length($temp); $i++ ) {
              say substr( $temp, $i, 1), "(", length(substr( $temp, $i, 1)), ")";
          }

      If you run the script, you will see what I mean. And yes, your assumption was right. I am working in a VirtualBox VM with Ubuntu 13.10. My editor is Geany and its default encoding seems to be UTF-8; I checked this on several scripts I use. The shell is simply the terminal.

      Many thanks and have a nice day, Thomas

Re: possible missunderstanding of package Encode
by Anonymous Monk on Oct 20, 2015 at 09:59 UTC

    .. which tells me that this is the internal Perl representation of the string in UTF-8 encoding. What am I doing wrong?

    It's unclear what you think is going on

    See perlunitut: Unicode in Perl

    The default encoding is something like latin-1, it's not utf-8. So you start with some latin-1 string, encode it as latin-1 (nothing changes), then you use length, and you're confused :)

    See this: only once you "decode" do you have an actual perl "unicode string"; until then it's "binary" (latin1)

        #!/usr/bin/perl --
        use strict;
        use warnings;
        use Devel::Peek;
        use Data::Dump;
        use Encode;

        our $f = "K\366ln";
        sub ff { dd($f); Dump($f); }

        ff;
        $f = encode('iso-8859-1', $f); # bytes encoded as latin1
        ff;
        $f = encode('UTF-8', $f); # bytes encoded as utf8
        ff;
        $f = decode('UTF-8', $f); # unicode string
        ff;
        __END__
        "K\xF6ln"
        SV = PVNV(0xb18114) at 0x99b8f4
          REFCNT = 1
          FLAGS = (POK,pIOK,pNOK,pPOK)
          IV = 0
          NV = 0
          PV = 0xada504 "K\366ln"\0
          CUR = 4
          LEN = 12
        "K\xF6ln"
        SV = PVNV(0xb18114) at 0x99b8f4
          REFCNT = 1
          FLAGS = (POK,pIOK,pNOK,pPOK)
          IV = 0
          NV = 0
          PV = 0xb2f424 "K\366ln"\0
          CUR = 4
          LEN = 12
        "K\xC3\xB6ln"
        SV = PVNV(0xb18114) at 0x99b8f4
          REFCNT = 1
          FLAGS = (POK,pIOK,pNOK,pPOK)
          IV = 0
          NV = 0
          PV = 0xb2f3dc "K\303\266ln"\0
          CUR = 5
          LEN = 12
        "K\xF6ln"
        SV = PVMG(0xacf3cc) at 0x99b8f4
          REFCNT = 1
          FLAGS = (SMG,POK,pIOK,pNOK,pPOK,UTF8)
          IV = 0
          NV = 0
          PV = 0xb26374 "K\303\266ln"\0 [UTF8 "K\x{f6}ln"]
          CUR = 5
          LEN = 12
          MAGIC = 0xae478c
            MG_VIRTUAL = &PL_vtbl_utf8
            MG_TYPE = PERL_MAGIC_utf8(w)
            MG_LEN = 4

      Hello,

      I may not have expressed myself clearly. The first value that should be used is:

      'Köln'

      .. as written there in single quotes. There may or may not be a later assignment to the scalar variable from a database or elsewhere, but the first assignment should work as well as any later one. When Perl tells me that the length is 5, then in my eyes this is not correct iso-8859-1, because in that case it should be only 4 characters. This means that, regardless of what the variable holds at runtime, encode should convert it to the ANSI or ASCII representation. And yes, I know that there is a difference between these two. But the character 'ö' should be only one byte, not two. I hope I have expressed myself more clearly now.

      thanks

      The last version of my test-script so far:

          #!/usr/bin/perl
          use v5.10;
          use Encode;
          use Data::Dumper;

          my $temp = encode( "iso-8859-1", 'Köln' );
          say Dumper "========== encode string ==========";
          say $temp, "(", length($temp), ")";

          my $VUOrt0 = 'Köln';
          $temp = encode( "iso-8859-1", $VUOrt0 );
          say Dumper "========== encode scalar variable ==========";
          say $temp, "(", length($temp), ")";

        I may not have expressed myself clearly. The first value that should be used is: 'Köln'

        ;) That's the exact value I used. All of the values produced by encode/decode in my program are exactly 'Köln': the latin1 and binary version and the utf8 version, they're all 'Köln'

        When Perl tells me that the length is 5, then in my eyes this is not correct iso-8859-1, because in that case it should be only 4 characters. ...

        say Dumper "========== encode string ==========";

        Why are you looking at "length" at all?

        You start with unknown bytes (either utf8 or latin1), and perl treats them as bytes or latin1. Whether it's 4 or 5 doesn't matter: it's not a "unicode string", it's a binary string or a latin1 string

        Then you encode this string to latin1 explicitly, so now it's bytes for sure. This time it makes no sense to look at length: it's the length of the bytes, whatever they are, and since you don't know what you started with, the new length doesn't matter
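        To illustrate what length measures at each stage, here is a small sketch that starts from a string that really has been decoded (the starting bytes are written with \x escapes, an assumption made so the result does not depend on the editor):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A genuine character string: four characters, K, o-umlaut, l, n.
my $chars = decode( "UTF-8", "K\xC3\xB6ln" );

# The same text as encoded bytes in two encodings.
my $latin1_bytes = encode( "iso-8859-1", $chars );
my $utf8_bytes   = encode( "UTF-8",      $chars );

# length() counts characters on the decoded string and bytes on the
# encoded ones.
my %len = (
    characters   => length($chars),
    latin1_bytes => length($latin1_bytes),
    utf8_bytes   => length($utf8_bytes),
);

printf "%-12s %d\n", $_, $len{$_} for sort keys %len;
```

        Only once the input has been decoded does the count of 4 for the character string, and the 4 versus 5 for the two byte strings, mean anything.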

        Also, if you're going to Dumper anything, it should be data, not banners

        I/O flow (the actual 5 minute tutorial)