
Re: Lost in encodings

by haj (Curate)
on Feb 07, 2020 at 20:29 UTC (#11112585)


in reply to Lost in encodings

I was sort of expecting this to come as a follow-up to your previous article but didn't want to overcomplicate things :)

One of the issues with encoding is that it happens in so many places that quite often things look right while you actually have a cancellation of errors. Your example is no exception. So here are some points:

  • Perl's default text encoding is ISO-8859-1 which is a 1-byte encoding.
  • Contemporary terminals work with UTF-8 encoding. This also includes the terminal you are using in your debugging session.
  • The infamous UTF-8 flag and the is_utf8 function come with a warning:
    CAVEAT: If STRING has UTF8 flag set, it does NOT mean that STRING is UTF-8 encoded and vice-versa.
    In many cases the function tells you what you already know (that the data doesn't look like you expect them), in some cases it is just misleading.
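To make the caveat about the UTF8 flag concrete, here is a minimal sketch using only core modules. The variable names are made up for illustration. It shows that the flag describes Perl's internal storage, not what the string contains: two strings with identical characters can differ in the flag.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes    = "K\xC3\xBC";              # UTF-8 encoded bytes for "Kü"
my $chars    = decode('UTF-8', $bytes);  # two characters: K and U+00FC
my $latin1   = "K\xFC";                  # the same two characters, flag off
my $upgraded = $latin1;
utf8::upgrade($upgraded);                # same characters, flag now on

for my $case ( [ bytes => $bytes ], [ chars => $chars ],
               [ latin1 => $latin1 ], [ upgraded => $upgraded ] ) {
    my ($name, $string) = @$case;
    printf "%-9s length=%d flag=%s\n",
        $name, length($string), utf8::is_utf8($string) ? 'on' : 'off';
}

# Equality compares characters, not internal representation
print "latin1 eq chars:    ", ( $latin1 eq $chars    ? 'yes' : 'no' ), "\n";
print "latin1 eq upgraded: ", ( $latin1 eq $upgraded ? 'yes' : 'no' ), "\n";
```

The string in $bytes has length 3 and no flag, which matches "it is UTF-8 encoded but the flag is off". The strings in $latin1 and $upgraded compare equal although only one of them carries the flag.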

So, what's happening here? You read data with LWP. Though you haven't given the details, I guess you are using the content method to retrieve the data. This method always gives bytes. But wait: LWP can use the charset attribute from the Content-Type header to decode text into characters, and indeed it will do so if you use the decoded_content method instead.
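The difference between the two methods can be sketched without a network request. This assumes the HTTP::Response class (from HTTP::Message, which LWP depends on) is installed. The response below is built by hand and only stands in for what LWP would hand back; the body text is made up.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built response standing in for a real LWP result (assumption:
# the server declares charset=UTF-8 and sends UTF-8 encoded bytes).
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/plain; charset=UTF-8' ],
    "K\xC3\xBCnzler",
);

my $bytes = $response->content;          # undecoded bytes
my $chars = $response->decoded_content;  # characters, decoded per charset

printf "content:         length %d, hex %s\n",
    length($bytes), unpack('H*', $bytes);
printf "decoded_content: length %d, second character U+%04X\n",
    length($chars), ord(substr($chars, 1, 1));
```

With content the 'ü' occupies two positions (c3 bc), with decoded_content it is a single character U+00FC.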

The data is displayed correctly because you are feeding non-decoded bytes to a terminal which expects UTF-8-encoded bytes. Since your input was also UTF-8-encoded bytes, it looks fine. Your application is just a man in the middle which passes these data through.

Decoding the data is the correct way (which, as I wrote, LWP can do for you if you want). Perl then knows that the character in question is a 'ü'. Perl can handle this character in its default encoding, which is slightly unfortunate, because it will do so and print one byte for that character. This byte hits a terminal which expects UTF-8 encoded bytes, doesn't understand it and substitutes it with the Unicode replacement character.
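What actually leaves the program can be checked with in-memory filehandles, so no terminal gets in the way. This is a minimal sketch; bytes_written is a hypothetical helper name, not something from LWP or the debugger.

```perl
use strict;
use warnings;

# Hypothetical helper: print a string through a handle with the given
# layer and return the bytes that arrived on the other side.
sub bytes_written {
    my ($layer, $string) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return $buffer;
}

my $text = "k\x{FC}";   # decoded string: 'k' followed by 'ü'

my $without_layer = bytes_written('',                 $text);
my $with_layer    = bytes_written(':encoding(UTF-8)', $text);

print "no layer:         ", unpack('H*', $without_layer), "\n";
print ":encoding(UTF-8): ", unpack('H*', $with_layer),    "\n";
```

Without a layer the 'ü' goes out as the single byte fc, which a UTF-8 terminal cannot interpret. With the encoding layer it goes out as c3 bc.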

Now when you write the data, you need to encode it to UTF-8. I suppose (but didn't test right now) that MIME::Lite::TT::HTML does the right thing and encodes for you if you provide the Charset attribute on the constructor. =FC is QP-encoding for an ISO-8859-1 'ü' and indeed wrong here. So if you did provide Charset => 'utf8', then shout up, I'll write some tests.
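To see where =FC comes from, here is a small sketch with the core modules Encode and MIME::QuotedPrint. It does not use MIME::Lite::TT::HTML at all; it only compares the quoted-printable form of the same character in the two encodings. The empty string as second argument to encode_qp suppresses soft line breaks.

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::QuotedPrint qw(encode_qp);

my $char = "\x{FC}";   # the character 'ü' as a decoded Perl string

# Encode to bytes first, then apply quoted-printable to those bytes
my $qp_latin1 = encode_qp(encode('ISO-8859-1', $char), '');
my $qp_utf8   = encode_qp(encode('UTF-8',      $char), '');

print "ISO-8859-1: $qp_latin1\n";
print "UTF-8:      $qp_utf8\n";
```

A mail that declares charset UTF-8 should carry the second form; the first form in such a mail means the text was never encoded to UTF-8.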

As for handling the debugger: Since you are working with a UTF-8 terminal, you might want to try the following:

binmode DB::OUT,':utf8'; binmode DB::IN,':utf8'

This makes the debugger handle its I/O as UTF-8 encoded.

....and, because I just read the reply by LanX, I recommend against Data::Peek. It will tell you only what you already know ("that's not right") but not give guidance how to fix.

Re^2: Lost in encodings
by LanX (Cardinal) on Feb 07, 2020 at 20:36 UTC
    > ... I recommend against Data::Peek. It will tell you only what you already know ("that's not right") but not give guidance how to fix.

    It's true Devel::Peek gives no guidance why, but which non-AI command does? °

    And how can you know his console is using utf-8? Could be Windows and CP850.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

    Update

    °) Those commands will show me the hex codes in ASCII which is correctly displayed by every terminal (plus the monastery)

      > And how can you know his console is using utf-8? Could be Windows and CP850

      I wouldn't claim I know. But since length 'Kü' is 3 but displays as 'Kü', I just guessed that a multibyte encoding is in place. CP850 is a 1-byte encoding and should behave differently.

      As for Devel::Peek: Those commands will show me the hex codes in ASCII

      Devel::Peek will also issue several lines of data which are totally useless unless you're debugging XS code or Perl itself. A decent print unpack 'H*',$data does the same with less fuss.
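A minimal sketch of that kind of lightweight inspection, using core Perl only. The helper names hex_bytes and codepoints are made up for illustration. One helper is meant for byte strings, the other for decoded strings.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helpers for a quick look at a string
sub hex_bytes  { join ' ', unpack('(H2)*', $_[0]) }
sub codepoints { join ' ', map { sprintf('U+%04X', ord $_) } split //, $_[0] }

my $bytes = "K\xC3\xBC";              # what an undecoded UTF-8 source gives
my $chars = decode('UTF-8', $bytes);  # what you get after decoding

print "bytes:      ", hex_bytes($bytes),  "\n";
print "codepoints: ", codepoints($chars), "\n";
```

Both outputs are pure ASCII, so they display the same on any terminal regardless of its encoding.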

        For completeness–

        perl -Mutf8 -CSD -E 'say length "Kü"' # 2
        It's verbose but will include the utf8 flag plus the dump showing the codepoints in hex. °

        Which is more helpful for us than the OP's copy and paste.

        > A decent print unpack 'H*',$data does the same with less fuss.

        True, but unpack tells me "why" it went wrong? ;)

        Update

        > but displays as 'Kü',

        Provided code areas in the monastery are encoded in utf8. I vividly remember problems here. *

        Cheers Rolf

        *) the monastery is using windows-1252

        °)

        --- Testing: Täst:  "T\xE4st"
        SV = PVMG(0x29c3c98) at 0x29c1fa8
          REFCNT = 1
          FLAGS = (SMG,POK,pIOK,pNOK,pPOK,UTF8)
          IV = 0
          NV = 0
          PV = 0x24c9c68 "T\303\244st"\0 [UTF8 "T\x{e4}st"]
          CUR = 5
          LEN = 10
          MAGIC = 0x2ab1b38
            MG_VIRTUAL = &PL_vtbl_utf8
            MG_TYPE = PERL_MAGIC_utf8(w)
            MG_LEN = 4
        
        Hi again Harald

        > A decent print unpack 'H*',$data does the same with less fuss.

        Actually, why should I bother to spot the non-ASCII between all the hex-codes? °

        Please compare

        DB<50> $data = 'Künzler'
        DB<51> print unpack 'H*',$data
        4b816e7a6c6572                      # ORLY?
        DB<52> use Data::Dump qw/pp dd/
        DB<53> dd $data
        "K\x81nzler"                        # <---
        DB<54> use Devel::Peek
        DB<55> Dump $data
        SV = PVNV(0xd9adb8) at 0x351ac30
          REFCNT = 1
          FLAGS = (POK,IsCOW,pIOK,pNOK,pPOK)
          IV = 0
          NV = 0
          PV = 0x355bcd8 "K\201nzler"\0     # <---
          CUR = 7
          LEN = 10
          COW_REFCNT = 2
        DB<56>

        Hint: this time not UTF8, did you notice easily?

        Devel::Peek is core and shows several relevant pieces of information in one command.

        It has some minor disadvantages, but if the OP had shown us the output we'd have known immediately that his code is correct, except for the debugger settings.

        Telling people explicitly not to use it is pretty surprising ...

        Cheers Rolf

        °) yes I know that ASCII is below 0x80 and how to spot utf8 multi-bytes. But do others?

        And normally I use a water heater when I need tea and don't start to collect decent wood in the forest. ;-)

Re^2: Lost in encodings
by Skeeve (Parson) on Feb 07, 2020 at 23:41 UTC

    Thanks a lot haj. It sounds all good (and complicated). Will need some time to work it out.

    BTW: You're right. My Terminal (iTerm2) is UTF-8. The OS is MacOS.


    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
      I can only speculate as long as you don't show us a Dump of $str.

      > My Terminal (iTerm2) is UTF-8. The OS is MacOS.

      I think if the encoding of the output channel is byte oriented, this would explain your false negative results.

      IOW your decoding is right but the test is wrong.

      update

      Yep, tested on my Ubuntu VM, with utf8 console

      DB<2> use Devel::Peek
      DB<4> use Encode qw(decode encode)
      DB<10> $str="kü"
      DB<11> Dump $str
      SV = PV(0x28054a0) at 0x280a370
        REFCNT = 1
        FLAGS = (POK,pPOK)
        PV = 0x2838a60 "k\303\274"\0      # <-- 303 274 is octal for UTF-8 encoding of "ü" *
        CUR = 3
        LEN = 16
      DB<12> p $str
      kü
      DB<13> p $dec = decode("utf8",$str,Encode::FB_WARN)
      k
      DB<14> Dump $dec
      SV = PVMG(0x28e41f0) at 0x2a19368
        REFCNT = 1
        FLAGS = (POK,pPOK,UTF8)
        IV = 0
        NV = 0
        PV = 0x28fe5a0 "k\303\274"\0 [UTF8 "k\x{fc}"]   # <-- correct UTF8 U+00FC is codepoint for "ü"
        CUR = 3
        LEN = 16
      DB<15> binmode DB::OUT,':utf8';     # <-- fix encoding layer
      DB<16> p $dec
      kü
      DB<17>

      Unicode    SYMBOL  UTF-8  UTF-8    NAME
      Codepoint          hex    oct
      U+00FC     ü       c3 bc  303 274  LATIN SMALL LETTER U WITH DIAERESIS

      *)

      DB<19> printf "%X ", $_ for  0303, 0274                                                                               
        C3 BC                                                                                                                     
      DB<20> printf "%X ", oct($_) for  qw/303 274/                                                                         
        C3 BC                                                                                                                              
      

      Cheers Rolf

Re^2: Lost in encodings
by Skeeve (Parson) on Feb 10, 2020 at 07:48 UTC

    Hi haj

    And thanks once more for your long and helpful reply. I have now tested a bit more, adopting your tip to use decoded_content.

    So when I now look at what is read by LWP, I really get the correct umlaut, which I can also see when I set binmode on the debugger's IO.

    The problem lies in the output of MIME::Lite::TT::HTML it seems. Looking at the code, it seems one can provide input and output charset. When you don't, MIME::Lite::TT::HTML assumes you already provide the correct charset :( So what I would need to do is provide the Charset of the internal perl strings - which doesn't exist I assume. I think I'll have to patch MIME::Lite::TT::HTML…

    As you wrote:

    Now when you write the data, you need to encode it to UTF-8. I suppose (but didn't test right now) that MIME::Lite::TT::HTML does the right thing and encodes for you if you provide the Charset attribute on the constructor. =FC is QP-encoding for an ISO-8859-1 'ü' and indeed wrong here. So if you did provide Charset => 'utf8', then shout up, I'll write some tests.

    So here is my shout out. ;)

    I assume the relevant part which needs to be patched is this https://metacpan.org/release/MIME-Lite-TT-HTML/source/lib/MIME/Lite/TT/HTML.pm Line 115-117:

    $charset = [ $charset ] unless ref $charset eq 'ARRAY';
    my $charset_input  = shift @$charset || 'US-ASCII';
    my $charset_output = shift @$charset || $charset_input;

    Here I would provide "something" for the internal perl encoding. Maybe '*internal*'?

    Starting at line 156, the code looks dubious. "remove_utf8_flag" does not seem correct after what I learned from you and others in my threads.

    And then the from_to encoding should be changed, I guess, to:

    if ($charset_input ne $charset_output) {
        my $perl_string = $charset_input eq '*internal*'
            ? $string
            : Encode::decode($charset_input, $string);
        $string = Encode::encode($charset_output, $perl_string);
    }

    What do you think?

    Update: I've created a patch which allows one to tell MIME::Lite::TT::HTML that the text provided ($charset_input) is in internal perl representation. With this in place, my script works as expected.

    Unfortunately it seems the module is abandoned as the issues opened for it are 12 years old :(



      That's a good job in tracking that down to the root cause!

      When I wrote my previous response, I failed to check the version history of MIME::Lite::TT::HTML. Otherwise I would not have made the assumption that the module does the right thing. It does not, as you found out. The current release is from 2007 (Perl 5.10-ish), so Unicode support was not only rather new and sometimes bumpy in Perl, but also module authors didn't have much experience with it, nor did all CPAN modules support it.

      After having looked into the module's source code: The module works with all input in byte-encoded form. Today this is considered bad practice since it breaks a lot of Perl's string processing features, including those available from Template Toolkit. The module also assumes that the subject is encoded in the same encoding as the template files, which is even more questionable. So yes, patching (or subclassing) the module's methods encode_subject and encode_body would be the way to go. Filing an issue for the module would also be fine, but according to the current list of open issues it doesn't look like the author is still actively maintaining the module.

      There is no keyword for Perl's internal encoding (because, by definition, these strings are decoded). So you could either invent one like *internal* or even use an undefined value as an indicator that your input should not be decoded. Your fix should do the trick if you want to go that path.
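The undefined-value variant could look roughly like this. It is a standalone sketch, not the module's code: recode_for_output is a hypothetical helper name, and how it would be wired into encode_subject or encode_body is left open.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: an undefined input charset means "the string
# is already a decoded Perl string, do not decode it again".
sub recode_for_output {
    my ($string, $charset_input, $charset_output) = @_;
    my $perl_string = defined $charset_input
        ? Encode::decode($charset_input, $string)
        : $string;
    return Encode::encode($charset_output, $perl_string);
}

# Decoded input (for example from decoded_content): no input charset
my $from_chars = recode_for_output("K\x{FC}", undef, 'UTF-8');

# Byte input in a known encoding: decode first, then encode
my $from_bytes = recode_for_output("K\xFC", 'ISO-8859-1', 'UTF-8');

print unpack('H*', $from_chars), "\n";
print unpack('H*', $from_bytes), "\n";
```

Both calls produce the same UTF-8 bytes, which is the point: the output encoding no longer depends on where the input came from.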

      remove_utf8_flag is indeed scary and another example of an attempt to achieve cancellation of errors. I am pretty sure that TT processing could result in this flag being set, even if the TT results are pure ASCII. Instead of re-evaluating his assumptions, the author just killed the flag to make the string fit his expectations. With current Perl you wouldn't get rid of the flag like that, and Encode::decode will happily decode strings which already have the flag set.

      Another alternative, with more coding but better alignment with current practice, would be to get rid of $charset_input and expect that the subject and the template parameters are Perl strings. You'd still need TT's ENCODING config because UTF-8 text in files needs decoding, and $charset_output is also still required because MIME::Lite explicitly says that it expects encoded strings.
