http://www.perlmonks.org?node_id=11112581

Skeeve has asked for the wisdom of the Perl Monks concerning the following question:

I'm completely lost in encodings :(

I'm reading data using LWP. When the data contains umlauts, things start to get weird and I get lost.

As the whole script is already very complex, let me try to give examples first before I start thinking about an example script.

For the examples I use the Perl debugger.

After reading my data one of the strings looks like this:

  DB<11> x $str
0  'Künzler'
  DB<12> x Encode::is_utf8($str)
0  ''
  DB<13> x length($str)
0  8
  DB<14> x substr($str,0,3)
0  'Kü'

So I think my issue already starts here as the data I read is displayed correctly but perl treats it as bytes.

I do not know what to do with that string so that perl handles it correctly.

  DB<20> x Encode::decode('utf8', $str)
0  'K?nzler'

(What was actually displayed was the question mark in a square.) That seemed wrong, but when I tested by reading from a UTF-8 file opened with '<:utf8', I got the same result, so apparently it is correct that way.

So as a test I changed my string here by decoding it as utf8.

In the next step, the string is handed to MIME::Lite::TT::HTML and from there to Text::Table. Finally it's sent to me by mail.

When I look at the mail's source, the umlaut is represented (in quoted-printable) by '=FC' and, unfortunately, displayed as a question mark in a square :(

I know I should have some sample code, and I will try to write some, but I hoped that in the meantime someone here with more experience already has a hint for me, where to debug further or how to fix it.

As far as I know, Text::Table requires perl strings to properly format tables. But shouldn't my string be a perl string when decoded?

Many thanks in advance.


s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

Replies are listed 'Best First'.
Re: Lost in encodings
by haj (Curate) on Feb 07, 2020 at 20:29 UTC

    I was sort of expecting this to come as a follow-up to your previous article but didn't want to overcomplicate things :)

    One of the issues with encoding is that it happens in so many places that quite often things look right while you actually have a cancellation of errors. Your example is no exception. So here are some points:

    • Perl's default text encoding is ISO-8859-1 which is a 1-byte encoding.
    • Contemporary terminals work with UTF-8 encoding. This also includes the terminal you are using in your debugging session.
    • The infamous UTF-8 flag and the is_utf8 function come with a warning:
      CAVEAT: If STRING has UTF8 flag set, it does NOT mean that STRING is UTF-8 encoded and vice-versa.
      In many cases the function tells you what you already know (that the data doesn't look the way you expect); in some cases it is just misleading.

    So, what's happening here? You read data with LWP. Though you haven't given the details, I guess you are using the content method to retrieve the data. This method always gives bytes. But wait: LWP can use the charset attribute from the Content-Type header to decode text into characters, and indeed it will do so if you use the decoded_content method instead.
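
    The difference between the two methods can be sketched without a network round trip. This is only an illustration: the response below is built by hand with HTTP::Response (part of the same family of modules LWP uses), and the body and header values are assumptions standing in for whatever the real server sends.

    ```perl
    use strict;
    use warnings;
    use HTTP::Response;

    # Hand-built response standing in for what LWP would return; the body is
    # "Künzler" as UTF-8 bytes and the header declares that charset.
    my $res = HTTP::Response->new(
        200, 'OK',
        [ 'Content-Type' => 'text/plain; charset=UTF-8' ],
        "K\xC3\xBCnzler",
    );

    my $bytes = $res->content;           # raw octets, exactly as received
    my $chars = $res->decoded_content;   # decoded according to the charset

    printf "content:         length %d\n", length $bytes;
    printf "decoded_content: length %d\n", length $chars;
    ```

    The first length counts bytes, the second counts characters, which is why substr and friends only behave as expected on the decoded string.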

    The data is displayed correctly because you are feeding non-decoded bytes to a terminal which expects UTF-8-encoded bytes. Since your input was also UTF-8-encoded bytes, it looks fine. Your application is just a man in the middle which passes these data through.

    Decoding the data is the correct way (which, as I wrote, LWP can do for you if you want). Perl then knows that the character in question is a 'ü'. Perl can handle this character in its default encoding, which is slightly unfortunate, because it will do so and print one byte for that character. This byte hits a terminal which expects UTF-8 encoded bytes, doesn't understand it and substitutes the Unicode replacement character.

    Now when you write the data, you need to encode it to UTF-8. I suppose (but didn't test right now) that MIME::Lite::TT::HTML does the right thing and encodes for you if you provide the Charset attribute on the constructor. =FC is QP-encoding for an ISO-8859-1 'ü' and indeed wrong here. So if you did provide Charset => 'utf8', then shout up, I'll write some tests.
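
    To see where =FC comes from, here is a small sketch using only the core modules Encode and MIME::QuotedPrint. It does not involve MIME::Lite::TT::HTML at all; it merely shows what quoted-printable makes of the same character under the two encodings.

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode);
    use MIME::QuotedPrint qw(encode_qp);

    my $chars = "K\x{FC}nzler";                  # decoded Perl string, 7 characters

    my $latin1 = encode('ISO-8859-1', $chars);   # one byte for the umlaut
    my $utf8   = encode('UTF-8',      $chars);   # two bytes for the umlaut

    # An empty $eol argument suppresses the trailing soft line break.
    my $qp_latin1 = encode_qp($latin1, '');
    my $qp_utf8   = encode_qp($utf8,   '');

    print "ISO-8859-1: $qp_latin1\n";
    print "UTF-8:      $qp_utf8\n";
    ```

    A mail that declares UTF-8 should carry =C3=BC for the umlaut; a lone =FC means the text was written out as ISO-8859-1.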

    As for handling the debugger: Since you are working with a UTF-8 terminal, you might want to try the following:

    binmode DB::OUT,':utf8'; binmode DB::IN,':utf8'

    This makes the debugger handle its I/O as UTF-8 encoded.
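
    What such a layer changes can be sketched outside the debugger with in-memory filehandles. The variable names are made up for the illustration; the point is only to compare the bytes that leave Perl with and without an encoding layer.

    ```perl
    use strict;
    use warnings;

    my $chars = "k\x{FC}";   # decoded string: "k" plus U+00FC

    # Without a layer, Perl writes the character as one ISO-8859-1 byte.
    open my $raw, '>', \my $raw_out or die "open: $!";
    print {$raw} $chars;
    close $raw;

    # With an encoding layer, the same string is written as UTF-8.
    open my $enc, '>:encoding(UTF-8)', \my $enc_out or die "open: $!";
    print {$enc} $chars;
    close $enc;

    printf "no layer:  %s\n", unpack 'H*', $raw_out;
    printf "UTF-8:     %s\n", unpack 'H*', $enc_out;
    ```

    In a script, binmode STDOUT, ':encoding(UTF-8)' plays the same role for the terminal.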

    ....and, because I just read the reply by LanX, I recommend against Data::Peek. It will tell you only what you already know ("that's not right") but not give guidance how to fix.

      > ... I recommend against Data::Peek. It will tell you only what you already know ("that's not right") but not give guidance how to fix.

      It's true Devel::Peek gives no guidance why, but which non-AI command does?

      And how can you know his console is using utf-8? Could be Windows and CP850.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

      Update

      *) Those commands will show me the hex codes in ASCII, which is correctly displayed by every terminal (plus the monastery)

        > And how can you know his console is using utf-8? Could be Windows and CP850

        I wouldn't claim I know. But since 'Kü' has length 3 but displays as 'Kü', I just guessed that a multibyte encoding is in place. CP850 is a 1-byte encoding and should behave differently.

        As for Devel::Peek: Those commands will show me the hex codes in ASCII

        Devel::Peek will also issue several lines of data which are totally useless unless you're debugging XS code or Perl itself. A decent print unpack 'H*',$data does the same with less fuss.
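
        A minimal sketch of that kind of inspection, with made-up variable names: unpack 'H*' for byte strings, and a list of code points for decoded strings, so the output is plain ASCII on any terminal.

        ```perl
        use strict;
        use warnings;
        use Encode qw(decode);

        my $raw   = "k\xC3\xBC";             # UTF-8 bytes, as read from the wire
        my $chars = decode('UTF-8', $raw);   # decoded: two characters

        # Bytes as hex digits.
        my $hex = unpack 'H*', $raw;

        # Characters as Unicode code points.
        my $codepoints = join ' ', map { sprintf 'U+%04X', ord } split //, $chars;

        print "bytes:      $hex\n";
        print "characters: $codepoints\n";
        ```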

      Thanks a lot haj. It sounds all good (and complicated). Will need some time to work it out.

      BTW: You're right. My Terminal (iTerm2) is UTF-8. The OS is MacOS.


        I can only speculate as long as you don't show us a Dump of $str.

        > My Terminal (iTerm2) is UTF-8. The OS is MacOS.

        I think if the encoding of the output channel is byte oriented, this would explain your false negative results.

        IOW your decoding is right but the test is wrong.

        update

        Yep, tested on my Ubuntu VM, with utf8 console

        DB<2> use Devel::Peek DB<4> use Encode qw(decode encode) DB<10> $str="k" DB<11> Dump $str SV = PV(0x28054a0) at 0x280a370 REFCNT = 1 FLAGS = (POK,pPOK) PV = 0x2838a60 "k\303\274"\0 # <-- 303 274 is octa +l for UTF-8 encoding of "" * CUR = 3 LEN = 16 DB<12> p $str k DB<13> p $dec = decode("utf8",$str,Encode::FB_WARN) k DB<14> Dump $dec SV = PVMG(0x28e41f0) at 0x2a19368 REFCNT = 1 FLAGS = (POK,pPOK,UTF8) IV = 0 NV = 0 PV = 0x28fe5a0 "k\303\274"\0 [UTF8 "k\x{fc}"] # <-- correct UTF8 U+ +00FC is codepoint for "" CUR = 3 LEN = 16 DB<15> binmode DB::OUT,':utf8'; # <-- fix encoding la +yer DB<16> p $dec k DB<17>

        Unicode    SYMBOL  UTF-8  UTF-8    NAME
        Codepoint          hex    oct
        U+00FC     ü       c3 bc  303 274  LATIN SMALL LETTER U WITH DIAERESIS

        *)

        DB<19> printf "%X ", $_ for  0303, 0274
          C3 BC
        DB<20> printf "%X ", oct($_) for  qw/303 274/
          C3 BC
        

        Cheers Rolf

      Hi haj

      And thanks once more for your long and helpful reply. I have now tested a bit more, adopting your tip to use decoded_content.

      So when I look now what is read by LWP I really get the correct Umlaut which I also can see when I set binmode on the debugger's IO.

      The problem lies in the output of MIME::Lite::TT::HTML it seems. Looking at the code, it seems one can provide input and output charset. When you don't, MIME::Lite::TT::HTML assumes you already provide the correct charset :( So what I would need to do is provide the Charset of the internal perl strings - which doesn't exist I assume. I think I'll have to patch MIME::Lite::TT::HTML

      As you wrote:

      > Now when you write the data, you need to encode it to UTF-8. I suppose (but didn't test right now) that MIME::Lite::TT::HTML does the right thing and encodes for you if you provide the Charset attribute on the constructor. =FC is QP-encoding for an ISO-8859-1 'ü' and indeed wrong here. So if you did provide Charset => 'utf8', then shout up, I'll write some tests.

      So here is my shout out. ;)

      I assume the relevant part which needs to be patched is this https://metacpan.org/release/MIME-Lite-TT-HTML/source/lib/MIME/Lite/TT/HTML.pm Line 115-117:

      $charset = [ $charset ] unless ref $charset eq 'ARRAY';
      my $charset_input  = shift @$charset || 'US-ASCII';
      my $charset_output = shift @$charset || $charset_input;

      Here I would provide "something" for the internal perl encoding. Maybe '*internal*'?

      Starting at line 156, the code looks dubious. "remove_utf8_flag" does not seem correct after what I learned from you and others in my threads.

      And then the from_to encoding should be changed, I guess, to:

      if ($charset_input ne $charset_output) {
          my $perl_string = $charset_input eq '*internal*'
              ? $string
              : Encode::decode($charset_input, $string);
          $string = Encode::encode($charset_output, $perl_string);
      }
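
      Pulled out of the module, the same idea can be tried on its own. This is just a sketch: convert_charset is a hypothetical helper name, not something MIME::Lite::TT::HTML provides, and '*internal*' is the marker proposed above.

      ```perl
      use strict;
      use warnings;
      use Encode ();

      # Hypothetical helper, not part of MIME::Lite::TT::HTML: convert $string
      # from $charset_input to $charset_output, where '*internal*' marks
      # input that is already a decoded Perl string.
      sub convert_charset {
          my ($string, $charset_input, $charset_output) = @_;
          return $string if $charset_input eq $charset_output;

          my $perl_string = $charset_input eq '*internal*'
              ? $string
              : Encode::decode($charset_input, $string);
          return Encode::encode($charset_output, $perl_string);
      }

      my $from_internal = convert_charset("K\x{FC}nzler", '*internal*', 'UTF-8');
      my $from_latin1   = convert_charset("K\xFCnzler",   'ISO-8859-1', 'UTF-8');

      print unpack('H*', $from_internal), "\n";
      print unpack('H*', $from_latin1),   "\n";
      ```

      Both calls should end up with the same UTF-8 bytes, whether the input was a decoded string or ISO-8859-1 octets.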

      What do you think?

      Update: I've created a patch which allows one to tell MIME::Lite::TT::HTML that the text provided ($charset_input) is in Perl's internal representation. With this in place, my script works as expected.

      Unfortunately it seems the module is abandoned as the issues opened for it are 12 years old :(



        That's a good job in tracking that down to the root cause!

        When I wrote my previous response, I failed to check the version history of MIME::Lite::TT::HTML. Otherwise I would not have made the assumption that the module does the right thing. It does not, as you found out. The current release is from 2007 (Perl 5.10-ish), a time when Unicode support was still rather new and sometimes bumpy in Perl, module authors didn't have much experience with it, and not all CPAN modules supported it.

        After having looked into the module's source code: The module works with all input in byte-encoded form. Today this is considered bad practice since it breaks a lot of Perl's string processing features, including those available from Template Toolkit. The module also assumes that the subject is encoded in the same encoding as the template files, which is even more questionable. So yes, patching (or subclassing) the module's methods encode_subject and encode_body would be the way to go. Filing an issue for the module would also be fine, but according to the current list of open issues it doesn't look like the author is still actively maintaining the module.

        There is no keyword for Perl's internal encoding (because, by definition, these strings are decoded). So you could either invent one like *internal* or even use an undefined value as an indicator that your input should not be decoded. Your fix should do the trick if you want to go that path.

        remove_utf8_flag is indeed scary and another example of an attempt to achieve cancellation of errors. I am pretty sure that TT processing could result in this flag being set, even if the TT results are pure ASCII. Instead of re-evaluating his assumptions, the author just killed the flag to make the string fit his expectations. With current Perl you wouldn't get rid of the flag like that, and Encode::decode will happily decode strings which already have the flag set.

        Another alternative, with more coding but better alignment with current practice, would be to get rid of $charset_input and expect that the subject and the template parameters are Perl strings. You'd still need TT's ENCODING config because UTF-8 text in files needs decoding, and $charset_output is also still required because MIME::Lite explicitly says that it expects encoded strings.

Re: Lost in encodings
by LanX (Cardinal) on Feb 07, 2020 at 20:16 UTC
    some hints:

    • Please use Data::Dump and/or Devel::Peek to check $str before and after.
    • I don't know in which context you are running your debugger, but how a UTF-8 character is displayed depends on the encoding of the console/editor.
    • Try decode('UTF-8', $str, Encode::FB_WARN); to see if $str is really UTF-8 (update)
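
    The last hint can be turned into a small check. This is a sketch with a hypothetical helper name (is_valid_utf8); it uses FB_CROAK instead of FB_WARN so the result can be tested in code rather than read from a warning.

    ```perl
    use strict;
    use warnings;
    use Encode ();

    # Hypothetical helper: true if $octets is well-formed UTF-8.
    sub is_valid_utf8 {
        my ($octets) = @_;
        # FB_CROAK dies on malformed input; LEAVE_SRC keeps $octets unmodified.
        my $ok = eval {
            Encode::decode('UTF-8', $octets,
                Encode::FB_CROAK() | Encode::LEAVE_SRC());
            1;
        };
        return $ok ? 1 : 0;
    }

    print is_valid_utf8("k\xC3\xBC") ? "valid\n" : "invalid\n";   # UTF-8 bytes for "kü"
    print is_valid_utf8("k\xFC")     ? "valid\n" : "invalid\n";   # ISO-8859-1 byte for "ü"
    ```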

    Cheers Rolf

Re: Lost in encodings
by LanX (Cardinal) on Feb 07, 2020 at 19:46 UTC
    update: sorry after rereading the OP I doubt the following is relevant in this case

    I can't test now, but I just tutored my colleagues in unicode+perl and the debugger reacted strangely because of its pre-settings and the win-console.

    (Maybe also dependent on the OS)

    My recommendation is to better test without debugger.

    Cheers Rolf

Re: Lost in encodings
by misc (Friar) on Feb 10, 2020 at 06:35 UTC
    I'm not sure what your problems are.

    However, a project of mine could probably be of some help.
    I modified st (the suckless terminal) to work with (only) the extended ASCII code.
    It's somewhat of a joke, but suddenly German umlauts work like a charm in vi, the shell, and so on (using CP1252).

    https://github.com/michael105/st-asc

    (snip)
    Stripped unicode support in favour of the 256 chars (extended) ASCII table
        utf8 is an optional compiletime switch now
        (Most programs suddenly handle German umlauts etc. out of the box, using the ASCII table only,
        e.g. bash, vi, ... which is an interesting result. st has quite good unicode handling,
        but until now I always needed to dive into the configuration for
        entering chars like ä,ö,ü in unicode mode)
    
        Besides, instead of having a history buffer, which needs 15 Bytes per Glyph 
        (a Glyph is a char on the screen with text attributes and colors)
        - now each Glyph is 4 Bytes. What can be nicely optimized.
        I like having a responsive and resource saving terminal, 
        and I'm always keeping at least 10 terminals open at my development system,
        so that sums up.