Re: Lost in encodings
by haj (Priest) on Feb 07, 2020 at 20:29 UTC
I was sort of expecting this to come as a follow-up to your previous article but didn't want to overcomplicate things :)
One of the issues with encoding is that it happens in so many places that quite often things look right while you actually have a cancellation of errors. Your example is no exception. So here are some points:
So, what's happening here? You read data with LWP. Though you haven't given the details, I guess you are using the content method to retrieve the data. This method always gives bytes. But wait: LWP can use the charset attribute from the Content-Type header to decode text into characters, and indeed it will do so if you use the decoded_content method instead.
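The difference between the two methods can be seen without any network access by building a response by hand. This is only a minimal sketch, assuming HTTP::Response (from HTTP::Message, which LWP depends on) is installed; the header and the two-byte body are made up for illustration and are not taken from your program.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical response: the body is the two UTF-8 bytes for 'ü' (U+00FC).
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/plain; charset=UTF-8' ],
    "\xC3\xBC",
);

my $bytes = $res->content;          # raw bytes, exactly as received
my $chars = $res->decoded_content;  # decoded according to the charset attribute

printf "content:         length %d\n", length $bytes;
printf "decoded_content: length %d, code point U+%04X\n",
    length $chars, ord $chars;
```

The byte string has length 2, the decoded string has length 1: same data, but only the second one is something Perl can treat as text.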
The data is displayed correctly because you are feeding non-decoded bytes to a terminal which expects UTF-8-encoded bytes. Since your input was also UTF-8-encoded bytes, it looks fine. Your application is just a man in the middle which passes these data through.
Decoding the data is the correct way (which, as I wrote, LWP can do for you if you want). Perl then knows that the character in question is a 'ü'. Perl can handle this character in its default encoding (ISO-8859-1), which is slightly unfortunate, because it will do so and print one byte (0xFC) for that character. That byte hits a terminal which expects UTF-8-encoded bytes. A lone 0xFC is not valid UTF-8, so the terminal doesn't understand it and substitutes the Unicode replacement character.
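To see which bytes actually leave the program in each case, you can print to an in-memory filehandle instead of the terminal. The following is a rough sketch using only core modules; the helper name written_bytes is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two UTF-8 bytes for 'ü', as they would arrive from the network.
my $bytes = "\xC3\xBC";
my $char  = decode('UTF-8', $bytes);   # one character, U+00FC

# Hypothetical helper: print $text through a handle opened with the given
# layer and return the bytes that were written, as hex.
sub written_bytes {
    my ($layer, $text) = @_;
    my $buf = '';
    open my $fh, ">$layer", \$buf or die "open: $!";
    print {$fh} $text;
    close $fh;
    return join ' ', map { sprintf '%02X', ord } split //, $buf;
}

print "bytes passed through:        ", written_bytes(':raw', $bytes), "\n";
print "decoded, no output encoding: ", written_bytes(':raw', $char), "\n";
print "decoded, encoded on output:  ", written_bytes(':encoding(UTF-8)', $char), "\n";
```

The first and third lines show C3 BC, which a UTF-8 terminal renders as 'ü'. The second line shows a single FC, which is the byte the terminal chokes on.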
Now when you write the data, you need to encode it to UTF-8. I suppose (but didn't test right now) that MIME::Lite::TT::HTML does the right thing and encodes for you if you provide the Charset attribute on the constructor. =FC is the QP encoding of an ISO-8859-1 'ü' and indeed wrong here. So if you did provide Charset => 'utf8', then speak up and I'll write some tests.
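When checking the mail source, it helps to know what 'ü' looks like after quoted-printable in each encoding. Here is a small sketch using only the core modules Encode and MIME::QuotedPrint. It doesn't involve MIME::Lite::TT::HTML at all, so it shows what to look for, not what that module does.

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::QuotedPrint qw(encode_qp);

my $char = "\x{FC}";   # the character 'ü'

# The empty second argument ($eol) suppresses the soft line break that
# encode_qp would otherwise append to a string without a trailing newline.
my $qp_latin1 = encode_qp(encode('ISO-8859-1', $char), '');
my $qp_utf8   = encode_qp(encode('UTF-8',      $char), '');

print "ISO-8859-1: $qp_latin1\n";   # =FC
print "UTF-8:      $qp_utf8\n";     # =C3=BC
```

So in a mail declared as UTF-8 you want to see =C3=BC in the body, not =FC.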
As for handling the debugger: since you are working with a UTF-8 terminal, you might want to try the following:
binmode DB::OUT,':utf8'; binmode DB::IN,':utf8'
This makes the debugger handle its I/O as UTF-8 encoded.
...and, because I just read the reply by LanX, I recommend against Data::Peek. It will tell you only what you already know ("that's not right") but not give guidance on how to fix it.