Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

I was sort of expecting this to come as a follow-up to your previous article but didn't want to overcomplicate things :)

One of the issues with encoding is that it happens in so many places that quite often things look right while you actually have a cancellation of errors. Your example is no exception. So here are some points:

  • Perl's default text encoding is ISO-8859-1 which is a 1-byte encoding.
  • Contemporary terminals work with UTF-8 encoding. This also includes the terminal you are using in your debugging session.
  • The infamous UTF-8 flag and the is_utf8 function come with a warning:
    CAVEAT: If STRING has UTF8 flag set, it does NOT mean that STRING is UTF-8 encoded and vice-versa.
    In many cases the function tells you what you already know (that the data doesn't look like you expect them), in some cases it is just misleading.

So, what's happening here? You read data with LWP. Though you haven't given the details, I guess you are using the content method to retrieve the data. This method always gives bytes. But wait: LWP can use the charset attribute from the Content-Type header to decode text into characters, and indeed it will do so if you use the method decoded_content method instead.

The data is displayed correctly because you are feeding non-decoded bytes to a terminal which expects UTF-8-encoded bytes. Since your input was also UTF-8-encoded bytes, it looks fine. Your application is just a man in the middle which passes these data through.

Decoding the data is the correct way (which, as I wrote, LWP can do for you if you want). Perl then knows that the character in question is a ''. Perl can handle this character in its default encoding, which is slightly infortunate, because it will do so and print one Byte for that character. This character hits a terminal which expects UTF-8 encoded bytes, doesn't understand the character and substitutes it with the Unicode replacement character.

Now when you write the data, you need to encode it to UTF-8. I suppose (but didn't test right now) that MIME::Lite::TT::HTML does the right thing and encodes for you if you provide the Charset attribute on the constructor. =FC is QP-encoding for an ISO-8859-1 '' and indeed wrong here. So if you did provide Charset     => 'utf8', then shout up, I'll write some tests.

As for handling the debugger: Since you are working with an UTF-8 terminal, you might want to try the following:

binmode DB::OUT,':utf8'; binmode DB::IN,':utf8'

This makes the debugger handle its I/O as UTF-8 encoded.

....and, because I just read the reply by LanX, I recommend against Data::Peek. It will tell you only what you already know ("that's not right") but not give guidance how to fix.


In reply to Re: Lost in encodings by haj
in thread Lost in encodings by Skeeve

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (6)
As of 2021-11-29 15:46 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?