in reply to Re^2: How to sanely handle unicode in perl?
in thread How to sanely handle unicode in perl?

See point 14 under "Assume Brokenness" in the link I gave: “Code that assumes Unicode gives a fig about POSIX locales is broken.”

Re^4: How to sanely handle unicode in perl?
by Sec (Monk) on Mar 20, 2015 at 16:56 UTC
    I do not assume unicode. I just want to handle data correctly. perl is apparently unable to output data in the way its environment requires it to.

    The frustrating part is that perl looks like it is equipped to work. It is _able_ to do output conversion on the fly. It is just not able to do it correctly without user intervention.

      \xc3\xb6 is not the right byte sequence for an ö from a Latin-1 terminal; it is the UTF-8 encoding, meaning it can only come from a UTF-8 encoded source (and still mean ö). If it were coming from a Latin-1 source it would be \xf6. So what you are asking to do sanely strikes me as…strange. To do encoding properly you have to know what you are receiving and decode it with that, then know what your output layer is and encode to that. It’s not easy but it’s not magical either. Without the right steps at the right layers it’s literally guesswork and impossible to do robustly.
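
      The decode-then-encode steps described above can be sketched roughly as follows. This is only an illustration using the core Encode module; the variable names are made up, and the choice of UTF-8 for input and Latin-1 for output is an assumption taken from the example bytes in this thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they would arrive from a UTF-8 source: "ö" is the two bytes C3 B6.
my $utf8_bytes = "\xc3\xb6";

# Step 1: decode with the encoding the input is known to use.
my $chars = decode('UTF-8', $utf8_bytes);

# Step 2: encode to what the output side expects (Latin-1 is assumed here).
my $latin1_bytes = encode('ISO-8859-1', $chars);

printf "characters: %d, code point: U+%04X\n", length($chars), ord($chars);
printf "latin-1 bytes: %s\n",
    join ' ', map { sprintf '%02x', ord } split //, $latin1_bytes;
```

      After decoding there is one character, U+00F6, and encoding it to Latin-1 gives the single byte f6.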

        Please check the source. I explicitly state that the pipe that produces \xc3\xb6 is utf-8. So what you wrote does not apply to my code.

        In fact choroba found out that it works as intended if I prepend ":raw" to the encoding. (Which is unintuitive to me, but kind of makes sense in retrospect.)
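
        A minimal sketch of what prepending ":raw" might look like. The original code is not shown in this thread, so the handle, the pre-existing UTF-8 layer, and the Latin-1 target are all assumptions for illustration. An in-memory handle stands in for STDOUT so the resulting bytes can be inspected.

```perl
use strict;
use warnings;

# Hypothetical setup: a handle that already carries a UTF-8 encoding layer,
# standing in for an output handle that was configured elsewhere.
my $buf = '';
open(my $out, '>:encoding(UTF-8)', \$buf) or die "open: $!";

# ":raw" first drops the translating layers already on the handle; the
# wanted output encoding is then pushed on its own (Latin-1 assumed).
binmode($out, ':raw:encoding(ISO-8859-1)') or die "binmode: $!";

print {$out} "\x{f6}";            # the character ö
close($out) or die "close: $!";

printf "bytes written: %s\n",
    join ' ', map { sprintf '%02x', ord } split //, $buf;
```

        Without the ":raw", the new encoding layer would be stacked on top of whatever layers were already there, which is presumably why the plain form did not behave as expected.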

      “I do not assume unicode.”
      I think you misparsed that sentence:
      “Code that assumes Unicode gives a fig about POSIX locales is broken.”
      This is not
      (Code that assumes Unicode) gives a fig about POSIX locales is broken.
      but
      Code that assumes (Unicode gives a fig about POSIX locales) is broken.
      Update: perhaps I should point out that we seem to share the same native language.