in reply to Re^3: How to sanely handle unicode in perl?
in thread How to sanely handle unicode in perl?

I do not assume Unicode. I just want to handle data correctly. Perl is apparently unable to output data the way its environment requires.

The frustrating part is that perl looks like it is equipped to work. It is _able_ to do output conversion on the fly. It is just not able to do it correctly without user intervention.

Re^5: How to sanely handle unicode in perl?
by Your Mother (Archbishop) on Mar 20, 2015 at 19:10 UTC

    \xc3\xb6 is not the right byte sequence for an ö from a Latin-1 terminal; it is the UTF-8 encoding. That means it can only come from a UTF-8 encoded source (and still mean ö). So what you are asking to do sanely strikes me as… strange. If it came from a Latin-1 encoded source it would be \xf6. To do encoding properly you have to know what you are receiving and decode it accordingly, then know what your output layer expects and encode to that. It’s not easy but it’s not magical either. Without the right steps at the right layers it’s literally guesswork, and impossible to do robustly.
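
    A minimal sketch of the decode-then-encode steps described above, assuming the core Encode module, an input that really is UTF-8, and an output layer that really expects ISO-8859-1 (Latin-1). The variable names are illustrative only and not taken from the code under discussion.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they would arrive from a UTF-8 source: the two-byte encoding of "ö".
my $utf8_bytes = "\xc3\xb6";

# Step 1: decode using the encoding the input is known to be in.
# The result is a character string holding the single character U+00F6.
my $text = decode('UTF-8', $utf8_bytes);

# Step 2: encode for the output layer that is known to be in use.
# For Latin-1 this yields the single byte \xf6.
my $latin1_bytes = encode('ISO-8859-1', $text);

printf "characters: %d, code point: U+%04X\n", length($text), ord($text);
printf "latin-1 bytes: %s\n",
    join ' ', map { sprintf '\\x%02x', ord($_) } split //, $latin1_bytes;
```

    Both encodings are named explicitly here; nothing is guessed from the environment, which is the point being made.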

      Please check the source. I explicitly state that the pipe that produces \xc3\xb6 is UTF-8. So what you wrote does not apply to my code.

      In fact, choroba found out that it works as intended if I prepend ":raw" to the encoding (which is unintuitive to me, but kind of makes sense in retrospect).

        Maybe you misunderstand my point. If you run that code in a Latin-1 terminal you are sending UTF-8 and expecting it to act properly. It makes no sense and can’t work without goofy and unrealistic hoops.

Re^5: How to sanely handle unicode in perl?
by soonix (Canon) on Mar 21, 2015 at 22:34 UTC
    I do not assume unicode.
    I think you misparsed that sentence:
    “Code that assumes Unicode gives a fig about POSIX locales is broken.”
    This is not
    (Code that assumes Unicode) gives a fig about POSIX locales is broken.
    but
    Code that assumes (Unicode gives a fig about POSIX locales) is broken.
    Update: perhaps I should point out that we seem to share the same native language.