Re^2: Pragma to handle unicode characters

How simple do you want encoding/decoding? Would you like Perl to "automagically" encode/decode JSON? ASN.1? Why, specifically, do you demand it of UTF-8?

The truth is that UTF-8 is a variable-length character encoding method. It's probably a good thing that you have to explicitly decode inputs and encode outputs. It forces you to know what you are doing.
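
For illustration, a minimal sketch of what "explicitly decode inputs and encode outputs" looks like with the core Encode module. The sample string and the variable names are made up for this example; only `decode` and `encode` come from Encode itself.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they might arrive from a file or socket: "café" in UTF-8.
# (Sample data, chosen for this sketch.)
my $bytes = "caf\xC3\xA9";

# Decode at the input boundary: octets become a character string.
my $chars = decode('UTF-8', $bytes);

# The two strings differ in length because "é" takes two octets in UTF-8.
printf "octets: %d, characters: %d\n", length($bytes), length($chars);

# Encode at the output boundary: characters become octets again.
my $out = encode('UTF-8', $chars);
print "round trip ok\n" if $out eq $bytes;
```

The point of the sketch is only that the program states, at each boundary, which encoding it assumes.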

Update: an example: e-mail. Yes, you can send e-mails as UTF-8! But were you aware that MIME headers must be in a 7-bit encoding? In this case, blindly opening a socket and telling it to encode all output as UTF-8 will severely break your application. It is much better to know specifically when and where encoding is appropriate and permissible.
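
A rough sketch of the e-mail case, assuming the message is assembled by hand: the header goes through Encode's `MIME-Header` encoding (RFC 2047 encoded-words, which stay within 7-bit ASCII), while the body is encoded as UTF-8. The subject, body text, and variable names are hypothetical, and a real application would more likely use a mail module than print headers itself.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical message parts, held as character strings inside the program.
my $subject = "Gr\x{FC}\x{DF}e";
my $body    = "Sch\x{F6}ne Gr\x{FC}\x{DF}e!\n";

# Header: encoded-word form, safe for a 7-bit header field.
my $subject_octets = encode('MIME-Header', $subject);

# Body: UTF-8 octets, announced through the Content-Type header.
my $body_octets = encode('UTF-8', $body);

print "Subject: $subject_octets\n";
print "Content-Type: text/plain; charset=UTF-8\n";
print "Content-Transfer-Encoding: 8bit\n\n";
print $body_octets;
```

Two different encodings are applied to two parts of the same output stream, which is why a single UTF-8 layer on the socket would not be enough here.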

Re^3: Pragma to handle unicode characters
by almut (Canon) on Dec 22, 2008 at 06:01 UTC

Would you like Perl to "automagically" encode/decode JSON? ASN.1? Why, specifically, do you demand it of UTF-8?

It's a matter of convenience, primarily — and in some cases, transparency (such as having a single point of configuration where the encoding can be switched, rather than requiring every piece of code to take care of it on its own).

The comparison to JSON or ASN.1 seems somewhat far-fetched to me. Unicode is expected (and, I think, widely accepted) to eventually succeed legacy character encodings such as Latin-1, with their well-known limits. And among the Unicode encodings, UTF-8 would presumably be a good default, because it was specifically designed with backwards compatibility in mind. In contrast, JSON and ASN.1 are rather special-purpose (and typically not used as character encodings), so I don't currently see any need for similar built-in support for them in Perl.

The truth is that UTF-8 is a variable-length character encoding method. It's probably a good thing that you have to explicitly decode inputs and encode outputs. It forces you to know what you are doing.

Equally (with a hypothetical pure-ASCII mindset in place) you could say: "The truth is that Latin-1 is a (specific) 8-bit character encoding method. It's probably a good thing that you have to explicitly decode inputs and encode outputs. It forces you to know what you are doing." Still, we do have Latin-1 semantics by default in Perl...

Just because UTF-8 is variable length doesn't mean it wouldn't be a sensible choice in environments that otherwise make use of it, in particular when the programmer explicitly requests that very functionality using a pragma.
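
As a point of comparison, the existing `open` pragma already gives a lexically scoped default for I/O layers, which is part of (though not all of) what is being asked for. The sketch below assumes a temporary file from File::Temp as a stand-in for real input and output; the sample text and variable names are invented for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# One declaration for this lexical scope: open() calls without explicit
# layers get a UTF-8 layer, and :std applies it to STDIN/STDOUT/STDERR too.
use open qw(:std :encoding(UTF-8));

# Sample character string with one character outside Latin-1.
my $text = "sm\x{F8}rrebr\x{F8}d \x{263A}";

# Scratch file standing in for real output (removed at program exit).
my (undef, $path) = tempfile(UNLINK => 1);

# No layer argument here: the pragma supplies :encoding(UTF-8).
open(my $out, '>', $path) or die "open for write: $!";
print {$out} $text;
close($out) or die "close: $!";

# Same for reading: characters come back decoded.
open(my $in, '<', $path) or die "open for read: $!";
my $read_back = <$in>;
close($in);

# An explicit layer in the mode overrides the default, so this reads raw octets.
open(my $raw, '<:raw', $path) or die "open raw: $!";
my $octets = <$raw>;
close($raw);

printf "characters: %d, octets on disk: %d\n",
    length($read_back), length($octets);
```

This only covers handles opened within the pragma's scope. It does not touch data arriving by other routes (sockets opened elsewhere, `@ARGV`, environment variables), which is where the wish for a broader default comes from.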

(...) MIME headers must be in a 7-bit encoding

The current 8-bit default for IO could cause just as much potential breakage as UTF-8 would in this case. I don't think that particular limits which apply to certain content (or parts thereof) are a good argument against generally providing a way to conveniently say "I want UTF-8 to be used as default for all strings/content" (which is what I think the OP had in mind). Special cases can be dealt with in the application code. As things are now, UTF-8 (or, more generally, anything non-Latin-1) is still too often the "special case", rather than a (configurable!) global default.
