How simple do you want encoding/decoding? Would you like Perl to "automagically" encode/decode JSON? ASN.1? Why, specifically, do you demand it of UTF-8?
The truth is that UTF-8 is a variable-length character encoding method. It's probably a good thing that you have to explicitly decode inputs and encode outputs: it forces you to know what you are doing.
Update, with an example: e-mail. Yes, you can send e-mails as UTF-8! But were you aware that MIME headers must be in a 7-bit encoding? In this case, blindly opening a socket and telling it to encode all output as UTF-8 will severely break your application. It is much better to know specifically when and where encoding is appropriate and permissible.
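To make the e-mail case concrete, here is a minimal sketch using the core Encode module. The subject and body text are made up for illustration, and a real mailer would also have to emit matching Content-Type and Content-Transfer-Encoding headers, which this sketch leaves out.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Character strings (not bytes); the text itself is purely illustrative.
my $subject = "Gr\x{fc}\x{df}e aus K\x{f6}ln";
my $body    = "Sch\x{f6}ne Gr\x{fc}\x{df}e!\n";

# Header value: an RFC 2047 encoded-word, which consists of 7-bit ASCII only.
my $header_value = encode('MIME-Header', $subject);

# Body: plain UTF-8 bytes, which is only valid once the headers declare it.
my $body_bytes = encode('UTF-8', $body);

print "Subject: $header_value\n";
```

The point is that the same message needs two different encodings, so a single "encode everything as UTF-8" layer on the socket would get the header wrong.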
| [reply] |
Would you like Perl to "automagically" encode/decode JSON? ASN.1?
Why, specifically, do you demand it of UTF-8?
It's a matter of convenience, primarily — and in some cases,
transparency (such as having a single point of configuration where the
encoding can be switched, rather than requiring every piece of code to
take care of it on its own).
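As a rough sketch of what I mean by a single point of configuration: the variable $IO_ENCODING and the helper open_text() below are hypothetical names of my own, not anything Perl provides. An in-memory buffer stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical application-wide setting; change it here and nowhere else.
our $IO_ENCODING = 'UTF-8';

# Hypothetical helper: open a path (or scalar ref) with the configured layer.
sub open_text {
    my ($mode, $target) = @_;
    open(my $fh, "$mode:encoding($IO_ENCODING)", $target)
        or die "Cannot open: $!";
    return $fh;
}

# Write one character string through the helper into an in-memory buffer.
my $buffer = '';
my $out = open_text('>', \$buffer);
print {$out} "caf\x{e9}";
close $out;

# With UTF-8 configured, the e-acute takes two bytes: prints "5 bytes".
printf "%d bytes\n", length $buffer;
```

Setting $IO_ENCODING to 'iso-8859-1' would make the same call write 4 bytes, without touching any of the code that does the printing.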
The comparison to JSON or ASN.1 seems somewhat far-fetched to me.
Unicode is intended (and, I think, widely expected) to eventually
replace legacy character encodings such as Latin-1, with their
well-known limits. And, among the Unicode encodings, UTF-8 would
presumably be a good choice for the default (because it was
specifically designed with backwards compatibility with ASCII in
mind).
In contrast, JSON and ASN.1 are rather special-purpose (and typically
not used as character encodings), so I don't currently see any need
to have similar built-in support for them in Perl.
The truth is that UTF-8 is a variable-length character encoding
method. It's probably a good thing that you have to explicitly decode
inputs and encode outputs: it forces you to know what you are doing.
Equally (with a hypothetical pure-ASCII mindset in place) you could
say: "The truth is that Latin-1 is a (specific) 8-bit character
encoding method. It's probably a good thing that you have to explicitly
decode inputs and encode outputs: it forces you to know what you are
doing." Still, we do have Latin-1 semantics by default in Perl...
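For anyone who wants to see that default in action, a small sketch (the in-memory handle is only there to make the bytes easy to inspect, and it assumes no default layers have been set via -C or PERL_UNICODE):

```perl
use strict;
use warnings;

# No layer is requested, so the handle gets Perl's default byte semantics.
my $buffer = '';
open(my $fh, '>', \$buffer) or die "Cannot open in-memory handle: $!";
print {$fh} "\x{e9}";    # U+00E9, LATIN SMALL LETTER E WITH ACUTE
close $fh;

# The character went out as the single Latin-1 byte 0xE9:
# prints "1 byte(s): E9".
printf "%d byte(s): %02X\n", length($buffer), ord($buffer);
```

Nobody had to ask for Latin-1 here; it is simply what you get for characters up to U+00FF.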
Just because UTF-8 is variable length doesn't mean it wouldn't be a
sensible choice in environments that otherwise make use of it, in
particular when the programmer explicitly requests that very
functionality using a pragma.
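The closest thing available today is probably the open pragma, which at least covers file handles within its lexical scope. A minimal sketch, using a temporary file and made-up text:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Lexically scoped default: open() calls below that name no layer get UTF-8.
use open qw(:encoding(UTF-8));

# Only the file name is needed; the handle from tempfile() is discarded.
my (undef, $path) = tempfile(UNLINK => 1);

open(my $out, '>', $path) or die "Cannot write $path: $!";
print {$out} "na\x{ef}ve \x{263A}\n";    # includes a character above U+00FF
close $out;

open(my $in, '<', $path) or die "Cannot read $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# The string is counted in characters, the file in (more) bytes.
printf "%d characters, %d bytes on disk\n", length($line), -s $path;
```

This only affects open() calls compiled in that scope, so it is still a long way from "UTF-8 for all strings/content".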
(...) MIME headers must be in a 7-bit encoding
The current 8-bit default for IO could cause just as much breakage
as UTF-8 would in this case. I don't think that particular limits
which apply to certain content (or parts thereof) are a good
argument against generally providing a way to conveniently say "I want
UTF-8 to be used as default for all strings/content" (which is what I
think the OP had in mind).
Special cases can be dealt with in the application code. As things are
now, UTF-8 (or, more generally, anything non-Latin-1) is still too
often the "special case", rather than a (configurable!) global default.