http://www.perlmonks.org?node_id=1099816


in reply to Re^3: Database vs XML output representation of two-byte UTF-8 character
in thread Database vs XML output representation of two-byte UTF-8 character

No, not completely. More importantly, it's a useful way to think about the problem.

Very few need to know about Perl internals that are irrelevant to the problem at hand.

Why, 'Unicode' is not an awful name.

Because it can be used to store any 72-bit values (well, limited to 32- or 64-bit in practice), not just Unicode code points. You've just demonstrated this.

what's more awful is silent conversion from '8-bit chars' to UTF-8, or back.

Perl's ability to use a more efficient storage format when possible and a less efficient one when necessary is a great feature, not an awful one. $x = "a"; $x .= "é"; is no more awful than $x = 18446744073709551615; ++$x;. Both cause an internal storage format shift.
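A sketch of both shifts, peeking at the internals with the core `utf8::is_utf8` predicate (which reports the storage format, not whether the string is text); U+2660 is used here instead of "é" so the upgrade is guaranteed:

```perl
use strict;
use warnings;

my $x = "a";
print utf8::is_utf8($x) ? "upgraded\n" : "byte format\n";  # byte format

$x .= chr(0x2660);   # U+2660 doesn't fit the one-byte-per-char format
print utf8::is_utf8($x) ? "upgraded\n" : "byte format\n";  # upgraded

# The value is unaffected by the shift: still two characters.
print length($x), "\n";  # 2

# The numeric analogue: integer storage overflows into float storage.
my $n = 18446744073709551615;   # UV max on a 64-bit perl
$n++;                           # now stored as a float
```

In both cases the observable value is what changed representation, not the program's view of it.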

The lack of ability to tell Perl whether a string is text, UTF-8 or something else is unfortunate because it would allow Perl to catch common errors, but that has nothing to do with the twin storage formats.

No, the problem is that Mr. Keenan, who is an experienced Perl programmer with quite a few modules on CPAN (pardon me if I got that wrong), appears to be confused about Perl's behaviour.

That would be helped by the aforementioned type system, but not by misinformation.

But when I did that unreasonable thing Perl didn't try to help me (like it tries to help when I do something like "1 + 'x'" ("argument isn't numeric...")).

Unfortunately, Perl does not have the information it would need to have to know you did something wrong.

It does warn you when it knows a problem occurred (as you mentioned), but it can't warn when it doesn't know.
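Both of the cases where Perl does know are easy to trigger; a sketch (the exact warning text may vary between Perl versions):

```perl
use strict;
use warnings;

my @warned;
local $SIG{__WARN__} = sub { push @warned, $_[0] };

my $n = 1 + 'x';                    # Perl knows 'x' isn't numeric

open my $fh, '>', \my $buf or die;  # in-memory handle, no :encoding layer
print $fh chr(0x2660);              # Perl knows a wide character leaked out

print scalar(@warned), "\n";        # 2
print $warned[0];                   # Argument "x" isn't numeric in addition...
print $warned[1];                   # Wide character in print...
```

Concatenating decoded text with undecoded bytes triggers neither, because nothing marks either operand as text or as bytes.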

Yes, yes. And how many Perl programs in the wild (or even on CPAN) actually do that?

Those that work?

There's definitely room for improvement, I'm not disputing that.

Re^5: Database vs XML output representation of two-byte UTF-8 character
by Anonymous Monk on Sep 07, 2014 at 17:57 UTC
    Because it can be used to store any 72-bit values (well, limited to 32- or 64-bit in practice), not just Unicode code points. You've demonstrated this.
    (shrug) Yeah, I've never used that feature.
    Perl's ability to use a more efficient storage format when possible and a less efficient one when necessary is a great feature, not an awful one. $x = "a"; $x .= "é"; is no more awful than $x = 18446744073709551615; ++$x;. Both cause an internal storage format shift. The lack of ability to tell Perl whether a string is text, UTF-8 or something else is unfortunate because it would allow Perl to catch common errors, but that has nothing to do with the twin storage formats.
    Again, I agree and don't agree... The assumption that all strings are in one of the storage formats, unless explicitly specified otherwise, is a source of great confusion. Perl's source code (without "use utf8")? Output of readdir? Contents of @ARGV? I don't see how one can not think about implementation details, storage formats, leaky abstractions and other bad things. To me, 'Perl thinks everything is in Latin-1, unless told otherwise' seems like a more useful, understandable explanation.
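    One way to sidestep that confusion is to decode at the boundary yourself, so nothing relies on the implicit Latin-1 upgrade; a minimal sketch (assuming the incoming bytes are UTF-8, and simulating @ARGV for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode);

# @ARGV and readdir hand the program raw bytes. Decoding them at the
# boundary makes the text/bytes distinction explicit instead of leaving
# it to Perl's implicit Latin-1 upgrade.
local @ARGV = ("\xC3\xA9");    # simulate: the UTF-8 bytes of "é"
my @args = map { decode('UTF-8', $_) } @ARGV;

print length($ARGV[0]), "\n";  # 2  (two bytes)
print length($args[0]), "\n";  # 1  (one character, U+00E9)
```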
    Unfortunately, Perl does not have the information it would need to have to know you did something wrong.
    For some definitions of 'wrong'. If I actually do have Latin-1 (more realistically, ASCII), then it's not 'wrong'; is that what you want to say?

      Again, I agree and don't agree... The assumption that all strings are in one of the storage formats, unless explicitly specified otherwise, is a source of great confusion.

      No idea what that means.

      Perl's source code (without "use utf8")? Output of readdir? Contents of @ARGV?

      Don't know. Don't care. Doesn't matter how they are stored, as those are internal details that aren't relevant.

      What does matter is whether they returned decoded text or something else. That has nothing to do with the internal storage format.

      To me, 'Perl thinks everything is in Latin-1, unless told otherwise' seems like a more useful, understandable explanation.

      It's completely false — nothing in Perl accepts or produces latin-1 — and it has nothing to do with anything discussed so far.

      If I actually do have Latin-1 (more realistically, ASCII) than it's not 'wrong', is that what you want to say?

      You were complaining that Perl let you concatenate decoded text and UTF-8 bytes. (Well, you called it something different, but this is the underlying issue.) It has no idea that one of the strings you are concatenating contains text and that the other contains UTF-8 bytes, so it can't let you know that you are doing something wrong.

      For example,

      my $x = chr(0x2660);           # decoded text: U+2660
      my $y = chr(0xC3) . chr(0xA9); # two bytes: the UTF-8 encoding of "é"
      $x . $y;

      This is all the information Perl currently has. Is that an error? You can't tell. Perl can't tell. Strings coming from a file handle with a decoding layer should be flagged "I'm decoded text!". Those coming from a file handle without a decoding layer should be flagged "I'm bytes!". Concatenating the two should be an error. These flags do not currently exist.

        No idea what that means.
        RTFM then.
        "By default, there is a fundamental asymmetry in Perl's Unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in ISO 8859-1 (Latin-1)"
        It's completely false — nothing in Perl accepts or produces latin-1 — and it has nothing to do with anything discussed so far.
        LOL. It looks like Latin-1 and quacks like Latin-1, but it's not Latin-1. Yeah, it's just 'byte-packed subset of Unicode'.
        "Whenever your encoded, binary string is used together with a text string, Perl will assume that your binary string was encoded with ISO-8859-1, also known as latin-1. If it wasn't latin-1, then your data is unpleasantly converted. For example, if it was UTF-8, the individual bytes of multibyte characters are seen as separate characters, and then again converted to UTF-8."
        How about you 'fix' Perl's documentation, and then start arguing... It even talks about 'Unicode' and 'binary' strings (gasp).
        my $x = chr(0x2660); my $y = chr(0xC3).chr(0xA9); $x . $y;
        This is all the information Perl currently has. Is that an error?
        Is it an error that perl -wE 'my $x = chr(0x00A9); say $x' does one thing, and perl -wE 'my $y = chr(0x2660); say $y' does something else? I dunno. You tell me. Intuitively, there should be no difference whatsoever: chr should be consistent, say should be consistent, everything should be... (confused) (not really).
        This is all the information Perl currently has. Is that an error? You can't tell. Perl can't tell. Strings coming from a file handle with a decoding layer should be flagged "I'm decoded text!". Those coming from a file handle without a decoding layer should be flagged "I'm bytes!". Concatenating the two should be an error. These flags do not currently exist.
        So you're not even disagreeing. You just hate the word 'Latin-1'. I'm done with you. Have a nice day.