http://www.perlmonks.org?node_id=1099934


in reply to Re^5: Database vs XML output representation of two-byte UTF-8 character
in thread Database vs XML output representation of two-byte UTF-8 character

Again, I agree and don't agree... The assumption that all strings are in one of the storage formats, unless explicitly specified otherwise, is a source of great confusion.

No idea what that means.

Perl's source code (without "use utf8")? Output of readdir? Contents of @ARGV?

Don't know. Don't care. Doesn't matter how they are stored, as those are internal details that aren't relevant.

What does matter is whether they returned decoded text or something else. That has nothing to do with the internal storage format.
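A small sketch of why the storage format is irrelevant, using only the core utf8:: functions (the variable names are just for illustration). The same one-character string is held in both internal formats, and Perl treats the two copies as equal:

```perl
use strict;
use warnings;

# One character, U+00E9. A "\xE9" literal starts out in the
# single-byte (downgraded) internal format.
my $s = "\xE9";

# Copy it and switch only the copy's internal storage format.
my $t = $s;
utf8::upgrade($t);

# The internal flag differs between the two copies...
my $s_flag = utf8::is_utf8($s) ? 1 : 0;   # 0
my $t_flag = utf8::is_utf8($t) ? 1 : 0;   # 1

# ...but the strings are the same as far as Perl code can tell.
my $same_value  = ($s eq $t) ? 1 : 0;
my $same_length = (length($s) == length($t)) ? 1 : 0;

print "flags: $s_flag vs $t_flag\n";
print "equal: $same_value, same length: $same_length\n";
```

Neither flag says whether the string is decoded text or encoded bytes; that depends only on where the string came from.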

To me, 'Perl thinks everything is in Latin-1, unless told otherwise' seems like a more useful, understandable explanation.

It's completely false — nothing in Perl accepts or produces latin-1 — and it has nothing to do with anything discussed so far.

If I actually do have Latin-1 (more realistically, ASCII) then it's not 'wrong', is that what you want to say?

You were complaining that Perl let you concatenate decoded text and UTF-8 bytes. (Well, you called it something different, but this is the underlying issue.) It has no idea that one of the strings you are concatenating contains text and that the other contains UTF-8 bytes, so it can't let you know that you are doing something wrong.

For example,

    my $x = chr(0x2660);
    my $y = chr(0xC3).chr(0xA9);
    $x . $y;

This is all the information Perl currently has. Is that an error? You can't tell. Perl can't tell. Strings coming from a file handle with a decoding layer should be flagged "I'm decoded text!". Those coming from a file handle without a decoding layer should be flagged "I'm bytes!". Concatenating the two should be an error. These flags do not currently exist.
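Here is a rough sketch of that situation. The variable names are mine, and it assumes the core Encode module for the decoding step. Perl performs the mixed concatenation silently; only the programmer knows that one operand still needs decoding:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decoded text: a single character, U+2660 BLACK SPADE SUIT.
my $text = chr(0x2660);

# The UTF-8 encoding of U+00E9, as it would arrive from a file
# handle without a decoding layer: two separate bytes.
my $bytes = chr(0xC3) . chr(0xA9);

# No error and no warning: nothing marks $bytes as undecoded.
my $mixed = $text . $bytes;

# Decoding the bytes first gives the intended two-character string.
my $fixed = $text . decode('UTF-8', $bytes);

printf "mixed: %d characters\n", length $mixed;   # 3
printf "fixed: %d characters\n", length $fixed;   # 2
```

The mixed string has three characters (the spade, U+00C3 and U+00A9) instead of two, and nothing in the string itself records that this was a mistake.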