PerlMonks  

Re^3: Database vs XML output representation of two-byte UTF-8 character

by Anonymous Monk
on Sep 07, 2014 at 16:33 UTC (#1099814)


in reply to Re^2: Database vs XML output representation of two-byte UTF-8 character
in thread Database vs XML output representation of two-byte UTF-8 character

Wow, this is completely wrong.
No, not completely. More importantly, it's a useful way to think about the problem.
Awful names, and they have nothing to do with binary or Unicode. They are respectively strings of 8-bit chars and strings of 72-bit chars.
Why, 'Unicode' is not an awful name. It's irrelevant that Perl's UTF-8 allows bigger codepoints than the Unicode Consortium defines. 'Binary' is maybe an awful name, but what's more awful is silent conversion from '8-bit chars' to UTF-8, or back.
No, the problem is that you told it to encode text that was already encoded. It has nothing to do with the internal string formats.
No, the problem is that mister Keenan, who is an experienced Perl programmer with quite a few modules on CPAN (pardon me if I got that wrong), appears to be confused about Perl's behaviour. It has everything to do with the way Perl works.
No, you created garbage by concatenating UTF-8 and text. It has nothing to do with the internal string formats.
No, perl the computer program created garbage, because of the way it works. What does 'concatenating UTF-8 and text' even mean? Why doesn't that actually work? (You know why.) Why can't Perl warn me that I'm doing something stupid? (You know why.)
When you do $number + $letters, Perl doesn't mangle anything; you did. When you do $text + $utf8, Perl doesn't mangle anything; you did.
But when I did that unreasonable thing Perl didn't try to help me (like it tries to help when I do something like "1 + 'x'" ("argument isn't numeric...")). Yet here we have no warnings, no nothing. So it's not an error in Perl to do something stupid like $text + $utf8, IT'S SUPPOSED TO WORK LIKE THAT. And you know it. So yes, I can say that Perl mangled the strings, because this is the way it's intended to work.
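To make the asymmetry concrete, here is a minimal sketch (the variable names are illustrative assumptions, not taken from the thread). It records warnings while doing a numeric addition with a non-numeric string and then a concatenation of decoded text with UTF-8 bytes. On the perls I know of, only the first triggers a warning.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Collect warnings so they can be counted instead of printed.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Numeric misuse: Perl can see that 'x' is not a number, so it warns.
my $letters = 'x';
my $sum     = 1 + $letters;

# Decoded text (one character, U+00E9) and its UTF-8 encoding (two bytes).
my $text = "\x{E9}";
my $utf8 = encode('UTF-8', $text);

my $count_before = @warnings;

# String misuse: to Perl both operands are just strings of characters,
# so nothing is reported for this concatenation.
my $mixed = $text . $utf8;

my $count_after = @warnings;

printf "numeric warnings: %d\n", $count_before;
printf "concat warnings:  %d\n", $count_after - $count_before;
printf "length of mixed:  %d\n", length $mixed;
```

The resulting three-character string is neither valid text nor valid UTF-8, which is the "garbage" being argued about.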
Just like you wouldn't insert text into SQL without conversion, insert text into HTML without conversion, or insert text into a command line without conversion; all you have to do is not insert text into UTF-8 (or vice-versa) without conversion.
You know, Ikegami, it's true and not true. I actually know how to use Perl. But Perl provides absolutely no guidance towards that. And...
Decode inputs. Encode outputs.
Yes, yes. And how many Perl programs in the wild (or even on CPAN) actually do that? I'd say very few. Do you disagree? I'd even say most Perl programmers actually rarely need to do any encoding/decoding. Do you disagree?
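For readers who have not seen the "decode inputs, encode outputs" pattern spelled out, a rough sketch follows. It assumes the input is UTF-8 and uses a hard-coded byte string in place of a real file handle or socket; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Pretend these bytes arrived from a handle with no decoding layer:
# "caf" followed by U+00E9 encoded as UTF-8 (5 bytes in total).
my $input_bytes = "caf\xC3\xA9";

# Decode inputs: bytes -> text (4 characters). FB_CROAK makes malformed
# input fatal; LEAVE_SRC keeps $input_bytes untouched.
my $text = decode('UTF-8', $input_bytes,
                  Encode::FB_CROAK | Encode::LEAVE_SRC);

# Work on text: uppercasing now sees U+00E9 as a letter.
my $shout = uc $text;

# Encode outputs: text -> bytes, exactly once, at the boundary.
my $output_bytes = encode('UTF-8', $shout);

printf "in: %d bytes, text: %d chars, out: %d bytes\n",
    length $input_bytes, length $text, length $output_bytes;
```

With real file handles the same effect is usually obtained with an `:encoding(UTF-8)` layer on `open` or `binmode`, so the explicit calls are only needed at boundaries that have no layer.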
It doesn't take an American
Perl works just fine when all you have is ASCII (or Latin-1). If you don't have ASCII/Latin-1... are names of files and directories binary or Unicode? (Call it what you will.) What about command-line parameters? Do I have to decode them? (Yes.) OK, why does "...or die $!;" produce garbage? Oh right, strerror returned something that is not ASCII/Latin-1 (and I heard some of the porters want to make Perl speak only English, arguing that English is better than mojibake). I'd say it's pretty confusing for your average Perl programmer. Let's keep things in perspective: Perl was never supposed to be something hardcore like C++.
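As an illustration of the command-line case, the sketch below decodes a simulated argument list by hand. It assumes the arguments are UTF-8 encoded, which is a property of the caller's environment rather than something Perl guarantees; `@raw_args` stands in for `@ARGV` so the example runs without any arguments.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Simulated command line: arguments typically arrive as raw bytes.
# Here the single argument is "na", U+00EF as UTF-8 (2 bytes), "ve".
my @raw_args = ("na\xC3\xAFve");

# Assumption: the arguments are UTF-8. Real code might look up the
# locale's encoding instead of hard-coding it.
my @args = map { decode('UTF-8', $_) } @raw_args;

printf "raw length: %d, decoded length: %d\n",
    length $raw_args[0], length $args[0];
```

Running perl with the `-CA` switch asks it to treat `@ARGV` as UTF-8 for you, which is another way to get the same result.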

Re^4: Database vs XML output representation of two-byte UTF-8 character
by ikegami (Pope) on Sep 07, 2014 at 16:49 UTC

    No, not completely. More importantly, it's a useful way to think about the problem.

    Very few need to know about Perl internals that are irrelevant to the problem at hand.

    Why, 'Unicode' is not an awful name.

    Because it can be used to store any 72-bit values (well, limited to 32- or 64-bit in practice), not just Unicode code points. You've just demonstrated this.

    what's more awful is silent conversion from '8-bit chars' to UTF-8, or back.

    Perl's ability to use a more efficient storage format when possible and a less efficient one when necessary is a great feature, not an awful one. $x = "a"; $x .= "\x{2660}"; is no more awful than $x = 18446744073709551615; ++$x;. Both cause an internal storage format shift.
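The storage shift can be observed, though rarely needs to be, with `utf8::is_utf8`. The sketch below assumes a perl built with 64-bit integers and picks U+2660 as an arbitrary character above 255. It shows the internal flag changing while the string's value stays exactly what the program asked for.

```perl
use strict;
use warnings;

my $x = "a";
my $before = utf8::is_utf8($x) ? 1 : 0;    # compact 8-bit storage

$x .= "\x{2660}";                          # append a character above 255
my $after = utf8::is_utf8($x) ? 1 : 0;     # wider storage format now in use

# What matters to the program is the value: two characters.
printf "internal flag: %d -> %d, length %d\n", $before, $after, length $x;

# Same idea for numbers: an integer that can no longer fit is kept as a
# floating-point value instead (assuming 64-bit integers).
my $n = 18446744073709551615;
++$n;
printf "n is now %.0f\n", $n;
```

Neither shift changes what the program computes, which is why the flag is best treated as a debugging aid rather than as a "this is Unicode" marker.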

    The lack of ability to tell Perl whether a string is text, UTF-8 or something else is unfortunate because it would allow Perl to catch common errors, but that has nothing to do with the twin storage formats.

    No, the problem is that mister Keenan, who is an experienced Perl programmer with quite a few modules on CPAN (pardon me if I got that wrong), appears to be confused about Perl's behaviour

    That would be helped by the aforementioned type system, but not by misinformation.

    But when I did that unreasonable thing Perl didn't try to help me (like it tries to help when I do something like "1 + 'x'" ("argument isn't numeric...")).

    Unfortunately, Perl does not have the information it would need to have to know you did something wrong.

    It does warn you when it knows a problem occurred (as you mentioned), but it can't warn when it doesn't know.

    Yes, yes. And how many Perl programs in the wild (or even on CPAN) actually do that?

    Those that work?

    There's definitely room for improvement, I'm not disputing that.

      Because it can be used to store any 72-bit values (well, limited to 32- or 64-bit in practice), not just Unicode code points. You've demonstrated this.
      (shrug) Yeah, I've never used that feature.
      Perl's ability to use a more efficient storage format when possible and a less efficient one when necessary is a great feature, not an awful one. $x = "a"; $x .= "\x{2660}"; is no more awful than $x = 18446744073709551615; ++$x;. Both cause an internal storage format shift. The lack of ability to tell Perl whether a string is text, UTF-8 or something else is unfortunate because it would allow Perl to catch common errors, but that has nothing to do with the twin storage formats.
      Again, I agree and don't agree... The assumption that all strings are in one of the storage formats, unless explicitly specified otherwise, is a source of great confusion. Perl's source code (without "use utf8")? Output of readdir? Contents of @ARGV? I don't see how one can not think about implementation details, storage formats, leaky abstractions and other bad things. To me, 'Perl thinks everything is in Latin-1, unless told otherwise' seems like a more useful, understandable explanation.
      Unfortunately, Perl does not have the information it would need to have to know you did something wrong.
      For some definitions of 'wrong'. If I actually do have Latin-1 (more realistically, ASCII) then it's not 'wrong', is that what you want to say?

        Again, I agree and don't agree... The assumption that all strings are in one of the storage formats, unless explicitly specified otherwise, is a source of great confusion.

        No idea what that means.

        Perl's source code (without "use utf8")? Output of readdir? Contents of @ARGV?

        Don't know. Don't care. Doesn't matter how they are stored, as those are internal details that aren't relevant.

        What does matter is whether they returned decoded text or something else. That has nothing to do with the internal storage format.

        To me, 'Perl thinks everything is in Latin-1, unless told otherwise' seems like a more useful, understandable explanation.

        It's completely false — nothing in Perl accepts or produces latin-1 — and it has nothing to do with anything discussed so far.

        If I actually do have Latin-1 (more realistically, ASCII) then it's not 'wrong', is that what you want to say?

        You were complaining that Perl let you concatenate decoded text and UTF-8 bytes. (Well, you called it something different, but this is the underlying issue.) It has no idea one of the strings you are concatenating contains text and that the other contains UTF-8 bytes, so it can't let you know that you are doing something wrong.

        For example,

        my $x = chr(0x2660); my $y = chr(0xC3).chr(0xA9); $x . $y;

        This is all the information Perl currently has. Is that an error? You can't tell. Perl can't tell. Strings coming from a file handle with a decoding layer should be flagged "I'm decoded text!". Those coming from a file handle without a decoding layer should be flagged "I'm bytes!". Concatenating the two should be an error. These flags do not currently exist.
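Purely as a thought experiment, the missing flags can be imitated in user code. The `TaggedString` class below is hypothetical (nothing like it is built into Perl or implied by the thread). It tags each string as text or bytes and overloads concatenation so that mixing the two kinds dies instead of silently producing garbage.

```perl
use strict;
use warnings;

# Hypothetical wrapper class: Perl itself has no such flag.
package TaggedString;

use Scalar::Util qw(blessed);

sub text  { my ($class, $v) = @_; return bless { kind => 'text',  value => $v }, $class }
sub bytes { my ($class, $v) = @_; return bless { kind => 'bytes', value => $v }, $class }

sub concat {
    my ($self, $other, $swapped) = @_;
    die "refusing to concatenate a tagged string with an untagged one\n"
        unless blessed($other) && $other->isa(__PACKAGE__);
    die "refusing to concatenate $self->{kind} with $other->{kind}\n"
        if $self->{kind} ne $other->{kind};
    my ($left, $right) = $swapped ? ($other, $self) : ($self, $other);
    return bless {
        kind  => $self->{kind},
        value => $left->{value} . $right->{value},
    }, ref $self;
}

# Route the '.' operator through concat and allow plain stringification.
use overload
    '.'  => \&concat,
    '""' => sub { $_[0]{value} };

package main;

my $x = TaggedString->text(chr(0x2660));
my $y = TaggedString->bytes(chr(0xC3) . chr(0xA9));

my $ok    = eval { $x . TaggedString->text("!") };   # same kind: allowed
my $mixed = eval { $x . $y };                        # text . bytes: dies
my $error = $@;

print "text . text succeeded, length ", length($ok->{value}), "\n";
print "text . bytes failed: $error";
```

This only catches mistakes for strings that were wrapped at the boundary, so it is a sketch of the idea rather than a substitute for the language-level flags described above.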
