PerlMonks  

Re^2: Database vs XML output representation of two-byte UTF-8 character

by ikegami (Patriarch)
on Sep 07, 2014 at 15:29 UTC [id://1099810]


in reply to Re: Database vs XML output representation of two-byte UTF-8 character
in thread Database vs XML output representation of two-byte UTF-8 character

Wow, this is completely wrong.

Perl has two different kinds of strings. We'll call them 'binary' and 'unicode' strings.

Awful names, and they have nothing to do with binary or Unicode.

They are, respectively, strings of 8-bit chars and strings of 72-bit chars.

it sometimes tries to convert Unicode strings to binary, which also doesn't work very well

No, the problem is that you told it to encode text that was already encoded. It has nothing to do with the internal string formats.
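
A minimal sketch of what "encoding text that was already encoded" looks like, using the core Encode module. The variable names are illustrative and not taken from the original post.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "Я" (U+042F) as decoded text: a single character.
my $text = "\x{42F}";

# Encoding once yields the two UTF-8 bytes D0 AF.
my $utf8 = encode('UTF-8', $text);

# Encoding the already-encoded bytes treats them as the characters
# U+00D0 and U+00AF and encodes each one again: C3 90 C2 AF.
my $double = encode('UTF-8', $utf8);

printf "%d %d %d\n", length($text), length($utf8), length($double);   # prints "1 2 4"
```

Nothing here depends on which internal storage format Perl happens to pick; the second encode() call is simply applied to the wrong kind of data.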

Binary string "Я" got mangled in Unicode context.

No, you created garbage by concatenating UTF-8 and text. It has nothing to do with the internal string formats.

When you do $number + $letters, Perl doesn't mangle anything; you did.
When you do $text + $utf8, Perl doesn't mangle anything; you did.
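
The text-plus-UTF-8 case can be sketched like this (using Perl's `.` concatenation operator; the variable names are made up for the example):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "\x{42F}";     # decoded text: the character U+042F ("Я")
my $utf8 = "\xD0\xAF";    # the same letter as raw UTF-8 bytes

# Perl sees three characters here: U+042F, U+00D0, U+00AF.
my $mixed = $text . $utf8;

# Encoding the mix gives D0 AF C3 90 C2 AF: "Я" followed by mojibake.
my $garbage = encode('UTF-8', $mixed);

# Decoding the bytes first gives two copies of the same character.
my $fixed = $text . decode('UTF-8', $utf8);

printf "%d %d %d\n", length($mixed), length($garbage), length($fixed);   # prints "3 6 2"
```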

Conclusion: to use Perl, you must either be an American, or an expert in Unicode and Perl internals.

Just like you wouldn't insert text into SQL without conversion, insert text into HTML without conversion, or insert text into a command line without conversion; all you have to do is not insert text into UTF-8 (or vice-versa) without conversion.

It doesn't take an American to understand that 4 + apple is going to be garbage. Decode inputs. Encode outputs. That's it.
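
As a rough sketch of decoding at the input boundary and encoding at the output boundary: `shout` is a hypothetical helper, and UTF-8 is assumed on both sides.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: takes UTF-8 bytes in, returns UTF-8 bytes out.
sub shout {
    my ($bytes_in) = @_;
    my $text = decode('UTF-8', $bytes_in);   # decode at the input boundary
    $text = uc $text;                        # work on characters, not bytes
    return encode('UTF-8', $text);           # encode at the output boundary
}

# "я" (U+044F) arrives as the bytes D1 8F and leaves as "Я" (U+042F), bytes D0 AF.
my $result = shout("\xD1\x8F");
printf "%vX\n", $result;   # prints "D0.AF"
```

Everything between the decode() and the encode() only ever sees text, so there is nothing to mix up.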

Re^3: Database vs XML output representation of two-byte UTF-8 character
by Anonymous Monk on Sep 07, 2014 at 16:33 UTC
    Wow, this is completely wrong.
    No, not completely. More importantly, it's a useful way to think about the problem.
    Awful names, and they have nothing to do with binary or Unicode. They are, respectively, strings of 8-bit chars and strings of 72-bit chars.
    Why, 'Unicode' is not an awful name. It's irrelevant that Perl's UTF-8 allows bigger codepoints than the Unicode Consortium defines. 'Binary' is maybe an awful name, but what's more awful is silent conversion from '8-bit chars' to UTF-8, or back.
    No, the problem is that you told it to encode text that was already encoded. It has nothing to do with the internal string formats.
    No, the problem is that mister Keenan, who is an experienced Perl programmer with quite a few modules on CPAN (pardon me if I got that wrong), appears to be confused about Perl's behaviour. It has everything to do with the way Perl works.
    No, you created garbage by concatenating UTF-8 and text. It has nothing to do with the internal string formats.
    No, perl the computer program created garbage, because of the way it works. What does that even mean 'concatenating UTF-8 and text'? Why doesn't that actually work? (you know why). Why can't Perl warn me that I'm doing something stupid? (you know why)
    When you do $number + $letters, Perl doesn't mangle anything; you did. When you do $text + $utf8, Perl doesn't mangle anything; you did.
    But when I did that unreasonable thing Perl didn't try to help me (like it tries to help when I do something like "1 + 'x'" ("argument isn't numeric...")). Yet here we have no warnings, no nothing. So it's not an error in Perl to do something stupid like $text + $utf8, IT'S SUPPOSED TO WORK LIKE THAT. And you know it. So yes, I can say that Perl mangled the strings, because this is the way it's intended to work.
    Just like you wouldn't insert text into SQL without conversion, insert text into HTML without conversion, or insert text into a command line without conversion; all you have to do is not insert text into UTF-8 (or vice-versa) without conversion.
    You know, Ikegami, it's true and not true. I actually know how to use Perl. But Perl provides absolutely no guidance towards that. And...
    Decode inputs. Encode outputs.
    Yes, yes. And how many Perl programs in the wild (or even on CPAN) actually do that? I'd say very few. Do you disagree? I'd even say most Perl programmers actually rarely need to do any encoding/decoding. Do you disagree?
    It doesn't take an American
    Perl works just fine when all you have is ASCII (or Latin-1). If you don't have ASCII/Latin-1... are names of files and directories binary or Unicode? (Call it what you will.) What about command-line parameters? Do I have to decode them? (Yes.) OK, why does "...or die $!;" produce garbage? Oh right, strerror returned something that is not ASCII/Latin-1 (and I heard some of the porters want to make Perl speak only English, arguing that English is better than mojibake). I'd say it's pretty confusing for your average Perl programmer. Let's keep things in perspective: Perl was never supposed to be something hardcore like C++.
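
For the command-line case, a minimal sketch of the decoding step might look like the following. `decode_args` is a hypothetical helper, and it assumes the environment hands over arguments as UTF-8, which is not guaranteed on every system.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a list of raw byte strings as UTF-8.
# Assumes a UTF-8 environment; other locales need a different encoding name.
sub decode_args {
    return map { decode('UTF-8', $_) } @_;
}

# @ARGV arrives as undecoded bytes, so decode it once, up front.
my @args = decode_args(@ARGV);

# The same helper applied to bytes such as readdir might return.
my ($name) = decode_args("\xD0\xAF.txt");
print length($name), "\n";   # prints "5": "Я" plus ".txt"
```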

      No, not completely. More importantly, it's a useful way to think about the problem.

      Very few need to know about Perl internals that are irrelevant to the problem at hand.

      Why, 'Unicode' is not an awful name.

      Because it can be used to store any 72-bit values (well, limited to 32- or 64-bit in practice), not just Unicode code points. You've just demonstrated this.

      what's more awful is silent conversion from '8-bit chars' to UTF-8, or back.

      Perl's ability to use a more efficient storage format when possible and a less efficient one when necessary is a great feature, not an awful one. $x = "a"; $x .= "é"; is no more awful than $x = 18446744073709551615; ++$x;. Both cause an internal storage format shift.
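
One way to see that the storage format shift is invisible at the string level is a small sketch using the built-in utf8:: functions (the variable names are illustrative):

```perl
use strict;
use warnings;

my $narrow = "\xE9";     # the character U+00E9, stored in the 8-bit format
my $wide   = "\xE9";
utf8::upgrade($wide);    # same character, now stored in the wide format

# utf8::is_utf8 reports the internal storage format only.
my $narrow_flag = utf8::is_utf8($narrow) ? 1 : 0;   # 0
my $wide_flag   = utf8::is_utf8($wide)   ? 1 : 0;   # 1

# The two strings still compare equal and have the same length.
my $same = ($narrow eq $wide) ? 1 : 0;

print "$narrow_flag $wide_flag $same ", length($wide), "\n";   # prints "0 1 1 1"
```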

      The lack of ability to tell Perl whether a string is text, UTF-8 or something else is unfortunate because it would allow Perl to catch common errors, but that has nothing to do with the twin storage formats.

      No, the problem is that mister Keenan, who is an experienced Perl programmer with quite a few modules on CPAN (pardon me if I got that wrong), appears to be confused about Perl's behaviour

      That would be helped by the aforementioned type system, but not by misinformation.

      But when I did that unreasonable thing Perl didn't try to help me (like it tries to help when I do something like "1 + 'x'" ("argument isn't numeric...")).

      Unfortunately, Perl does not have the information it would need to have to know you did something wrong.

      It does warn you when it knows a problem occurred (as you mentioned), but it can't warn when it doesn't know.
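
The difference between the two cases can be sketched by collecting warnings in an array. The `$SIG{__WARN__}` handler is only there to count them.

```perl
use strict;
use warnings;

my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

# Perl knows 'x' is not a number, so it warns:
# Argument "x" isn't numeric in addition (+)
my $letters = 'x';
my $sum = 1 + $letters;
my $after_sum = scalar @warnings;

# Perl has no record of which string is text and which is UTF-8,
# so this concatenation is silent.
my $text  = "\x{42F}";
my $utf8  = "\xD0\xAF";
my $mixed = $text . $utf8;
my $after_concat = scalar @warnings;

print "$after_sum $after_concat\n";   # prints "1 1"
```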

      Yes, yes. And how many Perl programs in the wild (or even on CPAN) actually do that?

      Those that work?

      There's definitely room for improvement, I'm not disputing that.

        Because it can be used to store any 72-bit values (well, limited to 32- or 64-bit in practice), not just Unicode code points. You've demonstrated this.
        (shrug) Yeah, I've never used that feature.
        Perl's ability to use a more efficient storage format when possible and a less efficient one when necessary is a great feature, not an awful one. $x = "a"; $x .= "é"; is no more awful than $x = 18446744073709551615; ++$x;. Both cause an internal storage format shift. The lack of ability to tell Perl whether a string is text, UTF-8 or something else is unfortunate because it would allow Perl to catch common errors, but that has nothing to do with the twin storage formats.
        Again, I agree and don't agree... The assumption that all strings are in one of the storage formats, unless explicitly specified otherwise, is a source of great confusion. Perl's source code (without "use utf8")? Output of readdir? Contents of @ARGV? I don't see how one can not think about implementation details, storage formats, leaky abstractions and other bad things. To me, 'Perl thinks everything is in Latin-1, unless told otherwise' seems like a more useful, understandable explanation.
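
A small sketch of why the "Latin-1 unless told otherwise" description is tempting: undecoded bytes behave exactly like the string you get by decoding them as ISO-8859-1. Strictly speaking Perl only sees characters with code points 0xD0 and 0xAF here; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xD0\xAF";   # UTF-8 encoding of U+042F, never decoded

# Each byte is treated as a character with that code point.
my @code_points = map { ord($_) } split //, $bytes;   # (0xD0, 0xAF)

# Decoding as Latin-1 produces an equal string ...
my $as_latin1 = decode('ISO-8859-1', $bytes);
my $same = ($as_latin1 eq $bytes) ? 1 : 0;

# ... while decoding as UTF-8 produces the single character U+042F.
my $as_utf8 = decode('UTF-8', $bytes);

print "$same ", length($bytes), " ", length($as_utf8), "\n";   # prints "1 2 1"
```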
        Unfortunately, Perl does not have the information it would need to have to know you did something wrong.
        For some definitions of 'wrong'. If I actually do have Latin-1 (more realistically, ASCII) then it's not 'wrong', is that what you want to say?
