Re: Database vs XML output representation of two-byte UTF-8 character

Can anyone explain why this is happening?

1) Internally, Perl has two different kinds of strings. We'll call them 'binary' and 'unicode' strings.

$ perl -MDevel::Peek -e 'Dump "Я"'

FLAGS = (POK,READONLY,IsCOW,pPOK)

This is a binary string.

$ perl -MDevel::Peek -e 'use utf8; Dump "Я"'

FLAGS = (POK,READONLY,IsCOW,pPOK,UTF8)

This is a unicode string. It has the so-called 'UTF8 flag' turned on, while binary strings don't (internally, Perl 'unicode' strings are encoded in UTF-8).
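The same distinction can be checked from inside a program. This is only a minimal sketch: it assumes the built-in utf8::is_utf8 function (callable without loading any module) and uses \x escapes so the result does not depend on how the source file is encoded. The variable names are illustrative.

```perl
use strict;
use warnings;

# The two UTF-8 bytes that encode U+042F; no UTF8 flag is needed for these.
my $bytes = "\xD0\xAF";

# One character with code point U+042F; it is above 0xFF, so Perl must
# store it in its internal UTF-8 form and set the UTF8 flag.
my $chars = "\x{42F}";

my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;
my $chars_flag = utf8::is_utf8($chars) ? 1 : 0;

printf "bytes: length=%d flag=%d\n", length($bytes), $bytes_flag;
printf "chars: length=%d flag=%d\n", length($chars), $chars_flag;
```

Note that length() already tells the two apart (2 versus 1), which is usually a more useful question to ask than whether the flag is set.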

2) Perl pretends that it doesn't have two different types of strings. Whenever a binary string enters a 'Unicode context' (so to speak), Perl converts the binary string to Unicode, with not entirely satisfactory results. Also, it sometimes tries to convert Unicode strings to binary, which also doesn't work very well.

$ perl -E 'binmode STDOUT, ":encoding(UTF-8)"; say "Я"'

Ð¯

Binary string "Я" got mangled in Unicode context.
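What the :encoding(UTF-8) layer does to such a string can be reproduced with Encode directly. A rough sketch, assuming the string holds the raw UTF-8 bytes D0 AF (which is what a non-ASCII literal contains when the utf8 pragma is off and the source is saved as UTF-8); the variable names are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Already-encoded data: the UTF-8 bytes for U+042F.
my $already_encoded = "\xD0\xAF";

# encode() treats each element of the string as a code point, so the two
# bytes are read as U+00D0 and U+00AF and each is encoded a second time.
my $double_encoded = encode('UTF-8', $already_encoded);

my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, $double_encoded;
print "$hex\n";    # c3 90 c2 af
```

Two bytes went in and four came out: the classic signature of double encoding.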

$ perl -E 'use utf8; my $x = "Я"; no utf8; my $y = "Я"; say $x . $y'

Wide character in say at -e line 1.

ЯÐ¯

Unicode string $x was concatenated with binary string $y, and $y was 'upgraded' to Unicode. At least, we got a warning...
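The concatenation case can be inspected code point by code point. A small sketch, assuming the byte string holds UTF-8 and that Encode::decode is the conversion wanted; names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $text  = "\x{42F}";     # decoded text: one character
my $bytes = "\xD0\xAF";    # the same letter as two UTF-8 bytes

# Mixing the two: each byte is carried over as its own code point.
my $mixed = $text . $bytes;

# Decoding the bytes first gives the intended result.
my $clean = $text . decode('UTF-8', $bytes);

my @mixed_cp = map { sprintf 'U+%04X', ord($_) } split //, $mixed;
my @clean_cp = map { sprintf 'U+%04X', ord($_) } split //, $clean;

print "mixed: @mixed_cp\n";    # mixed: U+042F U+00D0 U+00AF
print "clean: @clean_cp\n";    # clean: U+042F U+042F
```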

$ perl -wE 'use utf8; my $x = "ç"; no utf8; my $y = "ç"; say $x . $y'

�ç

Where's my warning, Perl?... And what happened with $x???

3) Perl thinks that all binary strings (those without the UTF-8 flag) are encoded in Latin-1. Whenever it sees fit, it converts them to Unicode. And vice versa.

c2 bb becomes U+00C2 U+00BB. That is, "»" becomes "Â»" (from Latin-1 to Unicode). "ç" becomes "�" (from Unicode to Latin-1, which cannot be displayed on my terminal).
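Both directions can be made explicit with Encode. This is a sketch under the assumption that the bytes c2 bb were meant as UTF-8; 'ISO-8859-1' is Encode's name for Latin-1, and the variable names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xC2\xBB";    # UTF-8 encoding of U+00BB

# Read as Latin-1 (what Perl assumes for unflagged strings): two characters.
my $as_latin1 = decode('ISO-8859-1', $bytes);

# Read as UTF-8 (what was meant): one character.
my $as_utf8 = decode('UTF-8', $bytes);

# The other direction: U+00BB written out as Latin-1 is the single byte BB,
# which is not valid UTF-8 on its own.
my $latin1_out = encode('ISO-8859-1', $as_utf8);

printf "latin1 view: %s\n", join ' ', map { sprintf 'U+%04X', ord($_) } split //, $as_latin1;
printf "utf8 view:   U+%04X\n", ord($as_utf8);
printf "latin1 out:  %02x\n", ord($latin1_out);
```

A UTF-8 terminal that receives the lone byte bb has nothing valid to draw, hence the replacement character.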

4) To make things more interesting, Perl doesn't always turn the UTF-8 flag on:

$ perl -MDevel::Peek -E 'use utf8; Dump "This is America!"'

FLAGS = (POK,READONLY,IsCOW,pPOK)

Conclusion: to use Perl, you must either be an American, or an expert in Unicode and Perl internals. Well, you seem to be an American, Jim.

Re^2: Database vs XML output representation of two-byte UTF-8 character by ikegami (Patriarch) on Sep 07, 2014 at 15:29 UTC
Wow, this is completely wrong.

> Perl has two different kinds of strings. We'll call them 'binary' and 'unicode' strings.

Awful names, and they have nothing to do with binary or Unicode. They are respectively strings of 8-bit chars and strings of 72-bit chars.

> it sometimes tries to convert Unicode strings to binary, which also doesn't work very well

No, the problem is that you told it to encode text that was already encoded. It has nothing to do with the internal string formats.

> Binary string "Я" got mangled in Unicode context.

No, you created garbage by concatenating UTF-8 and text. It has nothing to do with the internal string formats. When you do `$number + $letters`, Perl doesn't mangle anything; you did. When you do `$text + $utf8`, Perl doesn't mangle anything; you did.

> Conclusion: to use Perl, you must either be an American, or an expert in Unicode and Perl internals.

Just like you wouldn't insert text into SQL without conversion, insert text into HTML without conversion, or insert text into a command line without conversion; all you have to do is not insert text into UTF-8 (or vice-versa) without conversion. It doesn't take an American to understand that `4 + apple` is going to be garbage. Decode inputs. Encode outputs. That's it.
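The advice above (decode on the way in, encode on the way out) can be sketched as follows. The function name shout_utf8 is hypothetical, and UTF-8 on both ends is an assumption; FB_CROAK and LEAVE_SRC are Encode's standard check constants.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 bytes, returns UTF-8 bytes.
# Everything in between works on decoded text.
sub shout_utf8 {
    my ($in_bytes) = @_;

    # Decode input; die on malformed UTF-8 rather than guessing.
    my $text = decode('UTF-8', $in_bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

    # Operate on characters, not bytes (uc follows Unicode rules here).
    my $result = uc $text;

    # Encode output.
    return encode('UTF-8', $result);
}

my $out = shout_utf8("\xD1\x8F");    # UTF-8 bytes of U+044F (lowercase ya)
printf "%s\n", join ' ', map { sprintf '%02x', ord($_) } split //, $out;
```

Applying uc to the undecoded bytes would leave them unchanged in this case, because neither byte is a lowercase letter on its own.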
Re^3: Database vs XML output representation of two-byte UTF-8 character by Anonymous Monk on Sep 07, 2014 at 16:33 UTC
> Wow, this is completely wrong.

No, not completely. More importantly, it's a useful way to think about the problem.

> Awful names, and they have nothing to do with binary or Unicode. They are respectively strings of 8-bit chars and strings of 72-bit chars.

Why, 'Unicode' is not an awful name. It's irrelevant that Perl's UTF-8 allows bigger codepoints than the Unicode Consortium defines. 'Binary' is maybe an awful name, but what's more awful is silent conversion from '8-bit chars' to UTF-8, or back.

> No, the problem is that you told it to encode text that was already encoded. It has nothing to do with the internal string formats.

No, the problem is that mister Keenan, who is an experienced Perl programmer with quite a few modules on CPAN (pardon me if I got that wrong), appears to be confused about Perl's behaviour. It has everything to do with the way Perl works.

> No, you created garbage by concatenating UTF-8 and text. It has nothing to do with the internal string formats.

No, perl the computer program created garbage, because of the way it works. What does 'concatenating UTF-8 and text' even mean? Why doesn't that actually work? (You know why.) Why can't Perl warn me that I'm doing something stupid? (You know why.)

> When you do `$number + $letters`, Perl doesn't mangle anything; you did. When you do `$text + $utf8`, Perl doesn't mangle anything; you did.

But when I did that unreasonable thing Perl didn't try to help me (like it tries to help when I do something like "1 + 'x'" ("argument isn't numeric...")). Yet here we have no warnings, no nothing. So it's not an error in Perl to do something stupid like `$text + $utf8`; IT'S SUPPOSED TO WORK LIKE THAT. And you know it. So yes, I can say that Perl mangled the strings, because this is the way it's intended to work.

> Just like you wouldn't insert text into SQL without conversion, insert text into HTML without conversion, or insert text into a command line without conversion; all you have to do is not insert text into UTF-8 (or vice-versa) without conversion.

You know, Ikegami, it's true and not true. I actually know how to use Perl. But Perl provides absolutely no guidance towards that. And...

> Decode inputs. Encode outputs.

Yes, yes. And how many Perl programs in the wild (or even on CPAN) actually do that? I'd say very few. Do you disagree? I'd even say most Perl programmers actually rarely need to do any encoding/decoding. Do you disagree?

> It doesn't take an American

Perl works just fine when all you have is ASCII (or Latin-1). If you don't have ASCII/Latin-1... are names of files and directories binary or Unicode? (Call it what you will.) What about command-line parameters? Do I have to decode them? (Yes.) OK, why does "...or die $!;" produce garbage? Oh right, strerror returned something that is not ASCII/Latin-1 (and I heard some of the porters want to make Perl speak only English, arguing that English is better than mojibake). I'd say it's pretty confusing for your average Perl programmer. Let's keep things in perspective: Perl was never supposed to be something hardcore like C++.
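For the command-line case, a minimal sketch of the decoding step might look like this. It assumes the arguments arrive as UTF-8, which depends on the user's locale and is not guaranteed; decode_args is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw argument bytes into decoded text.
# ASSUMPTION: the shell passed UTF-8; other locales need another encoding name.
sub decode_args {
    my @raw = @_;
    return map { decode('UTF-8', $_) } @raw;
}

my @args = decode_args(@ARGV);
printf "%d argument(s), lengths in characters: %s\n",
    scalar(@args), join(',', map { length } @args);
```

Run with a single Cyrillic letter as the argument, this should report a length of 1 rather than 2.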
Re^4: Database vs XML output representation of two-byte UTF-8 character by ikegami (Patriarch) on Sep 07, 2014 at 16:49 UTC
> No, not completely. More importantly, it's a useful way to think about the problem.

Very few need to know about Perl internals that are irrelevant to the problem at hand.

> Why, 'Unicode' is not an awful name.

Because it can be used to store any 72-bit values (well, limited to 32- or 64-bit in practice), not just Unicode code points. You've just demonstrated this.

> what's more awful is silent conversion from '8-bit chars' to UTF-8, or back.

Perl's ability to use a more efficient storage format when possible and a less efficient one when necessary is a great feature, not an awful one. `$x = "a"; $x .= "é";` is no more awful than `$x = 18446744073709551615; ++$x;`. Both cause an internal storage format shift. The lack of ability to tell Perl whether a string is text, UTF-8 or something else is unfortunate because it would allow Perl to catch common errors, but that has nothing to do with the twin storage formats.

> No, the problem is that mister Keenan, who is an experienced Perl programmer with quite a few modules on CPAN (pardon me if I got that wrong), appears to be confused about Perl's behaviour

That would be helped by the aforementioned type system, but not by misinformation.

> But when I did that unreasonable thing Perl didn't try to help me (like it tries to help when I do something like "1 + 'x'" ("argument isn't numeric...")).

Unfortunately, Perl does not have the information it would need to know you did something wrong. It does warn you when it knows a problem occurred (as you mentioned), but it can't warn when it doesn't know.

> Yes, yes. And how many Perl programs in the wild (or even on CPAN) actually do that?

Those that work? There's definitely room for improvement, I'm not disputing that.
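The claim that the storage format is invisible at the string level can be checked with a short sketch. It uses the built-in utf8::upgrade to force the internal format change by hand instead of relying on a particular operation to trigger it; variable names are illustrative.

```perl
use strict;
use warnings;

my $narrow = "caf\xE9";    # four characters, stored one byte each
my $wide   = $narrow;
utf8::upgrade($wide);      # same characters, now stored internally as UTF-8

my $same_string  = ($narrow eq $wide) ? 1 : 0;
my $same_length  = (length($narrow) == length($wide)) ? 1 : 0;
my $flags_differ = (utf8::is_utf8($narrow) xor utf8::is_utf8($wide)) ? 1 : 0;

print "same string: $same_string, same length: $same_length, ",
      "flags differ: $flags_differ\n";
```

The two scalars compare equal and have the same length; only the flag differs, which is why checking the flag says nothing about whether a string holds decoded text or encoded bytes.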
Re^5: Database vs XML output representation of two-byte UTF-8 character by Anonymous Monk on Sep 07, 2014 at 17:57 UTC
Re^6: Database vs XML output representation of two-byte UTF-8 character by ikegami (Patriarch) on Sep 09, 2014 at 04:43 UTC
Re^2: Database vs XML output representation of two-byte UTF-8 character by Anonymous Monk on Sep 07, 2014 at 11:00 UTC
(this f site also doesn't like Anti-American letters and stuff)

