Re^6: Database vs XML output representation of two-byte UTF-8 character

by ikegami (Pope)
on Sep 09, 2014 at 04:43 UTC (#1099934)


in reply to Re^5: Database vs XML output representation of two-byte UTF-8 character
in thread Database vs XML output representation of two-byte UTF-8 character

Again, I agree and don't agree... The assumption that all strings are in one of the storage formats, unless explicitly specified otherwise, is a source of great confusion.

No idea what that means.

Perl's source code (without "use utf8")? Output of readdir? Contents of @ARGV?

Don't know. Don't care. Doesn't matter how they are stored, as those are internal details that aren't relevant.

What does matter is whether they returned decoded text or something else. That has nothing to do with the internal storage format.

To me, 'Perl thinks everything is in Latin-1, unless told otherwise' seems like a more useful, understandable explanation.

It's completely false — nothing in Perl accepts or produces latin-1 — and it has nothing to do with anything discussed so far.

If I actually do have Latin-1 (more realistically, ASCII) then it's not 'wrong', is that what you want to say?

You were complaining that Perl let you concatenate decoded text and UTF-8 bytes. (Well, you called it something different, but this is the underlying issue.) It has no idea that one of the strings you are concatenating contains text and that the other contains UTF-8 bytes, so it can't let you know that you are doing something wrong.

For example,

    my $x = chr(0x2660);
    my $y = chr(0xC3).chr(0xA9);
    $x . $y;

This is all the information Perl currently has. Is that an error? You can't tell. Perl can't tell. Strings coming from a file handle with a decoding layer should be flagged "I'm decoded text!". Those coming from a file handle without a decoding layer should be flagged "I'm bytes!". Concatenating the two should be an error. These flags do not currently exist.
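A minimal sketch of the point above, added for illustration and not part of the original post. It assumes the two characters in the second string were meant as the UTF-8 encoding of U+00E9; the variable names are hypothetical. Perl concatenates the strings without complaint either way, and only an explicit decode changes the result.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $text  = chr(0x2660);             # decoded text: one character, U+2660
my $bytes = chr(0xC3) . chr(0xA9);   # two characters: UTF-8 bytes for U+00E9,
                                     # or the two-character text U+00C3 U+00A9

# Perl has no "text" or "bytes" flag, so this is silently accepted.
my $mixed = $text . $bytes;

# If $bytes really was UTF-8, it has to be decoded before mixing with text.
my $fixed = $text . decode('UTF-8', $bytes);

printf "%vX\n", $mixed;   # 2660.C3.A9
printf "%vX\n", $fixed;   # 2660.E9
```

Nothing in the strings themselves says which of the two results was intended; that knowledge lives only in the programmer's head.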


Replies are listed 'Best First'.
Re^7: Database vs XML output representation of two-byte UTF-8 character
by Anonymous Monk on Sep 09, 2014 at 10:41 UTC
    No idea what that means.
    RTFM then.
    "By default, there is a fundamental asymmetry in Perl's Unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in ISO 8859-1 (Latin-1)"
    It's completely false (nothing in Perl accepts or produces latin-1) and it has nothing to do with anything discussed so far.
    LOL. It looks like Latin-1 and quacks like Latin-1, but it's not Latin-1. Yeah, it's just 'byte-packed subset of Unicode'.
    "Whenever your encoded, binary string is used together with a text string, Perl will assume that your binary string was encoded with ISO-8859-1, also known as latin-1. If it wasn't latin-1, then your data is unpleasantly converted. For example, if it was UTF-8, the individual bytes of multibyte characters are seen as separate characters, and then again converted to UTF-8."
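The "unpleasantly converted" case described in that quoted passage can be sketched as follows. This is an illustration under the assumption that the byte string holds UTF-8 for U+00E9 and that the combined string is later encoded as UTF-8 for output; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $bytes = encode('UTF-8', chr(0xE9));   # the two bytes C3 A9
my $text  = chr(0x2660);                  # decoded text

# Mixing first and encoding afterwards encodes each byte of $bytes
# again, as if it were a character of its own.
my $double  = encode('UTF-8', $text . $bytes);

# Decoding first gives the result that was presumably intended.
my $correct = encode('UTF-8', $text . decode('UTF-8', $bytes));

printf "%vX\n", $double;    # E2.99.A0.C3.83.C2.A9
printf "%vX\n", $correct;   # E2.99.A0.C3.A9
```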
    How about you 'fix' Perl's documentation, and then start arguing... It even talks about 'Unicode' and 'binary' strings (gasp).
    my $x = chr(0x2660); my $y = chr(0xC3).chr(0xA9); $x . $y;
    This is all the information Perl currently has. Is that an error?
    Is that an error that perl -wE 'my $x = chr(0x00A9); say $x' does one thing, and perl -wE 'my $y = chr(0x2660); say $y' does something else? I dunno. You tell me. Intuitively, there should be no difference whatsoever: chr should be consistent, say should be consistent, everything should be... (confused) (not really).
    This is all the information Perl currently has. Is that an error? You can't tell. Perl can't tell. Strings coming from a file handle with a decoding layer should be flagged "I'm decoded text!". Those coming from a file handle without a decoding layer should be flagged "I'm bytes!". Concatenating the two should be an error. These flags do not currently exist.
    So you're not even disagreeing. You just hate the word 'Latin-1'. I'm done with you. Have a nice day.

      RTFM then.

      huh? I asked what you meant.

      LOL. It looks like Latin-1 and quacks like Latin-1, but it's not Latin-1. Yeah, it's just 'byte-packed subset of Unicode'.

      huh? What are you talking about?

      How about you 'fix' Perl's documentation, and then start arguing... It even talks about 'Unicode' and 'binary' strings (gasp).

      You can start here. "Perl will assume that your binary string was encoded with ISO-8859-1" is indeed completely wrong. Concatenation does not know or care what the string is.

      $ perl -MEncode -E'
          $x = chr(0x2660);
          $y = encode($ARGV[0], chr(0xC9));
          say sprintf "%vX.%vX %vX", $x, $y, $x.$y;
      ' iso-latin-1
      2660.C9 2660.C9

      $ perl -MEncode -E'
          $x = chr(0x2660);
          $y = encode($ARGV[0], chr(0xC9));
          say sprintf "%vX.%vX %vX", $x, $y, $x.$y;
      ' UTF-8
      2660.C3.89 2660.C3.89

      Is that an error that perl -wE 'my $x = chr(0x00A9); say $x' does one thing, and perl -wE 'my $y = chr(0x2660); say $y' does something else?

      No, Perl "doing something else" (telling you you made an error) when you provide a bad input is not an error.
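To make the difference between the two one-liners concrete, here is a hedged sketch that writes to an in-memory file handle instead of STDOUT. The helper bytes_written is hypothetical, introduced only for this illustration. It reports the bytes that reach the handle and how many warnings were raised, with and without an encoding layer.

```perl
use strict;
use warnings;

# Hypothetical helper: print $string to an in-memory handle opened with
# $mode, then return the bytes written (as dotted hex) and the number
# of warnings raised while printing.
sub bytes_written {
    my ($mode, $string) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $buffer = '';
    open my $fh, $mode, \$buffer or die "open failed: $!";
    print {$fh} $string;
    close $fh;
    return (sprintf('%vX', $buffer), scalar @warnings);
}

for my $case (['>', 0xA9], ['>', 0x2660],
              ['>:encoding(UTF-8)', 0xA9], ['>:encoding(UTF-8)', 0x2660]) {
    my ($mode, $code) = @$case;
    my ($hex, $warned) = bytes_written($mode, chr $code);
    printf "%-18s U+%04X -> %-9s warnings: %d\n", $mode, $code, $hex, $warned;
}
```

Without a layer, U+00A9 goes out as the single byte A9, while U+2660 cannot be a byte, so Perl warns ("Wide character in print") and writes E2.99.A0. With the :encoding(UTF-8) layer both characters are encoded the same way (C2.A9 and E2.99.A0) and no warning is raised.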

      chr should be consistent,

      huh? chr always returns a string consisting of the specified character.

      So you're not even disagreeing. You just hate the word 'Latin-1'

      huh? What are you talking about?!? No, I hate that you're saying your errors are errors in Perl. I hate that you are spreading misinformation about how Perl works. I hate that you're confusing people with issues that aren't even related to theirs. The OP's problem had nothing to do with internal storage formats.

        What they wrote wasn't hard for me to understand. I think it is your being overly submerged in the "the unicode bug" mindset koolaid that prevents you from understanding it. You even seem unable to realize that the author was quoting Perl's own documentation, which starkly disagrees with your narrow way of viewing this.

        It is sad that a reasonable heuristic (if somebody concats a UTF-8 string with a non-UTF-8 string, a reasonable approach would be to assume Latin-1 and give a UTF-8 result) chosen for Perl long ago, has been elevated to some bizarre religion dedicated to maintaining with airtight absoluteness the fiction that "it doesn't matter how the string is encoded". And it has come to the point that one can't even try to increase clarity by describing actual facts about how things are encoded without being contradicted by cult members claiming that one is completely wrong.

        Yes, one can choose to view Perl's handling of strings and Unicode in the "the unicode bug" way where how a string is actually encoded/stored shouldn't matter (and quite often doesn't matter in the end). And that can even be a useful approach. But that is not the only valid way to think about this stuff.

        Worse, demanding that people not even consider how a string is actually stored just leaves a huge opportunity for confusion. To be successful in using the "the unicode bug" mindset, many people first have to obtain an understanding of how Perl proposes to make the encoding not matter. So, for many people, you have to first explain the details about the encoding of Perl strings and how it gets changed and why that is often a reasonable approach before they can accept the "the encoding doesn't matter" premise and start making sound decisions based upon it.

        So, for people not already steeped in the "the unicode bug" koolaid, it is best, in my experience, to start with "Perl has byte strings and UTF-8 strings and when they cross paths, the byte string is assumed to be Latin-1 and is upgraded to UTF-8". After that, then you can explain that it isn't really UTF-8 but Perl's own extension to UTF-8 (called "utf8" or so) and that the assumption isn't strictly "Latin-1" (though the distinctions on that second point are too subtle for me to discern with any clarity). But those clarifications mostly just don't matter except to pedants. And then you can explain about how the encoding shouldn't matter and that you are meant to decode all inputs and encode all outputs, etc.
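Both halves of that explanation can be sketched in a few lines. This is an illustration only, with made-up variable names: utf8::upgrade changes how a string is stored without changing its value, and the "decode all inputs, encode all outputs" discipline keeps bytes and text apart.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Storage format: upgrading changes the internal representation only.
my $native   = chr(0xE9);      # stored one byte per character
my $upgraded = $native;
utf8::upgrade($upgraded);      # same string, now stored as UTF-8 internally

printf "equal: %s, lengths: %d and %d\n",
    ($native eq $upgraded ? 'yes' : 'no'),
    length $native, length $upgraded;       # equal: yes, lengths: 1 and 1

# Decode on input, encode on output.
my $input  = "\xC3\xA9";                    # bytes, as read from a UTF-8 source
my $text   = decode('UTF-8', $input);       # one character, U+00E9
my $output = encode('UTF-8', $text . chr(0x2660));

printf "%vX\n", $output;                    # C3.A9.E2.99.A0
```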

        The worst part about the "the unicode bug" koolaid is that it completely blocks even discussing (much less actually considering) real improvements to Perl's string/Unicode handling. It is completely useless to propose that "assume Latin-1" should actually be "assume Windows-1252" or "assume current locale" because such concepts appear to simply not even make sense to many core maintainers of Perl now.

        - tye        
