PerlMonks  

Re^9: Database vs XML output representation of two-byte UTF-8 character (koolaid)

by tye (Sage)
on Sep 09, 2014 at 16:42 UTC ( [id://1100008]=note )


in reply to Re^8: Database vs XML output representation of two-byte UTF-8 character
in thread Database vs XML output representation of two-byte UTF-8 character

What they wrote wasn't hard for me to understand. I think what prevents you from understanding it is being overly submerged in the "the unicode bug" mindset koolaid. You seem unable even to realize that the author was quoting Perl's own documentation, which starkly disagrees with your narrow way of viewing this.

It is sad that a reasonable heuristic chosen for Perl long ago (if somebody concatenates a UTF-8 string with a non-UTF-8 string, assume the non-UTF-8 string is Latin-1 and give a UTF-8 result) has been elevated to some bizarre religion dedicated to maintaining with airtight absoluteness the fiction that "it doesn't matter how the string is encoded". It has come to the point that one can't even try to increase clarity by describing actual facts about how things are encoded without being contradicted by cult members claiming that one is completely wrong.
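To make that heuristic concrete, here is a minimal sketch of what happens on concatenation. The variable names and the describe() helper are made up for illustration; utf8::is_utf8() only reports the internal storage flag, which is exactly the detail under discussion.

```perl
use strict;
use warnings;

# A byte string: one character, code point 0xE9, stored without the UTF-8 flag.
my $bytes = "\xE9";

# A character string that has to be stored as UTF-8 internally (U+263A).
my $chars = "\x{263A}";

# Concatenation treats $bytes as Latin-1, so byte 0xE9 becomes U+00E9
# in a result that is stored as UTF-8.
my $joined = $bytes . $chars;

# Hypothetical helper: report the storage flag, length, and code points.
sub describe {
    my ($s) = @_;
    my $points = join ' ', map { sprintf('U+%04X', ord($_)) } split //, $s;
    return sprintf('flag=%d length=%d codepoints=%s',
        utf8::is_utf8($s) ? 1 : 0, length($s), $points);
}

print describe($joined), "\n";
```

The result has two characters, U+00E9 and U+263A, and carries the UTF-8 flag.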

Yes, one can choose to view Perl's handling of strings and Unicode in the "the unicode bug" way where how a string is actually encoded/stored shouldn't matter (and quite often doesn't matter in the end). And that can even be a useful approach. But that is not the only valid way to think about this stuff.

Worse, demanding that people not even consider how a string is actually stored just leaves a huge opportunity for confusion. To be successful in using the "the unicode bug" mindset, many people first have to obtain an understanding of how Perl proposes to make the encoding not matter. So, for many people, you have to first explain the details about the encoding of Perl strings and how it gets changed and why that is often a reasonable approach before they can accept the "the encoding doesn't matter" premise and start making sound decisions based upon it.

So, for people not already steeped in the "the unicode bug" koolaid, it is best, in my experience, to start with "Perl has byte strings and UTF-8 strings and when they cross paths, the byte string is assumed to be Latin-1 and is upgraded to UTF-8". After that, you can explain that it isn't really UTF-8 but Perl's own extension to UTF-8 (called "utf8" or so) and that the assumption isn't strictly "Latin-1" (though the distinctions on that second point are too subtle for me to discern with any clarity). But those clarifications mostly just don't matter except to pedants. And then you can explain about how the encoding shouldn't matter and that you are meant to decode all inputs and encode all outputs, etc.
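The "decode all inputs and encode all outputs" advice can be sketched as below, using the core Encode module. The input bytes are a stand-in for whatever actually arrives from a file or socket, and the variable names are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the two UTF-8 bytes for U+00E9, as they would
# arrive from a file or socket.
my $input_bytes = "\xC3\xA9";

# Decode at the input boundary: bytes in, characters out.
my $text = decode('UTF-8', $input_bytes);

# Inside the program, work in characters. length() now counts
# code points (1 here) rather than bytes (2 in $input_bytes).
$text .= "\x{263A}";

# Encode at the output boundary: characters in, bytes out.
my $output_bytes = encode('UTF-8', $text);

printf "%d characters, %d bytes\n", length($text), length($output_bytes);
```

If every input is decoded and every output is encoded like this, the program never has to care how Perl stores $text in between, which is the premise being described.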

The worst part about the "the unicode bug" koolaid is that it completely blocks even discussing (much less actually considering) real improvements to Perl's string/Unicode handling. It is completely useless to propose that "assume Latin-1" should actually be "assume Windows-1252" or "assume current locale" because such concepts appear to simply not even make sense to many core maintainers of Perl now.
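As an illustration of why the choice of assumed encoding is not academic, here is a small sketch that decodes the same byte both ways with the core Encode module. It only shows explicit decoding; it does not change what Perl assumes when it upgrades a string on its own, and the variable names are invented.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Byte 0x93: a left double quotation mark in Windows-1252,
# but a C1 control character in ISO-8859-1 (Latin-1).
my $byte = "\x93";

my $as_latin1 = decode('ISO-8859-1', $byte);
my $as_cp1252 = decode('cp1252',     $byte);

printf "Latin-1 reading:      U+%04X\n", ord($as_latin1);
printf "Windows-1252 reading: U+%04X\n", ord($as_cp1252);
```

The Latin-1 reading yields U+0093 and the Windows-1252 reading yields U+201C, so the two assumptions give different characters for the same stored byte.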

- tye        
