PerlMonks  

To date, the worst problems I've had with character encoding in the post-Unicode era (when everyone takes ASCII for granted -- I had nastier problems than these before this era) have arisen because some part of the system knew about character encodings, not because parts of the system were blissfully unaware of Unicode encodings.

In my experience, most current forms of information interchange don't cleanly support conveying the character encoding along with the data. Most file systems don't track which encoding the text in a file is recorded in. If you read data from a file, pipe, socket, etc., the built-in library calls are not going to give you any help determining what character encoding you are dealing with.
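As a minimal sketch of that last point (in Python, with a byte sequence chosen purely for illustration): the same bytes decode without complaint under more than one encoding, and nothing in the bytes themselves says which reading was intended.

```python
# Two bytes as they might arrive from a file, pipe, or socket.
# The value is an illustrative assumption, not taken from any real system.
raw = b"\xc3\xa9"

# Both decodes succeed; the bytes carry no label saying which is right.
as_utf8 = raw.decode("utf-8")      # read as one character
as_latin1 = raw.decode("latin-1")  # read as two characters

print(len(as_utf8), len(as_latin1))
print(as_utf8 == as_latin1)
```

Which of the two readings is correct can only be settled by information that travels outside the byte stream.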

So, if you want to deal with non-ASCII characters, you usually end up in one of two situations. Either you are immersed in a particular encoding environment, so you mostly just assume that one encoding (or pick between it and ASCII), and things mostly work well. Or you have to put effort into tracking the encoding on one end and effort into conveying it to the other end. In both of these situations, it is the pieces in between that are aware of encodings that are likely to cause you problems.

If I have a piece in the middle that blissfully ignores character encoding and just shuffles the bytes between the part on its left and the part on its right, then I can happily, successfully, correctly pass characters through it in whatever encoding I desire.
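A rough illustration of such an encoding-ignorant middle piece, sketched in Python. The `relay` function and the sample strings are hypothetical; the point is only that a layer which never decodes anything cannot damage any encoding.

```python
import io


def relay(src, dst, chunk_size=4096):
    """Copy bytes from src to dst without ever interpreting them."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)


# Sample text per encoding (illustrative; each string is representable
# in the encoding it is paired with).
samples = {
    "utf-8": "na\u00efve \u65e5\u672c\u8a9e",
    "utf-16-le": "na\u00efve \u65e5\u672c\u8a9e",
    "latin-1": "na\u00efve",
    "shift_jis": "\u65e5\u672c\u8a9e",
}

round_trips = {}
for encoding, text in samples.items():
    left = io.BytesIO(text.encode(encoding))   # the part on the left
    right = io.BytesIO()                       # the part on the right
    relay(left, right)
    # The far end decodes with the encoding the two ends agreed on.
    round_trips[encoding] = right.getvalue().decode(encoding) == text

print(round_trips)
```

The relay needs no configuration and no knowledge of the four encodings; only the two ends have to agree.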

But if I have a part in the middle that wants/expects to know the encoding of the characters, then it is likely to croak when I send characters that aren't encoded as it expected or to "helpfully" translate from one encoding to another as it passes the data between its neighbors in the system.
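Both failure modes can be sketched in a few lines of Python. The two middle-layer functions below are hypothetical stand-ins, not modeled on any particular product: one insists on UTF-8, the other assumes Latin-1 input and "helpfully" converts it to UTF-8.

```python
def strict_utf8_middle(data: bytes) -> bytes:
    """A layer that insists its input is UTF-8 and fails otherwise."""
    return data.decode("utf-8").encode("utf-8")


def helpful_latin1_to_utf8_middle(data: bytes) -> bytes:
    """A layer that assumes Latin-1 input and converts it to UTF-8."""
    return data.decode("latin-1").encode("utf-8")


word = "caf\u00e9"

# Failure mode 1: the layer croaks on data in an encoding it did not expect.
try:
    strict_utf8_middle(word.encode("latin-1"))
    croaked = False
except UnicodeDecodeError:
    croaked = True

# Failure mode 2: the layer translates data that was already UTF-8,
# so the far end receives double-encoded text.
mangled = helpful_latin1_to_utf8_middle(word.encode("utf-8"))
received = mangled.decode("utf-8")

print(croaked)
print(received == word)
```

In the second case no error is raised anywhere; the text simply arrives wrong, which is often harder to track down than an outright failure.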

For such a middle-layer piece, I then need to tell it what encoding is supposed to be used on each side. So instead of picking an encoding, making sure the far end knows which one I picked, and being done with the problem, I have to identify all of the parts in the middle that are encoding-aware and figure out how each of them wants to be informed of encodings (if they even let me tell them what encoding to use -- they are likely to simply insist on UTF-8, and anyone who wants otherwise can go jump in a lake). Then I have to figure out how to get all of these different encodings and notices of encodings to match up so that what comes out the far end is sane.

Someone needs to define the "encoded character stream" to replace the "byte stream" (that Unix managed to make universally supported) so that every I/O layer can choose to either automatically know about encodings or remain blissfully unaware of them. Until then, adding awareness of Unicode to layers will likely cause more problems in many situations.
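The post does not say what such a stream would look like. Purely as an illustration of the idea, here is one hypothetical shape in Python: each chunk carries its bytes together with a declared encoding, so a layer can either pass the chunk along untouched or use the label to decode. The names `EncodedChunk`, `unaware_layer`, and `aware_layer` are invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodedChunk:
    """Hypothetical unit of an 'encoded character stream'."""
    data: bytes      # the raw bytes, exactly as sent
    encoding: str    # the encoding the sender declares for them

    def text(self) -> str:
        return self.data.decode(self.encoding)


def unaware_layer(chunk: EncodedChunk) -> EncodedChunk:
    # Blissfully unaware: forwards bytes and label without looking at either.
    return chunk


def aware_layer(chunk: EncodedChunk, target: str = "utf-8") -> EncodedChunk:
    # Encoding-aware: transcodes, and updates the label to match.
    return EncodedChunk(chunk.text().encode(target), target)


sent = EncodedChunk("caf\u00e9".encode("latin-1"), "latin-1")
passed = unaware_layer(sent)
converted = aware_layer(passed)

print(passed == sent)
print(converted.encoding, converted.text() == sent.text())
```

Because the label travels with the data, the two kinds of layer can be mixed in any order without the ends having to negotiate with each one separately.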

Nope, I don't have the solution. And I understand the problem of finding a place that doesn't support Unicode when you need it to. I'm just noting the trend and making a prediction that things are going to get much worse as they get better.

                - tye

In reply to Re: Programmers, script languages, and Unicode (ignorance is bliss) by tye
in thread Programmers, script languages, and Unicode by dbwiz
