Re: text encodings and perl

by sundialsvc4 (Abbot)
on Nov 15, 2010 at 19:21 UTC (#871552)


in reply to text encodings and perl

Text encodings can be very difficult to work with, partly because there is no absolutely reliable way to detect what kind of string you are dealing with. There are at least three different strategies in common use (my terms):

  • Straight bytes:   “A character is a byte, and a byte is a character.”   But you do not necessarily know what printable character corresponds to a particular byte ... particularly for values beyond 127.
  • Double-byte character sets (DBCS):   Most characters are “straight bytes,” but there are a few “lead-in/lead-out characters” which introduce exceptions to that rule.   When a lead-in is seen, subsequent characters are represented by two bytes until a lead-out is seen.   (The person who devised this scheme should be drawn and quartered... but disk-drives and RAM chips were so much smaller then.)
  • n-byte encodings:   A character corresponds to n bytes, and each character corresponds to the same number of bytes.   Unicode is such a system.

Each of these schemes requires some amount of knowledge that may not be determinable by examining just the data itself.
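
As a small illustration of the "straight bytes" problem, here is a minimal sketch using the core Encode module. The byte value and the three encodings are arbitrary choices for the example, not anything from the original question. It decodes the same single byte under three different assumptions and shows that each yields a different character:

```perl
use strict;
use warnings;
use Encode qw(decode);

# One byte above 127. Nothing in the byte itself says which character it is.
my $bytes = "\xE9";

my %seen;
for my $enc (qw(iso-8859-1 cp1251 cp437)) {
    my $copy = $bytes;                 # work on a copy of the raw bytes
    my $char = decode($enc, $copy);    # interpret the byte under this encoding
    $seen{$enc} = sprintf 'U+%04X', ord $char;
    print "$enc: $seen{$enc}\n";
}
```

The three lines printed differ, even though the input is identical, which is why the encoding has to come from somewhere other than the data.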


Re^2: text encodings and perl
by Anonymous Monk on Nov 15, 2010 at 20:30 UTC
    ... Unicode is such a system.
    This is just so wrong. For one, Unicode is not an encoding. Rather, UTF-8, UTF-16 etc. are encodings. And a rather common one of them - UTF-8 - is variable-width, i.e. not same number of bytes per character...

      Thank you for the clarification.   I have revised the post, humbly eating my own words.

      For one, Unicode is not an encoding. Rather, UTF-8, UTF-16 etc. are encodings. And a rather common one of them — UTF-8 — is variable-width, i.e. not same number of bytes per character.

      Both UTF‑8 and UTF‑16 are variable‐width encodings. The essential difference is the size of the code units. There is an infinitude of Java and Windows code (but not necessarily both) out there that screws this up, thinking that UTF‑16 is UCS‑2. It very much is not so.

      Plus UCS‑2 isn’t even a valid Unicode encoding in the first place. UTF‑8, UTF‑16, and UTF‑32 are, and of those, only the last uses fixed‐width code units. UTF‑16 is problematic and annoying in several ways that do not affect either UTF‑8 or UTF‑32, but that doesn’t make it fixed width.

      So the same statement as you’ve made about UTF‑8 applies equally well, mutatis mutandis, to UTF‑16: “UTF‑16 is also a variable‐width encoding, i.e. not the same number of 16‑bit code units per character.” It would be a very, very good idea to remain ever conscious of this, given how much harm has been done by negligent programmers who have not done so.
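
The point about code units can be checked with a short sketch. The helper name `code_units` and the sample code points are made up for this illustration. It encodes one character at a time with the core Encode module and counts how many 8-bit, 16-bit, and 32-bit units each form needs (the big-endian variants are used so that no byte-order mark is counted):

```perl
use strict;
use warnings;
use Encode qw(encode);

# code_units() is a hypothetical helper: it returns the number of code
# units one character occupies in each of the three encoding forms.
sub code_units {
    my ($char) = @_;
    return {
        'UTF-8'  => length(encode('UTF-8',    $char)),      # 8-bit units
        'UTF-16' => length(encode('UTF-16BE', $char)) / 2,  # 16-bit units
        'UTF-32' => length(encode('UTF-32BE', $char)) / 4,  # 32-bit units
    };
}

for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
    my $n = code_units(chr $cp);
    printf "U+%04X  UTF-8: %d  UTF-16: %d  UTF-32: %d\n",
        $cp, @{$n}{qw(UTF-8 UTF-16 UTF-32)};
}
```

Only the UTF-32 column stays at 1 for every row. The UTF-8 count grows from 1 to 4, and the UTF-16 count becomes 2 for U+1F600, which lies outside the Basic Multilingual Plane and so needs a surrogate pair.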

        wait... the tchrist? where you been all these years, man?

Re^2: text encodings and perl
by andal (Friar) on Nov 16, 2010 at 08:27 UTC
    Each of these schemes requires some amount of knowledge that may not be determinable by examining just the data itself.

    Just to be clear: I didn't imply anywhere that the developer should determine the encoding by examining the data. Personally, I believe that guessing the encoding is a sin. It should be done only if there's no other choice. It is much better to force the user to provide the information about the encoding if it is not already known.
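
One way to follow that advice, as a rough sketch only: `decode_as` is a hypothetical helper name, and the sample strings are invented for the example. The caller has to name the encoding, and the core Encode module is asked to die on malformed input instead of silently substituting replacement characters:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# decode_as() is a hypothetical helper: the caller must name the encoding,
# and nothing is guessed from the data.
sub decode_as {
    my ($encoding, $bytes) = @_;
    my $enc = find_encoding($encoding)
        or die "Unknown encoding '$encoding'\n";
    # FB_CROAK makes malformed input fatal; LEAVE_SRC keeps $bytes unmodified.
    return $enc->decode($bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# Valid UTF-8: five bytes decode to four characters.
my $text = decode_as('UTF-8', "caf\xC3\xA9");
print length($text), "\n";

# The same word in Latin-1 bytes is not valid UTF-8, so decoding fails loudly.
my $ok = eval { decode_as('UTF-8', "caf\xE9 au lait"); 1 };
print "rejected invalid UTF-8\n" unless $ok;
```

A wrong declaration can still slip through when the bytes happen to be valid in the named encoding, so this catches mismatches only where the data is actually malformed.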
