variable-width encodings

in reply to Re^2: text encodings and perl
in thread text encodings and perl

For one, Unicode is not an encoding. Rather, UTF-8, UTF-16 etc. are encodings. And a rather common one of them — UTF-8 — is variable-width, i.e. not same number of bytes per character.

Both UTF‑8 and UTF‑16 are variable‐width encodings. The essential difference is the size of the code units: 8 bits for UTF‑8, 16 bits for UTF‑16. There is an infinitude of Java and Windows code (but not necessarily both) out there that screws this up, thinking that UTF‑16 is UCS‑2. It very much is not.

Plus UCS‑2 isn’t even a valid Unicode encoding in the first place. UTF‑8, UTF‑16, and UTF‑32 are, and of those, only the last uses fixed‐width code units. UTF‑16 is problematic and annoying in several ways that do not affect either UTF‑8 or UTF‑32, but that doesn’t make it fixed width.

So the same statement as you’ve made about UTF‑8 applies equally well, mutatis mutandis, to UTF‑16: “UTF‑16 is also a variable‐width encoding, i.e. not the same number of 16‑bit code units per character.” It would be a very, very good idea to remain ever conscious of this, given how much harm has been done by negligent programmers who have not done so.
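
To make the point concrete, here is a minimal sketch in Python (the same check could be written in Perl with the Encode module). The helper name `code_units` and the sample characters are my own choices for illustration. It counts how many code units one code point needs in each encoding form.

```python
# Count the code units needed to encode a single code point
# in each of the three Unicode encoding forms.
def code_units(ch: str) -> dict:
    """Return the number of code units per encoding form for one character."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }

# One ASCII letter, two BMP characters, and one character outside the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}", code_units(ch))
```

The first three samples take one 16‑bit code unit each in UTF‑16, but U+1F600 takes two (a surrogate pair). Code that assumes one code unit per character, i.e. treats UTF‑16 as UCS‑2, miscounts or splits exactly that kind of character. Only the UTF‑32 column stays at 1 throughout.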

Re: variable-width encodings by jdporter (Paladin) on Apr 10, 2011 at 18:42 UTC
wait... the `tchrist`? where you been all these years, man?
Re^2: variable-width encodings by tchrist (Pilgrim) on Apr 13, 2011 at 04:14 UTC
To say that I am subfond of writing clumsy ʜᴛᴍʟ merely to chat is gravely understating matters. And I haven’t found the pod option around here yet.
Re^3: variable-width encodings by jdporter (Paladin) on Apr 18, 2011 at 15:47 UTC
Can't argue with that. ;-)

I will offer, however, that if you're content to post in plain text (as I assume you are, given your ongoing participation in perl5.porters and I don't know what else), you could do that here simply by throwing `<code> ... </code>` tags around your post. If even that seems like too much hassle, you can make an empty set of `<code>` tags your default template via your Signature Settings. And if you don't like the two-phase preview/post cycle, you can enable one-click posting by unchecking "No Forced Preview" in your User Settings. With both of the above done, replying is reduced to the three-step Click Reply; Type Plain-text Message; Click Submit.

If being able to submit formatted posts without writing HTML is important to you, there are some ways to hack around the current limitation. For example, if you use Firefox, you could use the It's All Text! add-on and configure it to post-process your text through a filter such as pod2html. (Chrome's security model prevents such a slick solution, but something equivalent is still possible to achieve; see e.g. TextareaConnect or TextAid. I can't speak to other browsers.)

I reckon we are the only monastery ever to have a dungeon stuffed with 16,000 zombies.
