Re: Unicode nightmare

by Thelonius (Priest)
on Jul 28, 2006 at 03:04 UTC

in reply to Unicode nightmare

(1) Make sure you don't lose any metadata that comes with the text (e.g. the charset parameter in a MIME Content-Type header).
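As a rough illustration of holding on to that metadata, here is a minimal sketch that pulls the charset parameter out of a Content-Type value and uses it to decode the bytes. The helper name charset_from_content_type and the sample header and bytes are made up for the example; decode and find_encoding are the usual Encode functions. A real MIME parser would deal with more edge cases (comments, odd quoting) than this regex does.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: extract the charset parameter from a Content-Type value.
sub charset_from_content_type {
    my ($header) = @_;
    return unless defined $header;
    # Accepts charset=foo or charset="foo", case-insensitively.
    if ($header =~ /;\s*charset\s*=\s*"?([^\s";]+)"?/i) {
        return $1;
    }
    return;
}

# Sample values, assumed for illustration only.
my $header  = 'text/plain; charset="ISO-8859-1"';
my $bytes   = "caf\xE9";
my $charset = charset_from_content_type($header);

# Decode only when Encode recognises the label; otherwise keep the raw bytes.
my $text = (defined $charset && find_encoding($charset))
         ? decode($charset, $bytes)
         : $bytes;
```

If the label is missing or unknown, the bytes are left untouched rather than guessed at, so nothing is lost before you get to the harder detection steps.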

(2) If your text includes the ESCAPE character, it may have ISO-2022 shift sequences in it which identify the character set. All the registered character sets are listed in the ISO registry, and the actual escape codes are defined in each registration's PDF file. There doesn't seem to be a comprehensive table anywhere on the internet! Note that when ISO registry #165 says that the escape sequence (for G2) is ESC 2/4 2/10 4/5, that means "\e\x24\x2A\x45": each pair is column/row, so 2/10 is the byte 2*16+10 = 0x2A. (Of course "\x24\x2A\x45" are the characters $ * E.)
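To make the notation concrete, here is a small sketch that turns a registry-style sequence such as "ESC 2/4 2/10 4/5" into the bytes to search for. The function names colrow_to_byte and sequence_to_bytes and the sample data are mine, not from any module; the only rule used is byte = column * 16 + row.

```perl
use strict;
use warnings;

# Convert column/row notation ("2/4") to the byte it names: column * 16 + row.
sub colrow_to_byte {
    my ($notation) = @_;
    my ($col, $row) = $notation =~ m{^(\d+)/(\d+)$}
        or die "bad column/row notation: $notation";
    return chr($col * 16 + $row);
}

# Build the byte string for a whole sequence, e.g. "ESC 2/4 2/10 4/5".
sub sequence_to_bytes {
    my ($spec) = @_;
    return join '',
        map { $_ eq 'ESC' ? "\e" : colrow_to_byte($_) }
        split ' ', $spec;
}

# The registry #165 sequence for G2, as quoted above.
my $seq = sequence_to_bytes('ESC 2/4 2/10 4/5');

# Sample data (made up) that happens to contain that sequence.
my $data  = "abc" . "\e\x24\x2A\x45" . "xyz";
my $found = index($data, $seq) >= 0;
```

With a few of these built from the registrations you care about, a plain index() or regex match over the raw bytes is enough to tell you which sets a file announces.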

You don't have to understand G0, G1, and G2 to recognize the character sets, although you would need to in order to actually translate them to Unicode. I don't know if Encode handles ISO-2022 encoding generally. ICU handles the more commonly used parts of it.
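For the recognition half, something like the following sketch may be enough: it scans raw bytes for designation sequences and reports the final byte (which, with the multibyte flag, is what you look up in the registry) without tracking any shift state. find_designations is a made-up name, and the pattern is a simplification of the ISO 2022 grammar as I understand it: it ignores extra intermediate bytes and accepts any final byte after a bare ESC $, so treat it as a starting point, not a validator.

```perl
use strict;
use warnings;

# Intermediate byte => G-set it designates. ( ) * + are used for 94-character
# sets, - . / for 96-character sets.
my %target = (
    '(' => 'G0', ')' => 'G1', '*' => 'G2', '+' => 'G3',
    '-' => 'G1', '.' => 'G2', '/' => 'G3',
);

# Return one hash per designation sequence found in a byte string.
sub find_designations {
    my ($bytes) = @_;
    my @found;
    while ($bytes =~ /\e(\$?)([\x28-\x2B\x2D-\x2F]?)([\x30-\x7E])/g) {
        my ($multi, $inter, $final) = ($1, $2, $3);
        # ESC followed directly by a final byte is not a designation.
        next unless $multi || $inter;
        push @found, {
            multibyte => $multi ? 1 : 0,
            # The older short form ESC $ F designates to G0.
            target    => $inter eq '' ? 'G0' : $target{$inter},
            final     => $final,
        };
    }
    return @found;
}

# Sample input (made up): ESC $ B, some text, ESC ( B, then ESC $ * E.
my @d = find_designations(
    "\e\x24\x42" . "abc" . "\e\x28\x42" . "\e\x24\x2A\x45"
);
```

Actually converting the text to Unicode is where the G0/G1/G2 bookkeeping comes in, since you then have to know which set is invoked at each byte.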

Some general character set links:

