Ok, after some more research, I think I have a better understanding of the situation. Forgive me if I am stating the obvious, but this is for the chaps like myself. UTF-8 is not a character set; it is an encoding method for the UCS/Unicode character set, which is a multi-byte charset. ISO-8859-1, a superset of US-ASCII, is a single-byte character set: each character maps directly to one byte, so no special encoding has to be done. The way UTF-8 works is thus:
- UCS characters U+0000-U+007F are encoded as single bytes, which allows for ASCII compatibility
- All UCS characters above U+007F are encoded as a sequence of bytes, each with its most significant bit set.
- The first byte in a multibyte sequence is always in the range of 0xC0-0xFD, and indicates how many bytes follow for this character. All further bytes in the same sequence are in the range of 0x80-0xBF
- All 2^31 possible UCS codes can be encoded
- The bytes 0xFE & 0xFF are never used in UTF-8 encoding
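The byte-range rules above can be sketched as a small classifier. This is a hypothetical helper in Python (not anything from CGI.pm), just to make the ranges concrete:

```python
def classify(b):
    """Classify a single byte according to the UTF-8 rules listed above."""
    if b <= 0x7F:
        return "ASCII"          # U+0000-U+007F pass through as plain bytes
    if b <= 0xBF:
        return "continuation"   # 0x80-0xBF: a further byte in a sequence
    if b <= 0xFD:
        return "lead"           # 0xC0-0xFD: first byte, announces the length
    return "invalid"            # 0xFE and 0xFF never appear in UTF-8
```

So in the %C3%B6 case below, 0xC3 is a lead byte and 0xB6 a continuation byte.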
The following table describes the byte sequences used to represent a character.
Unicode/UCS number    | Byte sequence
U+00000000-U+0000007F | 0xxxxxxx
U+00000080-U+000007FF | 110xxxxx 10xxxxxx
U+00000800-U+0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx
U+00010000-U+001FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U+00200000-U+03FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U+04000000-U+7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The x bit positions are filled with the bits of the character's number in binary. The rightmost bit is the least-significant. Note that the number of leading one bits in the first byte is identical to the total number of bytes in the sequence.
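The whole table can be turned into a short encoder. This is a sketch in Python of the original 1-6 byte scheme described above (a hypothetical function, not part of CGI.pm); note that for each length the lead byte plus continuation bytes carry 5n+1 payload bits:

```python
def utf8_encode(cp):
    """Encode a UCS code point per the table above (original 1-6 byte scheme)."""
    if cp < 0x80:
        return bytes([cp])  # ASCII passes through unchanged
    # (sequence length, lead-byte marker) for 2..6 byte sequences
    for nbytes, lead in ((2, 0xC0), (3, 0xE0), (4, 0xF0), (5, 0xF8), (6, 0xFC)):
        if cp < (1 << (5 * nbytes + 1)):        # n bytes hold 5n+1 payload bits
            out = []
            for _ in range(nbytes - 1):
                out.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
                cp >>= 6
            out.append(lead | cp)               # lead byte takes the remaining bits
            return bytes(reversed(out))
    raise ValueError("code point out of range")
```

Running it on 0xF6 reproduces the worked example below: `utf8_encode(0xF6)` gives the bytes 0xC3 0xB6.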
For example:
U+000000F6 (LATIN SMALL LETTER O WITH DIAERESIS, 'ö') = 1111 0110 in binary.
Since 0xF6 is greater than 0x7F, UTF-8 uses the second row of the above table to encode this character. Its bits, padded to eleven with leading zeros (000 1111 0110), are filled into the x positions:
110xxxxx 10xxxxxx (the empty template is 0xC0 0x80)
11000011 10110110 = 0xC3 0xB6
This explains how %F6 is transcoded to %C3%B6.
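The same transcoding can be watched from Python's standard library, which by default percent-escapes the UTF-8 bytes of a string (shown here only as an analogue to what the browser does, not the Perl side of things):

```python
from urllib.parse import quote, unquote

# 'ö' is first encoded to the UTF-8 bytes 0xC3 0xB6, then percent-escaped
escaped = quote("ö")         # '%C3%B6'
restored = unquote(escaped)  # back to 'ö'
```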
CGI.pm is placing single-byte characters from the ISO-8859-1 character set in place of the two-byte Unicode character, which is expected. I can also run the string through a UTF-8 decoder and it will display the proper character. However, if I send the string back to the browser undecoded while the browser is in UTF-8 mode, it shows up as the wrong character (a Chinese character). I expect that if I want to process the string in Perl and have the proper character in it, I have to decode the two bytes with a UTF-8 decoder. What I would not expect is having to decode the string just to turn around and display it back to a browser that is already in UTF-8 'mode'; yet when I do decode the string, it does display properly in the browser.
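The decode-or-not difference is easy to demonstrate outside of Perl. A sketch in Python of the general principle (not the CGI.pm code itself): the same two bytes give the intended character when decoded as UTF-8, and mojibake when each byte is taken as its own ISO-8859-1 character:

```python
raw = b"\xc3\xb6"  # the two bytes received for %C3%B6

# Decoding as UTF-8 yields the intended single character:
as_utf8 = raw.decode("utf-8")      # 'ö'

# Treating each byte as an ISO-8859-1 character gives two characters:
as_latin1 = raw.decode("latin-1")  # 'Ã¶'
```

Whether the browser shows 'ö' or garbage then depends on which interpretation the output layer re-encodes.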
Note: My source for all this new-found UCS/Unicode knowledge was http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucs; some portions were copied and pasted, while others were paraphrased. Thanks to Markus Kuhn for his wonderful resource.