FMTYEWTK about Characters vs Bytes
by halley (Prior) on Feb 20, 2004 at 15:49 UTC
Characters are NOT the same as bytes.
The term character is a logical term (meaning it defines something in terms of the way people think of things). The term byte is a device term (meaning it defines something in terms of the way the hardware was designed). The difference is in the encoding.
The character 'A' must be encoded into a certain bit pattern that the machine can use. This is a mostly arbitrary decision on the part of hardware implementors. You could say that A=1, B=2, and so on. For ASCII (the American Standard Code for Information Interchange), designers chose a 7-bit encoding (A=1000001), giving room for encoding 128 different characters in those seven bits.
Since a byte is 8 bits on most hardware today, the hardware just pads the extra bit with a zero (A=01000001). So even in the case of simple ASCII, characters are not the same thing as bytes.
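To see the character, its number, and its bit patterns side by side, here is a minimal Python sketch. Python is not part of the original post; it is used here only for illustration, and the variable names are made up.

```python
# Look at the character 'A' as a number and as bit patterns.
code = ord("A")                    # the number ASCII assigns to 'A'
seven_bits = format(code, "07b")   # the 7-bit ASCII pattern
eight_bits = format(code, "08b")   # the same value padded out to a full byte

print(code, seven_bits, eight_bits)  # 65 1000001 01000001
```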
ASCII reserves the first 32 encodings (plus one more, DEL, at 127) for control characters like TAB, NEWLINE, and FORMFEED. The remaining encodings are for printing characters like SPACE, !, @, #, $, A-Z, a-z, 0-9, and so on.
What can the eighth bit be used for? It gives 128 more character encoding numbers. Each manufacturer has used the other 128 encodings above ASCII (values 128-255) for special printing characters, such as GRAY-BLOCK, N-WITH-TILDE, and YEN. They've been pretty horrible about consistency, though: IBM's GRAY-BLOCK character is not the same number as Commodore PET's GRAY-BLOCK character. Some computers have a YEN symbol while others don't.
ASCII is not the only way to encode characters into bit patterns. EBCDIC was a popular eight-bit encoding scheme used on some mainframes. If you transferred a file from one machine to another, you would have to convert each encoded character according to a look-up table, so A=193 (EBCDIC) would end up in the new file as A=65 (ASCII), or vice versa. Characters that the two encodings did not have in common would have to be dropped or assigned a replacement value, essentially destroying information.
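The look-up-table conversion can be sketched with Python's built-in codecs. This is only an illustration under an assumption: EBCDIC came in many variants, and the sketch picks code page 037 (Python's `cp037`) to stand in for "EBCDIC". The function name `ebcdic_to_ascii` is hypothetical.

```python
# Assumption: cp037 (one common EBCDIC code page) stands in for "EBCDIC".
EBCDIC = "cp037"

def ebcdic_to_ascii(data: bytes) -> bytes:
    """Re-encode EBCDIC bytes as ASCII; unmappable characters become '?'."""
    text = data.decode(EBCDIC)                       # bytes -> characters
    return text.encode("ascii", errors="replace")    # characters -> ASCII bytes

a_ebcdic = "A".encode(EBCDIC)    # a single byte with value 193
a_ascii = "A".encode("ascii")    # a single byte with value 65
converted = ebcdic_to_ascii("ABC".encode(EBCDIC))

# The cent sign exists in cp037 but not in ASCII, so the conversion loses it.
lossy = ebcdic_to_ascii("5¢".encode(EBCDIC))
```

Here `lossy` ends with a question mark where the cent sign used to be, which is the "destroying information" case described above.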
There are special-purpose encodings for certain applications. On the DEC PDP-11, which usually interacted with the user in ASCII, another encoding called RAD50 (RADIX-50) was a common way to pack 3 letters and digits into two bytes. It had room for only 40 different characters (letters, digits, and a couple of symbols), so it was suited only for specific tasks. FORTRAN compilers often used it to pack six-letter variable names in less space. This also allowed "six-dot-three" filenames to fit in six-byte data records on the storage devices. "Eight-dot-three" filenames came later.
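The three-characters-in-two-bytes trick is just base-40 arithmetic: three digits in base 40 top out at 63,999, which fits in 16 bits. The sketch below shows the idea using the packing scheme DEC documented as RADIX-50 (RAD50). The character table is an assumption based on the commonly documented PDP-11 ordering, and the function names are hypothetical.

```python
# Assumed PDP-11 RADIX-50 table: the index of each character is its code (0-39).
# Position 29 is shown as '%' here; some DEC documentation lists it as unused.
RAD50_CHARS = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789"

def rad50_pack(text: str) -> int:
    """Pack exactly three characters into one 16-bit value (base 40)."""
    if len(text) != 3:
        raise ValueError("RAD50 packs exactly three characters per word")
    value = 0
    for ch in text:
        # str.index raises ValueError for characters outside the table
        value = value * 40 + RAD50_CHARS.index(ch)
    return value

def rad50_unpack(value: int) -> str:
    """Recover the three characters from a 16-bit RAD50 value."""
    first, rest = divmod(value, 40 * 40)
    second, third = divmod(rest, 40)
    return RAD50_CHARS[first] + RAD50_CHARS[second] + RAD50_CHARS[third]

word = rad50_pack("ABC")                # 1*1600 + 2*40 + 3 = 1683
packed = word.to_bytes(2, "little")     # two bytes instead of three
largest = rad50_pack("999")             # the biggest possible value, 63999
```

A six-dot-three filename is nine characters, so it packs into three such words, or six bytes.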
Windows and MS-DOS offered code pages to help international users fit their most important data into one-byte character encodings. If you were using a Russian code page, then a byte value of 136 meant one thing, but it meant an entirely different thing in the Netherlands' code page. Web pages use a similar scheme, declaring which character set (such as ISO-8859-1) their bytes should be read in.
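The same byte decoding to different characters is easy to demonstrate. The post does not name specific code pages, so the following sketch assumes MS-DOS code page 866 for Russian and code page 850 for Western Europe; both are available as Python codecs.

```python
raw = bytes([136])   # one byte, value 136

# Assumption: cp866 stands in for "a Russian code page" and cp850 for a
# Western European one; the original post does not say which it had in mind.
as_russian = raw.decode("cp866")   # a Cyrillic capital letter
as_western = raw.decode("cp850")   # a Latin letter with a circumflex

print(repr(as_russian), repr(as_western))
```

The byte never changed; only the table used to interpret it did.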
The problem of defining content on a certain character set gets worse in a global community. Unicode was designed to replace all that swapping around of character sets and make one canonical encoding. Every known character in every language would get its own permanent number.
Unicode has several tens of thousands of characters, from Hebrew to Arabic to Kanji to Cyrillic. It has more control characters to deal with the minimum text positioning requirements of various languages, such as the doubled layering of Japanese Kana over Kanji which helps readers with uncommon words.
The numbers range well past 60,000 (the Unicode code space runs up to U+10FFFF, a little over a million). You clearly can't encode those into single bytes anymore. Even if you don't have the which-language-is-supported character set problem, you still haven't gotten rid of the encoding problem. You never will: characters are human concepts and bytes are device concepts.
The brute force encoding (now called UTF-32) would just take four bytes for every character. You still would have to agree on whether the least-significant byte is written first or last, but there's room for every conceivable current and future Unicode character number. Oh, but it wastes a LOT of space, since the vast majority of commerce can still just fit in the ASCII range. Just saying hello takes twenty bytes.
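The four-bytes-per-character arithmetic, and the byte-order question, can be checked with Python's UTF-32 codecs. This is just an illustrative sketch; the variable names are made up.

```python
text = "hello"

little = text.encode("utf-32-le")   # least-significant byte written first
big = text.encode("utf-32-be")      # least-significant byte written last

size_utf32 = len(little)                  # 5 characters * 4 bytes each
size_ascii = len(text.encode("ascii"))    # 5 characters * 1 byte each

first_char_le = little[:4]   # the four bytes used for 'h', little-endian
first_char_be = big[:4]      # the same character, big-endian

print(size_utf32, size_ascii, first_char_le, first_char_be)
```

Both byte strings hold the same five character numbers; a reader has to know which order was used to get them back.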
To optimize the storage, a few specialized encodings have come up. The Unicode character numbers don't change, but the way the numbers are packed into bytes and bits does. Instead of four bytes per character, which was really excessive, let's do two bytes, and if it just happens to be one of those rare numbers that won't fit in two bytes, use a reserved range of two-byte values to indicate that the rest of the number follows in a second two-byte unit (a "surrogate pair"). This is UTF-16. It's getting pretty complicated to encode or decode strings, but it saves a lot of space.
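In UTF-16 as standardized, a number that won't fit in two bytes is split across two 16-bit units drawn from reserved ranges (0xD800-0xDBFF for the first, 0xDC00-0xDFFF for the second). A minimal sketch of that split follows; the function name `utf16_units` is hypothetical, and the sketch does not reject invalid input such as lone surrogate values.

```python
from typing import List

def utf16_units(code_point: int) -> List[int]:
    """Return the 16-bit units UTF-16 uses for one Unicode character number."""
    if code_point < 0x10000:
        return [code_point]               # fits in a single two-byte unit
    offset = code_point - 0x10000         # 20 bits remain to be stored
    high = 0xD800 + (offset >> 10)        # top 10 bits, in the reserved range
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits, in the reserved range
    return [high, low]

plain = utf16_units(ord("A"))       # one unit
clef = utf16_units(0x1D11E)         # MUSICAL SYMBOL G CLEF needs two units

# Cross-check against Python's own big-endian UTF-16 encoder.
builtin = "\U0001D11E".encode("utf-16-be")
```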
Currently, the most popular Unicode encoding is even tighter. It's called UTF-8. It allows any all-ASCII content to remain in the old one-byte-per-character encoding, completely unchanged. That's popular and efficient for all that ASCII content. The eighth (high) bit doesn't offer you 128 more characters. Instead it tells the decoder that this byte is part of a non-ASCII character, and that some more bits are in the next byte. Or the third byte. Or the fourth byte. The original design stretched as far as six bytes to grab a single character with a high number, but the current standard (RFC 3629) stops at four, which is enough for every Unicode number.
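The bit-flag scheme is easier to follow written out. Below is a hand-rolled sketch of a UTF-8 encoder for a single character number, limited to the four-byte forms of the current standard. The function name `utf8_encode` is hypothetical, and real code should simply use the language's built-in encoder (here, `str.encode("utf-8")`); the sketch also skips validity checks such as rejecting surrogate values.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode character number as UTF-8 bytes (illustration only)."""
    if code_point < 0x80:        # 7 bits: plain ASCII, high bit is 0
        return bytes([code_point])
    if code_point < 0x800:       # 11 bits: lead byte 110xxxxx + 1 continuation
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 16 bits: lead byte 1110xxxx + 2 continuations
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:   # 21 bits: lead byte 11110xxx + 3 continuations
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("not a Unicode character number")

# Byte counts grow with the character number: 1, 2, 3, then 4 bytes.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [len(utf8_encode(ord(ch))) for ch in samples]
```

Every continuation byte starts with the bits 10, so a decoder can always tell a lead byte from the middle of a character.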
UTF-8 takes advantage of the fact that Unicode assigns the more common world characters lower numbers, so it needs less space to encode them. Rare or specialized characters, such as Egyptian hieroglyphs or cuneiform, take more space to store. (Scripts like Klingon and Feanorian (Tolkien Elvish) Tengwar have been proposed for Unicode but are not officially encoded.)
There will still be application-specific encodings, even using Unicode's numbering scheme. Just as RAD50 was designed to store FORTRAN variable names and filenames in less space, an encoding called Punycode can encode just about any Unicode string using only the few characters allowed in a domain name (ASCII letters, digits, and the hyphen). The domain registrar may see "egbpdaj6bu4bxfgehfvwxn.com" (in practice with an "xn--" prefix marking the label as Punycode), but a Punycode-compliant decoder in a modern web browser would decode those characters and render a domain name spelled with flowing Egyptian Arabic characters.
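Python ships a `punycode` codec, so the domain label quoted above can be decoded directly. This is a small sketch with made-up variable names; it works on the bare label, without the ".com" and without the "xn--" prefix that real internationalized domain labels carry.

```python
label = b"egbpdaj6bu4bxfgehfvwxn"      # the ASCII-only form from the text above

decoded = label.decode("punycode")     # back to Unicode characters
reencoded = decoded.encode("punycode") # and back to the ASCII-only form

print(decoded)
print(len(label), "ASCII bytes carry", len(decoded), "Unicode characters")
```

The round trip shows that nothing is lost: the ASCII label and the Unicode string are two encodings of the same characters.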