Text encodings can be very difficult to work with, partly because there is no absolutely reliable way to detect what kind of string you are dealing with. There are at least three different strategies in common use (my terms... each is illustrated by a short sketch after the list):
- Straight bytes: “A character is a byte, and a byte is a character.” But you do not necessarily know what printable character corresponds to a particular byte ... particularly for values beyond 127.
- Double-byte character sets (DBCS): Most characters are “straight bytes,” but a few “lead-in/lead-out characters” introduce exceptions to that rule. When a lead-in is seen, subsequent characters are represented by two bytes each until a lead-out is seen. (The person who devised this scheme should be drawn and quartered... but disk drives and RAM chips were so much smaller then.)
- n-byte encodings: Every character corresponds to the same fixed number of bytes, n. Unicode, in its fixed-width forms such as UTF-32, is such a system.
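
To make the “straight bytes” case concrete: the same byte is often perfectly valid in several single-byte encodings, yet prints as a different character in each. A minimal Python sketch (cp1252, cp437, and Mac Roman are arbitrary illustrative picks):

```python
# The byte 0x93 decodes without error in all three encodings below,
# but it stands for a different printable character in each one.
raw = bytes([0x93])
print(raw.decode("cp1252"))     # '“' (left double quotation mark)
print(raw.decode("cp437"))      # 'ô'
print(raw.decode("mac_roman"))  # 'ì'
```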
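The lead-in/lead-out idea survives in encodings like ISO-2022-JP, which makes a convenient stand-in here (my choice of example, not something the scheme above is limited to): the escape sequence ESC $ B shifts the stream into two-byte mode, and ESC ( B shifts it back to one-byte ASCII.

```python
# \x1b$B announces "two bytes per character from here on";
# \x1b(B returns the stream to ordinary one-byte ASCII.
jp = "ABこんにちはAB".encode("iso2022_jp")
print(jp)  # b'AB\x1b$B$3$s$K$A$O\x1b(BAB'
```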
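And the fixed-width strategy, using UTF-32 as the example: every character occupies exactly the same number of bytes, which is what makes indexing into the string trivial.

```python
# In UTF-32 (big-endian, no byte-order mark) ASCII 'A', accented 'é',
# and the CJK character '中' each occupy exactly four bytes.
s = "Aé中"
print(len(s.encode("utf-32-be")))  # 12 -- three characters, 4 bytes each
```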
Each of these schemes requires some out-of-band knowledge (which code page? which shift state? how many bytes per character?) that may not be determinable by examining the data itself.
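
A quick way to see why: the same five bytes below decode without error under three different encodings, producing three different strings, and nothing in the data says which reading was intended (the encodings are again arbitrary picks):

```python
# "naïve" encoded with cp1252 yields the bytes b'na\xefve'. Those same
# bytes also decode cleanly as cp437 and KOI8-R -- just to other text.
raw = "naïve".encode("cp1252")
for enc in ("cp1252", "cp437", "koi8-r"):
    print(f"{enc:>8}: {raw.decode(enc)}")
```

Heuristic detectors (the Python chardet package, for example) can make a statistical guess from byte frequencies, but a guess is all it can ever be.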