On the surface, yes, it looks bad. But in my experience you can cover nearly all cases (99.5% or so) by following a few simple rules, no matter the encoding:
- Convert all incoming data to Perl's internal representation (utf8::decode, Encode::decode, or similar; see the first sketch after this list).
- Convert all outgoing data to the correct encoding (UTF-8 or similar).
- Unless you really have to verify very specific properties of the text, just treat it as an opaque binary blob.
- 0 + $var works for converting text to numeric values.
- If you do any kind of string comparison in your code, always normalize both sides with Unicode::Normalize, and always stick to the same normalization form (see the second sketch after this list).
- Don't assume that any other text encoding standard is saner, or even that it is truly global.
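Here is a minimal sketch of the decode-at-input / encode-at-output pattern in Perl. The file name and the digit-extracting regex are assumptions for illustration; the point is that decoding happens once at the input boundary and encoding once at the output boundary:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode incoming bytes into Perl's internal character strings.
open my $in, '<:raw', 'input.txt' or die "open: $!";   # hypothetical input file
my $bytes = do { local $/; <$in> };
close $in;
my $text = decode('UTF-8', $bytes);   # now a character string

# In between, treat the text as an opaque blob; 0 + $var forces
# numeric context when you do need a number out of it.
my ($digits) = $text =~ /(\d+)/;      # e.g. pull out a digit run
my $n = 0 + ($digits // 0);

# Encode on the way out (an :encoding layer does this implicitly).
binmode STDOUT, ':encoding(UTF-8)';
print "$n\n", $text;
```

Equivalently, Encode::encode('UTF-8', $string) gives you the bytes explicitly if you are handing them to something other than a file handle.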
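And a small sketch of the normalization rule, assuming both sides are already decoded character strings. NFC is used here, but any form works as long as both sides use the same one:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $precomposed = "caf\x{E9}";     # "café" with U+00E9 as one character
my $decomposed  = "cafe\x{301}";   # "café" as 'e' plus a combining acute

# Compared directly, the two spellings differ...
print $precomposed eq $decomposed ? "equal\n" : "not equal\n";            # not equal

# ...but after normalizing both sides to the same form they compare equal.
print NFC($precomposed) eq NFC($decomposed) ? "equal\n" : "not equal\n";  # equal
```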
The basic ugliness of Unicode (and of text encodings in general) stems not from their engineers but from the fact that human language is a complicated mess. Written language is still a fairly recent development in human evolution, and we are still working out the finer details. At least with Unicode, you don't have to constantly switch schemes depending on who is using your software.