Some generic hints for encoding problems, especially with Unicode:
- Look at input and output files with a hex dumper / hex editor.
- Far too many editors convert encodings behind the scenes, or display garbage when their idea of the file's encoding differs from the actual encoding. od is available on most Unix systems, and there are tons of other hex dumpers and hex editors.
- Check the length of strings in perl.
- length always returns the number of characters. If you mess up the encodings and make perl read a "Unicode string" (a string encoded as UTF-8, UTF-16, and so on) as bytes, process it, and write it out as bytes again, the code appears to work, but some things (e.g. matching characters) behave strangely. The string "AOUÄÖÜ", encoded as UTF-8, is the byte sequence 41 4F 55 C3 84 C3 96 C3 9C. Read with the proper encoding setting, length returns 6. Read as a byte stream, or with a "byte=character" encoding like ISO-8859-1, length returns 9.
- Check the length of strings in the database.
- (Lesson learned from patching DBD::ODBC.) When communicating with a database, encoding problems hide until you read / write with a different tool or check the lengths. This is essentially the same problem as with file I/O, but there is no simple way to get a hex dump.
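To make the hex dump and length checks concrete, here is a minimal sketch in Perl. It assumes the core Encode module is available. The variable names are mine, and the nine bytes are the UTF-8 byte sequence from the length example, written as escapes so the result does not depend on how the script file itself is encoded.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The nine UTF-8 bytes 41 4F 55 C3 84 C3 96 C3 9C, written as escapes.
my $bytes = "\x41\x4F\x55\xC3\x84\xC3\x96\xC3\x9C";

# Poor man's hex dump: one two-digit hex value per byte.
my $hex = uc join(' ', unpack('(H2)*', $bytes));

# Treated as bytes (or as ISO-8859-1), every byte counts as one character.
my $len_bytes = length $bytes;

# Decoded as UTF-8, perl sees the six characters that were intended.
my $chars     = decode('UTF-8', $bytes);
my $len_chars = length $chars;

print "hex dump:       $hex\n";
print "length (bytes): $len_bytes\n";   # 9
print "length (chars): $len_chars\n";   # 6
```

If the two lengths are equal for a string that you know contains non-ASCII characters, the data was most likely never decoded on the way in.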
Feel free to copy from the files t/40UnicodeRoundTrip.t, t/41Unicode.t, and t/UChelp.pm included in DBD::ODBC.
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)