Rather than printing the string, which really doesn't tell you much (output that looks correct might even mean the data is wrong!), try dumping the character codes. Also check what perl thinks the "length" is.
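use feature 'say';   # needed for say()
my $x= "\x{B0}F";
say map sprintf("%02X ",ord), split //, $x;   # hex code of each character
say length $x;   # number of characters perl sees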
One problem is that the degree sign 0xB0 is within the first 256 code points of unicode (above the ASCII range, but still below 0x100), so perl can store it internally either as a single byte or as utf-8, and this can make it extra confusing to track down the problem.
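As a rough illustration of that dual representation, here is a minimal sketch using the core utf8::upgrade and utf8::is_utf8 functions. The variable names are made up for the example; the point is only that the two strings hold the same characters while being stored differently.

```perl
use strict;
use warnings;

my $native   = "\x{B0}F";   # normally stored one byte per character
my $upgraded = "\x{B0}F";
utf8::upgrade($upgraded);   # force perl's internal utf-8 representation

# Same characters either way; only the internal storage differs.
printf "equal: %s\n", ($native eq $upgraded ? "yes" : "no");
printf "lengths: %d and %d\n", length($native), length($upgraded);
printf "internal utf-8 flag: %d and %d\n",
    (utf8::is_utf8($native)   ? 1 : 0),
    (utf8::is_utf8($upgraded) ? 1 : 0);
```

Both strings compare equal and have length 2, which is why the internal flag by itself is not a reliable sign of an encoding bug.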
Some pointers that might help the debugging:
- According to docs, SQLite *always* stores text as unicode, so a text column shouldn't normally contain the raw \xB0 byte on its own. You can most likely rule that out.
- If perl's database interface were configured incorrectly, reading the unicode character \xB0 (which sqlite should encode as the two bytes \xC2\xB0) would arrive in perl as two characters instead of one. You can find out if this is the case using "length", or by hex-dumping the characters as shown above.
- It's perfectly possible for someone to take utf-8 bytes and tell SQLite it's a string of unicode characters, and end up with \xC2 and \xB0 stored as two characters (encoded as 4 utf-8 bytes). I would refer to this situation as being "double-encoded".
- You can repair double-encoded data using perl's utf8::decode($x). Note that it decodes the string in-place, rather than returning the decoded value. It is *almost* always safe to call this on a string whenever you're in doubt, because it is unlikely that any real text would contain two characters that could be mistaken for a utf-8 sequence. This is my go-to whenever I have partly corrupted data after an encoding mistake was deployed to production and polluted the database with some double-encoded data.
- You can only trust "print" to show you encoding problems if perl's STDOUT has the :utf8 layer applied and if your terminal is strictly UTF-8. If perl does not have the encoding layer, there's a chance it will emit valid UTF-8 anyway, and the terminal won't see anything wrong. I emphasize chance here, because \xB0 is within the single-byte range, and perl may or may not have used an internal UTF-8 encoding for the string. There's also the chance that a terminal has "helpful" support for programs that emit bytes, and silently upgrades them to unicode; I don't know anything specifically about Eclipse's terminal, but I would be cautious about trusting it to reveal encoding errors.
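The checks in the list above can be sketched roughly like this. It is only an illustration: dump_chars is a helper name I made up, and the strings are hard-coded stand-ins for whatever your database handle actually returns.

```perl
use strict;
use warnings;

# Hex-dump the character codes of a string (hypothetical helper).
sub dump_chars { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

my $good  = "\x{B0}F";        # what you want: degree sign, then F
my $bytes = "\x{C2}\x{B0}F";  # utf-8 bytes that were never decoded

printf "good:      length %d, chars %s\n", length($good),  dump_chars($good);
printf "undecoded: length %d, chars %s\n", length($bytes), dump_chars($bytes);

# utf8::decode works in place and returns true only if the string
# held valid utf-8; otherwise the string is left unchanged.
my $repaired = $bytes;
my $ok = utf8::decode($repaired);
printf "repaired:  ok=%d, length %d, chars %s\n",
    ($ok ? 1 : 0), length($repaired), dump_chars($repaired);

# Calling it on the already-correct string is harmless here, because a
# lone \xB0 is not a valid utf-8 sequence and so nothing gets changed.
my $copy = $good;
my $changed = utf8::decode($copy);
printf "good again: changed=%d, chars %s\n",
    ($changed ? 1 : 0), dump_chars($copy);
```

The same dump on a value fetched from the database should tell you which of the cases you are in: "B0 46" is fine, while "C2 B0 46" means the bytes arrived undecoded or were stored double-encoded.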