Re: OT? Character set issues with MySQL/CGI::Application (funny "A" + garbage) by tye (Cardinal)
on Jul 25, 2008 at 06:59 UTC
That means that you are sending UTF-8 to a browser that is expecting Latin-1. That is probably the most common Unicode problem and the "funny 'A' plus a garbage character" in place of some international letter is dead typical.
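To make the "funny A" concrete, here is a minimal sketch of what happens to a single letter. The choice of U+00E9 (e with acute accent) is just an illustration, and the `decode` call only stands in for a browser that assumes Latin-1; neither comes from the original question.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00E9, LATIN SMALL LETTER E WITH ACUTE, as a Perl character string.
my $char = "\x{e9}";

# What gets sent: the UTF-8 encoding, which is two bytes (0xC3 0xA9).
my $bytes = encode('UTF-8', $char);

# What a reader assuming Latin-1 does: treat each byte as one character.
my $seen = decode('ISO-8859-1', $bytes);

# $seen now holds U+00C3 (A with tilde) followed by U+00A9 (copyright sign).
my @codepoints = map { sprintf('U+%04X', ord($_)) } split //, $seen;
printf "%d bytes sent, %d characters displayed: %s\n",
    length($bytes), length($seen), join(' ', @codepoints);
```

The first of the two displayed characters is the "funny A"; the second is the garbage that follows it.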
If some layer in your DB connection, or Perl itself, were "helpfully" converting to UTF-8 for you, then you would be supposed to get a warning when you output that UTF-8 to your HTTP daemon, because you haven't declared that this output (I/O) handle understands UTF-8. Since you mentioned no warning, your problem may instead be that you have declared that the CGI output (I/O) handle is expecting UTF-8.
More likely, your DB is giving UTF-8 strings to your Perl and nobody bothered to inform your Perl of this detail. Perl then doesn't know that its string of bytes is actually UTF-8-encoded characters, so it can't warn you; it simply writes out the bytes of the UTF-8-encoded characters (as opposed to knowing that it is writing out UTF-8 characters by writing out the bytes that they are made of).
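A minimal sketch of that situation, assuming the database hands back raw UTF-8 bytes. The variable `$from_db` is a hypothetical stand-in for a fetched column value (no real database is involved), and decoding by hand is only one possible way to inform Perl, not necessarily the right fix for your setup.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for a value fetched from the database: the UTF-8
# bytes for "cafe" with an acute accent on the e. Perl has not been told
# that these bytes are UTF-8.
my $from_db = "caf\xC3\xA9";

# Perl sees five one-byte characters here, so printing this string would
# pass the bytes through untouched and without any warning.
my $raw_length = length $from_db;

# Tell Perl what the bytes mean. The string now holds four characters.
my $text = decode('UTF-8', $from_db);
my $text_length = length $text;

# Declare that the output handle takes UTF-8, so that characters are
# encoded on the way out.
binmode STDOUT, ':encoding(UTF-8)';
print "$raw_length bytes became $text_length characters: $text\n";
```

Once both steps are in place, whatever reads the output still has to be told that it is receiving UTF-8; otherwise the same bytes are misread as Latin-1 again.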
Unicode was designed by people who had gotten used to the utopia of "everything is a byte stream" while not realizing that their creation was going to destroy that utopia, so their plans were woefully inadequate. (I got to ride a small bit of the tail of the world before everybody just took for granted that everything was a byte stream.)
In this painful transition world (before we eventually arrive at the designed "everything is a Unicode stream with appropriate BOM or metadata regarding encoding" "utopia" (the term "my(t)opia" springs to mind, especially with regard to the prior paragraph)), one often must be quite careful at every layer to ensure that both sides of that layer agree on the expected encoding. And the layers can be quite numerous.
You have an advantage in this case in that Perl adds the "is this Unicode?" metadata to its strings and (mostly) to its streams, so the layer that is currently causing you problems is most likely just outside of Perl, probably on the database side.
My first step would be to upgrade the DBD driver module and see if the problem just goes away. The most likely layers to cause problems are the ones where the authors on each side are the least well connected. Although the authors of a DBD module usually try pretty hard to stay well connected to both their database of choice and to Perl, you don't have to go very far back to find a version (of most DBDs) that isn't dealing with Unicode quite the way their database of choice currently does and/or isn't dealing with Unicode quite the way Perl currently does (Unicode support is still a relatively new concept that is still subject to significant "evolution").