Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things

OT? Character set issues with MySQL/CGI::Application

by cLive ;-) (Prior)
on Jul 24, 2008 at 23:40 UTC ( #700001=perlquestion: print w/replies, xml ) Need Help??
cLive ;-) has asked for the wisdom of the Perl Monks concerning the following question:

Or at least, that's my guess. I'm not sure what I should be looking for. The charset used by my table is latin1. I checked the http header of the generated html:

Content-Type: text/html; charset=ISO-8859-1

ie, latin1. Now, my biggest issue here is that I don't know what exactly the issue is that I'm meant to be examining.

What changes a correct display of Español in the DB and when testing in the shell to Español in the web page output.

Anyone got any pointers?

update: When I call the test page using wget, the output displays correctly in vi, so I'm assuming that the issue is in what the browser thinks the output is. Hmmmmm.

update 2: changing the charset through header_props() to 'utf-8' presents the data correctly. So, I'm assuming that MySQL is throwing out the data as utf-8. Well, getting *some* progress here ;-)

Replies are listed 'Best First'.
Re: OT? Character set issues with MySQL/CGI::Application
by pc88mxer (Vicar) on Jul 24, 2008 at 23:58 UTC
    For a quick-n-dirty fix, try this:
    # $text is the variable containing "Español" from the database use Encode; $text = encode("iso-8859-1", decode("utf-8", $text)); # now print $text
      That does provide correct output (thanks), but does not explain why it's happening or how to resolve it correctly.

      I don't want to use a hack like that in production code. Hmmmm.

        If you can convert it from UTF-8 to Latin-1 and it then works, it means that the data you get from the data is UTF-8, and not Latin-1 as you think.

        In case of doubt print it out on the console and use hexdump to check.

Re: OT? Character set issues with MySQL/CGI::Application (funny "A" + garbage)
by tye (Sage) on Jul 25, 2008 at 06:59 UTC

    That means that you are sending UTF-8 to a browser that is expecting Latin-1. That is probably the most common Unicode problem and the "funny 'A' plus a garbage character" in place of some international letter is dead typical.

    If it were just some layer in your DB connection or Perl "helpfully" converting to UTF-8 for you, then you are supposed to get a warning when you try to output this UTF-8 to your http-daemon because you haven't declared that this output (I/O) handle understands UTF-8. So that may mean that your problem is that you've "declared that the CGI output (I/O) handle is expecting UTF-8" (since you mentioned no warning).

    More likely, your DB is giving UTF-8 strings to your Perl and nobody bothered to inform your Perl of this detail. So Perl doesn't know that its string of bytes is actually encoded as UTF-8 characters so Perl can't warn you but is still writing out the bytes of UTF-8-encoded characters (as opposed to knowing that it is writing out UTF-8 characters by writing out the bytes that they are made of).

    Unicode was designed by people who had gotten used to the utopia of "everything is a byte stream" while not realizing that their creation was going to destroy that utopia so their plans were woefully inadequate. (I got to ride a small bit of the tail of the world before everybody just took for granted that everything was a byte stream.)

    In this painful transition world (before we eventually arrive at the designed "everything is a Unicode stream with appropriate BOM or meta data regarding encoding" "utopia" (the term "my(t)opia" springs to mind, especially with regard to the prior paragraph), one often must be quite careful at every layer to ensure that both sides of that layer agree on the expected encoding. And the layers can be quite numerous.

    You have an advantage in this case in that Perl adds the "is this Unicode?" metadata to its strings and (mostly) to its streams, so the odds are that the layer that is currently causing you problems is likely nearly outside of Perl, probably on the database side.

    My first step would be to upgrade the DBD driver module and see if the problem just goes away. The most likely layers to cause problems are the ones where the authors on each side are the least well connected. Although the authors of a DBD module usually try pretty hard to stay well connected to both their database of choice and to Perl, you don't have to go very far back to find a version (of most DBDs) that isn't dealing with Unicode quite the way their database of choice currently does and/or isn't dealing with Unicode quite the way Perl currently does (Unicode support is still a relatively new concept that is still subject to significant "evolution").

    - tye        

Re: OT? Character set issues with MySQL/CGI::Application
by Corion (Pope) on Jul 25, 2008 at 06:16 UTC

    The issue is with consistency. You need to identify what charset is for all places where you handle your data:

    1. Data entry into Application?
    2. Data entry into the DB?
    3. Data query and display through the DB command line tool?
    4. Data retrieval and mixing with the template?
    5. Data output to to the browser?

    In all these stages, a mistake/mixup can happen which will break the chain. To identify it, you will have to look at hexdumps of the data at various stages to see whether the data is (still) how you want it. Especially Templates are prone to encoding mixups in my experience.

      When I inserted the Español into the DB, I used a manual query in the mysql shell and copied and pasted it from a web page.

      I've been trying to "fix" hte table tonight by playing around with the character set, but

      ALTER TABLE Language CONVERT TO latin1;

      has no effect on the data. I still have to pipe it through Encode::decode.

        How do you know what encoding the mysql shell uses? How do you know what your console uses as encoding? How do you know what the web browser did with the encoded data before putting it into the clipboard?

        You need to eliminate as many conversion steps as possible and check the consistency of the remaining. I wouldn't rely on MySQL to upgrade or change the encoding. Use Scalar::UtilDevel::Peek to dump the scalars before sending them to the DB to see whether Perl encodes them as UTF8 or Latin1. Then retrieve them via DBI and check the encoding again. It should be the same as you put in, but maybe it isn't. Also, the strings should ideally be bytewise identical.

        If Encode::decode() works, then likely your data is in UTF8 when you retrieve it from MySQL. If it already is in UTF8 when you put it in, it could be an idea to keep all data in UTF8 throughout your application and just convert at the output stage to Latin1.

        Update: moritz pointed out that I want Devel::Peek, not Scalar::Util

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://700001]
Approved by pc88mxer
[atcroft]: Is there an example somewhere of a way to order nodes when using Tree::DAG_Node?

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (3)
As of 2018-04-21 00:42 GMT
Find Nodes?
    Voting Booth?