
My use of encode is solely for OUTPUTting the results to the console.

I noticed that. I just wanted to make it clear to the readers that DBD::mysql does the right thing when retrieving utf-8 data, and the programmer doesn't need to do anything special - provided (s)he makes sure utf-8 marked strings are handled correctly on output.

My problem is more with using encode_utf8() to output utf-8 text to handles. It's a subtle issue, but since encode_utf8() returns unmarked octets, its result must be treated as binary data; it cannot safely be used as a text string. For one thing, appending a utf-8 marked string or an 8-bit latin-1 string to an unmarked utf-8 string causes (possibly irreversible) mangling.
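A minimal sketch of that mangling, assuming only the core Encode module; the variable names are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $text   = "caf\x{e9}";          # 4 characters, utf8 flag off
my $octets = encode_utf8($text);   # 5 octets (c a f C3 A9), utf8 flag off
my $marked = "\x{263a}";           # wide character, so the utf8 flag is on

# Concatenating with a marked string upgrades $octets as if it were
# Latin-1, so the two octets C3 A9 become two separate characters.
my $joined = $octets . $marked;

printf "characters in joined string: %d\n", length $joined;
printf "octets after re-encoding:    %d\n", length encode_utf8($joined);
```

The joined string has 6 characters where 5 were intended, and encoding it again double-encodes the accented character.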

If you're working with Unicode text, it's almost always better to have all utf-8 encoded strings marked and use the :utf8 IO layers; that way you won't have to worry about which encoding the strings are in while you're working with them.
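As a rough illustration of the layer approach, here is a sketch that writes to an in-memory handle so the resulting octets can be inspected; for console output the equivalent would be binmode(STDOUT, ':utf8'). The variable names are only for this example:

```perl
use strict;
use warnings;

my $text = "na\x{ef}ve \x{263a}";   # 7 characters of text, no manual encoding

# The :utf8 layer encodes on output, so the program only ever
# handles character strings.
my $buffer = '';
open my $out, '>:utf8', \$buffer or die "open: $!";
print {$out} $text;
close $out or die "close: $!";

printf "characters in: %d, octets out: %d\n", length $text, length $buffer;
```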

Do not use methods (like encode_utf8()) that convert to utf-8 but don't set the utf-8 flag for this purpose, since if/when this issue in DBD::mysql gets fixed, those methods will not work correctly.

I don't understand what issue you're referring to here. The issue of handling UTF-8 with MySQL has already been fixed, no?

As far as I know (I haven't tested 4.005 yet), trying to insert a $string into a utf-8 column will not work correctly if the $string is in the default 8-bit encoding with the high bit set (for instance, when $string is in Latin-1 with accented characters).

There's a fairly recent bug report about that on rt.cpan.org, and it seems the issue might get fixed so that you won't have to manually encode the input strings - DBD::mysql will then do the right thing automatically. (Note: rt is often unresponsive - if that link doesn't work, try again a bit later.)

A prerequisite for fixing that bug is that DBD::mysql knows what encoding the input strings are actually in, to prevent it from doing the 8-bit -> utf-8 transformation twice (right now it blindly assumes they are utf-8). But the only way to tell is to check the utf-8 flag, which encode_utf8() does not set. utf8::upgrade() does more or less the same thing as encode_utf8(), but might be a bit more efficient since it doesn't need to create a new string when the input string is 7-bit ASCII (upgrade works in-place), and it sets the utf8 flag correctly.

Once DBD::mysql works correctly, utf-8 strings marked as utf-8 will work, and 8-bit strings will work too. Unmarked utf-8 strings won't work. Currently only valid utf-8 encoded strings work, regardless of the utf-8 mark. In other words, make sure your strings are correctly marked AND utf-8 encoded. utf8::upgrade() does exactly that.
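To make the difference concrete, a small sketch comparing the two on the same Latin-1 string, using only core functions; no database is involved, and the variable names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $latin1 = "caf\x{e9}";             # 8-bit string with the high bit set

my $octets = encode_utf8($latin1);    # new string: utf-8 octets, unmarked

my $upgraded = $latin1;
utf8::upgrade($upgraded);             # in place: utf-8 internally, marked

printf "encode_utf8:   flag=%d length=%d\n",
    utf8::is_utf8($octets) ? 1 : 0, length $octets;
printf "utf8::upgrade: flag=%d length=%d\n",
    utf8::is_utf8($upgraded) ? 1 : 0, length $upgraded;
```

The upgraded string still compares equal to the original as text; only its internal representation and flag change, which is what lets a driver tell it apart from unmarked octets.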

"What should it profit a man, if he should win a flame war, yet lose his cool?"

In reply to Re^3: A UTF8 round trip with MySQL by Joost
in thread A UTF8 round trip with MySQL by clinton
