A UTF8 round trip with MySQL

In ASCII, a character always maps to a single byte.
In UTF-8, a character may map to multiple bytes.
For Perl to know whether the data it receives from an external source (which could be a string, or binary data such as an image) as a string of bytes or as a UTF-8 string, it uses the internal UTF8 flag.
Nothing external to Perl (eg the console or the database) knows about this flag, so we need to transform all input/output data into a form that each program understands, which we do using Encode.
To convert an input string of bytes which represents a UTF-8 string, into Perl's internal string format, we DECODE the byes to from UTF-8, using Encode::decode() or Encode::decode_utf8()
To convert a Perl string into a string of bytes representing a UTF-8 for other programs to understand, we ENCODE the string using Encode::encode() or Encode::encode_utf8()

An example round trip

The steps below demonstrate how to accept UTF-8 strings from outside, store them in MySQL, retrieve them from the database, and re-output them


                Manual, or with              Automatic with 
                PerlIO layers                MySQL::dbd > v 4
                ---------------              ---------------- 
 ---------                       ---------                       -----
+-----
Ś  UTF-8  Ś ---decode_utf8()->> Ś  Perl   Ś ---encode_utf8()->> Ś  UTF
+-8   Ś
Ś Console Ś <<-encode_utf8()--- Ś strings Ś <<-decode_utf8()--- Ś MySQ
+L DB Ś
 ---------                       ---------                       -----
+-----
[download]

Create a table in MySQL which uses the UTF-8 character set:

This step ensures that all UTF-8 aware programs that interact with this database know to treat the stored data as UTF-8
```
    CREATE TABLE test_db.test ( string VARCHAR(50) ) CHARACTER SET utf
+8;
[download]
```

Get a UTF-8 string:

This step accepts a string of bytes representing a UTF-8 string, and converts them into Perl's internal string format.

From a UTF-8 console:


    use Encode qw( decode_utf8 );
    my $string      = <>;
    my $utf8_string = decode_utf8($string);
[download]

or, from an ISO-8859-1 console:


   use Encode qw( decode );
   my $iso_8859_string = <>;
   my $utf8_string     = decode('ISO-8859-1',$iso_8859_string);
[download]

or from within a Perl script:


    use utf8;         # Tells Perl that the script itself is written i
+n UTF-8

    my $utf8_string = "UTF-8 string with special chars: ń ć ô";
[download]

Open a UTF-8 enabled database connection:

This step connects to the database, and tells DBD::mysql to auto-convert to/from UTF-8.

IMPORTANT: This requires a version of DBD::mysql greater than version 4


    use DBI();
    my $dbh = DBI->connect ('dbi:mysql:test_db',
                            $username,
                            $password, 
                            {mysql_enable_utf8 => 1}
    );
[download]

Write to and read from the DB:


    $dbh->do('INSERT INTO test_db.test VALUES(?)', $utf8_string);

    $dbh->do('SELECT string FROM test_db.test LIMIT 1');
    my $new_string = $dbh->fetchrow_arrayref->[0];
[download]

Display the retrieved string:

The output data needs to be converted from Perl's internal format into a string of bytes that the console will understand.

on a UTF-8 console:


    use Encode qw( encode_utf8 );
    print Encode::encode_utf8($new_string);

 OR

    # Add an auto-encoding layer
    binmode (STDIN,':utf8');
    print $new_string;
[download]

or, on an ISO-8859-1 console:


    use Encode qw( encode );
    print Encode::encode('ISO-8859-1', $new_string);
[download]

For more info, see perlunitut: Unicode in Perl, perluniintro, perlunicode, perlrun, binmode, open and PerlIO.

UPDATE - Added readmore tags. Added diagram illustrating round trip

UPDATE - Corrected a type: TO utf8, not from utf8

Comment on A UTF8 round trip with MySQL Select or Download Code

Replies are listed 'Best First'.
Re: A UTF8 round trip with MySQL by Joost (Canon) on Jun 13, 2007 at 02:22 UTC
DBD::mysql 4.004 ( and higher with mysql_enable_utf8 will (or, should) correctly detect any retrieved utf-8 columns (i.e. it will set the utf8 flag on them). That means that retrieved strings will just do the right thing. update: really, don't use any lower version for this - 4.003 and 4.004 contain some important utf8 fixes In other words, you should probably not use encode_utf8() on query results. Especially since that discards the utf8 flag. INSERTing data into utf-8 columns is trickier. As far as I know you still need to make sure the data is really utf-8 encoded. You can use utf8::upgrade() for that. Do not use methods (like encode_utf8()) that convert to utf-8 but don't set the utf-8 flag for this purpose, since if/when this issue in DBD::mysql gets fixed, those methods will not work correctly. "What should it profit a man, if he should win a flame war, yet lose his cool?"	[reply]
Re^2: A UTF8 round trip with MySQL by clinton (Priest) on Jun 13, 2007 at 08:26 UTC
you should probably not use encode_utf8() on query results Query results are data coming into Perl, so if anything you would use `decode` or `decode_utf8`, but since DBD::mysql version 4, with `mysql_enable_utf8`, the driver handles this automatically. My use of `encode` is solely for OUTPUTting the results to the console. Do not use methods (like encode_utf8()) that convert to utf-8 but don't set the utf-8 flag for this purpose, since if/when this issue in DBD::mysql gets fixed, those methods will not work correctly. I don't understand what issue you're referring to here. The issue of handling UTF-8 with MySQL has already been fixed, no? My understanding is that you make sure that Perl understands what your string contains (eg use `decode_utf8` to convert input into Perl's internal string format) - this may or may not set the `UTF8 flag` depending on what characters the string contains. The driver will then handle the conversion to UTF-8 bytes for storage in the database. Clint	[reply] [d/l] [select]
Re^3: A UTF8 round trip with MySQL by Joost (Canon) on Jun 13, 2007 at 10:41 UTC
My use of encode is solely for OUTPUTting the results to the console. I noticed that. I just wanted to make it clear to the readers that DBD::mysql does the right thing when retrieving utf-8 data, and the programmer doesn't need to do anything special - provided (s)he makes sure utf-8 marked strings are handled correctly on output. My problem is with more with using encode_utf8() to output utf-8 text to handles. It's a subtle issue, but since encode_utf8 returns unmarked octets it must be treated as binary data; the string can not safely be used as a text string. For one thing, appending a utf-8 marked string or an 8bit latin-1 string to an unmarked utf-8 string causes (possibly irreversible) mangling. If you're working with Unicode text, it's almost always better to have all utf-8 encoded strings marked and use the :utf8 IO layers; that way you won't have to worry about which encoding the strings are in while you're working with it. Do not use methods (like encode_utf8()) that convert to utf-8 but don't set the utf-8 flag for this purpose, since if/when this issue in DBD::mysql gets fixed, those methods will not work correctly. I don't understand what issue you're referring to here. The issue of handling UTF-8 with MySQL has already been fixed, no? As far as I know (i haven't tested 4.005 yet) trying to insert a $string into a utf-8 column will not work correctly if the $string is in the default 8-bit encoding with the high bit set (for instance, when $string is in Latin-1 with accented characters). There's a fairly recent bug-report on that on rt.cpan.org and it seems that the issue might get fixed so you won't have to manually encode the input strings - dbd-mysql will then do the right thing automatically. (note: rt is often unresponsive - if that link doesn't work, try again a bit later). A prerequisite for fixing that bug is that DBD-mysql knows what encoding the input strings are actually in, to prevent it from doing the 8bit -> utf-8 transformation twice (right now it blindly assumes they are utf-8). But the only way to tell is to check the utf-8 flag, which encode_utf8() does not set. utf8::upgrade() does more or less the same thing as encode_utf8(), but might be a bit more efficient since it doesn't need to create a new string when the input string is 7bit ASCII (upgrade works in-place) and it set the utf8 flag correctly. If dbd-mysql would work correctly, utf-8 strings marked as utf-8 will work, and 8bit strings will work too. Unmarked utf-8 strings won't work. Currently only valid utf-8 encoded strings work, regardless of the utf-8 mark. In other words, make sure your strings are correctly marked AND utf-8 encoded. utf8::upgrade() does exactly that. "What should it profit a man, if he should win a flame war, yet lose his cool?"	[reply]
Re^4: A UTF8 round trip with MySQL by clinton (Priest) on Jun 13, 2007 at 11:01 UTC
Re^5: A UTF8 round trip with MySQL by Joost (Canon) on Jun 13, 2007 at 11:28 UTC
Some notes below your chosen depth have not been shown here
Re^2: A UTF8 round trip with MySQL by Anonymous Monk on Jul 25, 2013 at 13:34 UTC
I am using dbd::mysql 4.020, and no single utf8 problem seems to be fixed (yes, I have mysql_enable_utf8 on).	[reply]
Re: A UTF8 round trip with MySQL by Juerd (Abbot) on Jun 13, 2007 at 19:49 UTC
That is a nice summary, although the only MySQL specific thing in it is `{mysql_enable_utf8 => 1}` :-). You have one misleading bit of information, though: For Perl to know whether the data it receives from an external source (which could be a string, or binary data such as an image) as a string of bytes or as a UTF-8 string, it uses the internal UTF8 flag. This is a very dangerous assumption! The UTF8 flag is an internal flag that has nothing to do with anything that is external. If it is set, Perl assumes that it wrote the UTF8 buffer itself, and does no further checks. Blindly setting the UTF8 flag is dangerous because it can lead to internally corrupted scalars: malformed UTF8 data. The `:utf8` layer should not be used on input filehandles. Use `:encoding(UTF-8)` instead. The `_utf8_on` function should not be used on external input. Use `decode("UTF-8", ...)`, or possibly `decode("UTF8", ...)` or `decode_utf8(...)` instead. You do this correctly. The UTF8 flag indicates that internal data is UTF8 encoded, and that is regardless of source and history of this string. Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }	[reply]
Re^2: A UTF8 round trip with MySQL by clinton (Priest) on Jun 13, 2007 at 20:13 UTC
Thanks Juerd The :utf8 layer should not be used on input filehandles. Use :encoding(UTF-8) instead. Why do you say this? It seems at odds with the docs for the open function and perlopentut, both of which give examples using it: `open(my $fh, "<:utf8", $fn);` [download] thanks Clint	[reply] [d/l]
Re^3: A UTF8 round trip with MySQL by Juerd (Abbot) on Jun 13, 2007 at 20:45 UTC
It seems at odds with the docs for the open function and perlopentut, both of which give examples using it Ah, more documentation needs updates then! I'll look into it; thanks for the pointers. binmode in perlfunc, in the current development tree, already has the following change: `-To mark FILEHANDLE as UTF-8, use C<:utf8>. +To mark FILEHANDLE as UTF-8, use C<:utf8>. This will fail on invalid +UTF-8 sequences; C<:encoding(UTF-8)> is a safer (but slightly less +efficient) choice.` [download] Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }	[reply] [d/l]
Re^4: A UTF8 round trip with MySQL by Anonymous Monk on Jul 30, 2013 at 13:28 UTC
Re^3: A UTF8 round trip with MySQL by Joost (Canon) on Jun 13, 2007 at 20:34 UTC
Using "<:utf8" has worked fine for me so far. However, juerd does know about this stuff. I'd be interested to know the risks involved. "What should it profit a man, if he should win a flame war, yet lose his cool?"	[reply]
Re^4: A UTF8 round trip with MySQL by Juerd (Abbot) on Jun 13, 2007 at 20:52 UTC
Re^5: A UTF8 round trip with MySQL by Joost (Canon) on Jun 13, 2007 at 21:01 UTC
Some notes below your chosen depth have not been shown here
Re^5: A UTF8 round trip with MySQL by mje (Curate) on Mar 31, 2009 at 10:24 UTC
Re: A UTF8 round trip with MySQL by Burak (Chaplain) on Jun 27, 2007 at 22:42 UTC
This is what I do when communicating with MySQL (code might not be valid -- some sort of pseudo code --): `$dbh = connect; # fix connection for non-utf8 servers # execute this immediately $dbh->do(" SET character_set_client = utf8; SET character_set_connection = utf8; SET character_set_results = utf8; "); $data = fetch; $data = decode_utf8 $data; binmode STDOUT, ':utf8'; print $data;` [download]	[reply] [d/l]
Re^2: A UTF8 round trip with MySQL by clinton (Priest) on Jun 28, 2007 at 07:53 UTC
$data = fetch; $data = decode_utf8 $data; The bit you're missing here is deciding which data to decode, because of course you need to handle text and binary data differently. The latest DBD::mysql drivers handle this automatically. Clint	[reply]
Re^3: A UTF8 round trip with MySQL by Burak (Chaplain) on Jun 28, 2007 at 21:20 UTC
Well, since I don't deal with binary data right now, I didn't think about that part :) Ok, I'll check the new interface, but I want to stay compatible with old versions too...	[reply]
Re^2: A UTF8 round trip with MySQL by Joost (Canon) on Jun 28, 2007 at 18:01 UTC
decode_utf8 will fail when $data is already flagged as utf8 (i.e. when you're using mysql_enable_utf8 on a recent DBD::mysql version). Also, as clinton noted, your code also requires you to keep track of which columns are binary and which are text. "What should it profit a man, if he should win a flame war, yet lose his cool?"	[reply]
Re^3: A UTF8 round trip with MySQL by Burak (Chaplain) on Jun 28, 2007 at 21:26 UTC
mysql_enable_utf8 is not an auto feature, so it has no effect unless enabled. If it is in effect and I also use Encode methods, this is like shooting yourself in the foot :)	[reply]
Re^4: A UTF8 round trip with MySQL by Joost (Canon) on Jun 28, 2007 at 21:33 UTC
Re^3: A UTF8 round trip with MySQL by Anonymous Monk on Jul 30, 2013 at 13:30 UTC
decode_utf8 will only fail if the input is invalid utf-8. whether the data is flagged as utf-8 or not does not matter.	[reply]
Re^2: A UTF8 round trip with MySQL by Anonymous Monk on Mar 30, 2009 at 18:49 UTC
I use perl -C (or sets the shell environment variabel PERL_UNICODE=""), and then you do not need any utf8 specific parameter in the code unless the column (or table) names are non-ascii. If they are, mysql_enable_utf8 => 1 have to be set. `$dbh = DBI->connect($dsn, $username, $password, {mysql_enable_utf8 => +1}) \|\| die "Could not connect to $database: $DBI::errstr"; $sth->execute \|\| die "Could not execute statement: ".$sth->errstr."\n" +; $sth = $dbh->prepare("select foo from bar where id = '3064'") \|\| die " +Could not prepare statement: ".$dbh->errstr."\n"; my $string = $sth->fetchrow_array; print $string . "\n";` [download] Will print utf8 to STDOUT.	[reply] [d/l]


go ahead... be a heretic
	PerlMonks