In ASCII, a character always maps to a single byte.
In UTF-8, a character may map to multiple bytes.
For Perl to know whether the data it receives from an external source (which could be a string, or binary data such as an image) as a string of bytes or as a UTF-8 string, it uses the internal UTF8 flag.
Nothing external to Perl (eg the console or the database) knows about this flag, so we need to transform all input/output data into a form that each program understands, which we do using Encode.
To convert an input string of bytes which represents a UTF-8 string, into Perl's internal string format, we DECODE the byes to from UTF-8, using Encode::decode() or Encode::decode_utf8()
To convert a Perl string into a string of bytes representing a UTF-8 for other programs to understand, we ENCODE the string using Encode::encode() or Encode::encode_utf8()
An example round trip
The steps below demonstrate how to accept UTF-8 strings from outside, store them in MySQL, retrieve them from the database, and re-output them
Manual, or with Automatic with
PerlIO layers MySQL::dbd > v 4
--------------- ----------------
--------- --------- -----
+-----
¦ UTF-8 ¦ ---decode_utf8()->> ¦ Perl ¦ ---encode_utf8()->> ¦ UTF
+-8 ¦
¦ Console ¦ <<-encode_utf8()--- ¦ strings ¦ <<-decode_utf8()--- ¦ MySQ
+L DB ¦
--------- --------- -----
+-----
Create a table in MySQL which uses the UTF-8 character set:
This step ensures that all UTF-8 aware programs that interact with this database know to treat the stored data as UTF-8
CREATE TABLE test_db.test ( string VARCHAR(50) ) CHARACTER SET utf
+8;
Get a UTF-8 string:
This step accepts a string of bytes representing a UTF-8 string, and converts them into Perl's internal string format.
-
From a UTF-8 console:
use Encode qw( decode_utf8 );
my $string = <>;
my $utf8_string = decode_utf8($string);
-
or, from an ISO-8859-1 console:
use Encode qw( decode );
my $iso_8859_string = <>;
my $utf8_string = decode('ISO-8859-1',$iso_8859_string);
- or from within a Perl script:
use utf8; # Tells Perl that the script itself is written i
+n UTF-8
my $utf8_string = "UTF-8 string with special chars: ñ æ ô";
Open a UTF-8 enabled database connection:
This step connects to the database, and tells DBD::mysql to auto-convert to/from UTF-8.
IMPORTANT: This requires a version of DBD::mysql greater than version 4
use DBI();
my $dbh = DBI->connect ('dbi:mysql:test_db',
$username,
$password,
{mysql_enable_utf8 => 1}
);
Write to and read from the DB:
$dbh->do('INSERT INTO test_db.test VALUES(?)', $utf8_string);
$dbh->do('SELECT string FROM test_db.test LIMIT 1');
my $new_string = $dbh->fetchrow_arrayref->[0];
Display the retrieved string:
The output data needs to be converted from Perl's internal format into a string of bytes that the console will understand.
- on a UTF-8 console:
use Encode qw( encode_utf8 );
print Encode::encode_utf8($new_string);
OR
# Add an auto-encoding layer
binmode (STDIN,':utf8');
print $new_string;
-
or, on an ISO-8859-1 console:
use Encode qw( encode );
print Encode::encode('ISO-8859-1', $new_string);
For more info, see perlunitut: Unicode in Perl, perluniintro, perlunicode, perlrun, binmode, open and PerlIO.
UPDATE - Added readmore tags. Added diagram illustrating round trip
UPDATE - Corrected a type: TO utf8, not from utf8
Re: A UTF8 round trip with MySQL
by Joost (Canon) on Jun 13, 2007 at 02:22 UTC
|
DBD::mysql 4.004 ( and higher with mysql_enable_utf8 will (or, should) correctly detect any retrieved utf-8 columns (i.e. it will set the utf8 flag on them). That means that retrieved strings will just do the right thing.
update: really, don't use any lower version for this - 4.003 and 4.004 contain some important utf8 fixes
In other words, you should probably not use encode_utf8() on query results. Especially since that discards the utf8 flag.
INSERTing data into utf-8 columns is trickier. As far as I know you still need to make sure the data is really utf-8 encoded. You can use utf8::upgrade() for that.
Do not use methods (like encode_utf8()) that convert to utf-8 but don't set the utf-8 flag for this purpose, since if/when this issue in DBD::mysql gets fixed, those methods will not work correctly.
| [reply] |
|
you should probably not use encode_utf8() on query results
Query results are data coming into Perl, so if anything you would use decode or decode_utf8, but since DBD::mysql version 4, with mysql_enable_utf8, the driver handles this automatically. My use of encode is solely for OUTPUTting the results to the console.
Do not use methods (like encode_utf8()) that convert to utf-8 but don't set the utf-8 flag for this purpose, since if/when this issue in DBD::mysql gets fixed, those methods will not work correctly.
I don't understand what issue you're referring to here. The issue of handling UTF-8 with MySQL has already been fixed, no? My understanding is that you make sure that Perl understands what your string contains (eg use decode_utf8 to convert input into Perl's internal string format) - this may or may not set the UTF8 flag depending on what characters the string contains. The driver will then handle the conversion to UTF-8 bytes for storage in the database.
Clint
| [reply] [d/l] [select] |
|
My use of encode is solely for OUTPUTting the results to the console.
I noticed that. I just wanted to make it clear to the readers that DBD::mysql does the right thing when retrieving utf-8 data, and the programmer doesn't need to do anything special - provided (s)he makes sure utf-8 marked strings are handled correctly on output.
My problem is with more with using encode_utf8() to output utf-8 text to handles. It's a subtle issue, but since encode_utf8 returns unmarked octets it must be treated as binary data; the string can not safely be used as a text string. For one thing, appending a utf-8 marked string or an 8bit latin-1 string to an unmarked utf-8 string causes (possibly irreversible) mangling.
If you're working with Unicode text, it's almost always better to have all utf-8 encoded strings marked and use the :utf8 IO layers; that way you won't have to worry about which encoding the strings are in while you're working with it.
Do not use methods (like encode_utf8()) that convert to utf-8 but don't set the utf-8 flag for this purpose, since if/when this issue in DBD::mysql gets fixed, those methods will not work correctly.
I don't understand what issue you're referring to here. The issue of handling UTF-8 with MySQL has already been fixed, no?
As far as I know (i haven't tested 4.005 yet) trying to insert a $string into a utf-8 column will not work correctly if the $string is in the default 8-bit encoding with the high bit set (for instance, when $string is in Latin-1 with accented characters).
There's a fairly recent bug-report on that on rt.cpan.org and it seems that the issue might get fixed so you won't have to manually encode the input strings - dbd-mysql will then do the right thing automatically. (note: rt is often unresponsive - if that link doesn't work, try again a bit later).
A prerequisite for fixing that bug is that DBD-mysql knows what encoding the input strings are actually in, to prevent it from doing the 8bit -> utf-8 transformation twice (right now it blindly assumes they are utf-8). But the only way to tell is to check the utf-8 flag, which encode_utf8() does not set. utf8::upgrade() does more or less the same thing as encode_utf8(), but might be a bit more efficient since it doesn't need to create a new string when the input string is 7bit ASCII (upgrade works in-place) and it set the utf8 flag correctly.
If dbd-mysql would work correctly, utf-8 strings marked as utf-8 will work, and 8bit strings will work too. Unmarked utf-8 strings won't work. Currently only valid utf-8 encoded strings work, regardless of the utf-8 mark. In other words, make sure your strings are correctly marked AND utf-8 encoded. utf8::upgrade() does exactly that.
| [reply] |
|
|
|
|
I am using dbd::mysql 4.020, and no single utf8 problem seems to be fixed (yes, I have mysql_enable_utf8 on).
| [reply] |
Re: A UTF8 round trip with MySQL
by Juerd (Abbot) on Jun 13, 2007 at 19:49 UTC
|
That is a nice summary, although the only MySQL specific thing in it is {mysql_enable_utf8 => 1} :-).
You have one misleading bit of information, though:
For Perl to know whether the data it receives from an external source (which could be a string, or binary data such as an image) as a string of bytes or as a UTF-8 string, it uses the internal UTF8 flag.
This is a very dangerous assumption! The UTF8 flag is an internal flag that has nothing to do with anything that is external. If it is set, Perl assumes that it wrote the UTF8 buffer itself, and does no further checks. Blindly setting the UTF8 flag is dangerous because it can lead to internally corrupted scalars: malformed UTF8 data.
The :utf8 layer should not be used on input filehandles. Use :encoding(UTF-8) instead.
The _utf8_on function should not be used on external input. Use decode("UTF-8", ...), or possibly decode("UTF8", ...) or decode_utf8(...) instead. You do this correctly.
The UTF8 flag indicates that internal data is UTF8 encoded, and that is regardless of source and history of this string.
| [reply] |
|
open(my $fh, "<:utf8", $fn);
thanks
Clint | [reply] [d/l] |
|
It seems at odds with the docs for the open function and perlopentut, both of which give examples using it
Ah, more documentation needs updates then! I'll look into it; thanks for the pointers.
binmode in perlfunc, in the current development tree, already has the following change:
-To mark FILEHANDLE as UTF-8, use C<:utf8>.
+To mark FILEHANDLE as UTF-8, use C<:utf8>. This will fail on invalid
+UTF-8 sequences; C<:encoding(UTF-8)> is a safer (but slightly less
+efficient) choice.
| [reply] [d/l] |
|
|
Using "<:utf8" has worked fine for me so far. However, juerd does know about this stuff. I'd be interested to know the risks involved.
| [reply] |
|
|
|
|
Re: A UTF8 round trip with MySQL
by Burak (Chaplain) on Jun 27, 2007 at 22:42 UTC
|
This is what I do when communicating with MySQL (code might not be valid -- some sort of pseudo code --):
$dbh = connect;
# fix connection for non-utf8 servers
# execute this immediately
$dbh->do("
SET character_set_client = utf8;
SET character_set_connection = utf8;
SET character_set_results = utf8;
");
$data = fetch;
$data = decode_utf8 $data;
binmode STDOUT, ':utf8';
print $data;
| [reply] [d/l] |
|
$data = fetch;
$data = decode_utf8 $data;
The bit you're missing here is deciding which data to decode, because of course you need to handle text and binary data differently. The latest DBD::mysql drivers handle this automatically.
Clint
| [reply] |
|
Well, since I don't deal with binary data right now, I didn't think about that part :) Ok, I'll check the new interface, but I want to stay compatible with old versions too...
| [reply] |
|
| [reply] |
|
mysql_enable_utf8 is not an auto feature, so it has no effect unless enabled. If it is in effect and I also use Encode methods, this is like shooting yourself in the foot :)
| [reply] |
|
|
decode_utf8 will only fail if the input is invalid utf-8. whether the data is flagged as utf-8 or not does not matter.
| [reply] |
|
I use perl -C (or sets the shell environment variabel PERL_UNICODE=""), and then you do not need any utf8 specific parameter in the code unless the column (or table) names are non-ascii. If they are, mysql_enable_utf8 => 1 have to be set.
$dbh = DBI->connect($dsn, $username, $password, {mysql_enable_utf8 =>
+1}) || die "Could not connect to $database: $DBI::errstr";
$sth->execute || die "Could not execute statement: ".$sth->errstr."\n"
+;
$sth = $dbh->prepare("select foo from bar where id = '3064'") || die "
+Could not prepare statement: ".$dbh->errstr."\n";
my $string = $sth->fetchrow_array;
print $string . "\n";
Will print utf8 to STDOUT. | [reply] [d/l] |
|
|