http://www.perlmonks.org?node_id=1072705

Pickwick has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I have an legacy application which uses Postgres to store some data and it's database has been created with the charset WIN1252 for legacy reasons. Internally the application should only work with UTF-8 Perlstrings in the meantime, but there may be places were it doesn't. My problem now is that if I insert some textual data with german umlauts into the database, the result is that I get UTF-8 bytes instead of the proper german umlaut.

I debugged the problem and in the application the strings all look as expected, the root cause seems to be that DBD::Pg encodes the data as UTF-8 instead of WIN1252 before transferring it to the database. This makes me wonder because the clinet encoding is properly detected as WIN1252 automatically and the fact that DBD::Pg can encode to valid UTF-8 looks like my strings are in fact valid Perlstrings.

If I change my application to set the client encoding to UTF-8 or manually encode my strings to WIN1252 everything works as expected, in both cases I get valid german umlauts in the database. Both of course work because if I tell the connection it's UTF-8, the server can recode properly to WIN1252 and if I send WIN1252 myself the server won't change anything but store the bytes 1:1.

From my understanding, if DBD::Pg detects a client encoding of WIN1252 automatically it shouldn't encode the data to send to UTF-8, but WIN1252 itself. But obviously I'm wrong because the same problem exists on a Windows host, but I just didn't realize it before because in this case the target database has been created as UTF-8.

Is it expected behavior that DBD::Pg encodes UTF-8 Perlstrings to UTF-8 bytes before sending them to the server, regardless of the (automatically detected) client encoding? Does this mean that I simply need to always set the client encoding to UTF-8 if I'm sure to have valid UTF-8 Perlstrings internally?

Thanks for your wisdom!

  • Comment on DBD::Pg encodes Perlstring to UTF-8 bytes instead of WIN1252 regardless client encoding

Replies are listed 'Best First'.
Re: DBD::Pg encodes Perlstring to UTF-8 bytes instead of WIN1252 regardless client encoding
by moritz (Cardinal) on Jan 30, 2014 at 18:04 UTC

    I am not an expert when it comes to DBD::Pg, but my understanding is that client encoding and the table encoding do not have to agree. You pass strings in the client encoding (UTF-8) to DBD::Pg, or if you have pg_enable_utf8 set to 1, you simply pass in text strings.

    According to the docs, Postgres automatically recodes from table to client encoding and vice versa if you tell it to use a specific client encoding. So it should be pretty transparent.

      According to the docs, Postgres automatically recodes from table to client encoding and vice versa if you tell it to use a specific client encoding. So it should be pretty transparent.

      That's what I thought as well, but it simply doesn't behave that way on my systems. Client encoding tells WIN1252, but UTF-8 encoded bytes are send. The only thing I have in between is DBIx::Log4perl which logs the statements send to the server and shows that UTF-8 bytes are send.

        DBIx::Log4perl sees the data before it is sent to the database by DBD::Pg so you cannot rely on what you see in its output as DBD::Pg can change the data.

        I took a casual glance at DBD::Pg code and all the UTF8 stuff seemed to be wrapped in pg_enable_utf8. Are you binding the data as parameters when it is inserted?

        The trouble here is there are a number of variables. You database uses 1252 encoding. What is your postgres client charset set to and what is the encoding of the data you pass to DBD::Pg when it fails and do you have pg_enable_utf8 on?

Re: DBD::Pg encodes Perlstring to UTF-8 bytes instead of WIN1252 regardless client encoding
by ikegami (Patriarch) on Jan 31, 2014 at 17:17 UTC

    I'm going to describe how DBD::mysql works. I suspect DBD::Pg works the same way.


    Perl has two ways of storing strings. DBI or DBD::mysql looks at the internal buffer of scalars without checking which storage format was used, so every time you pass a string, it's as if you actually passed

    use Encode qw( is_utf8 encode_utf8 ); is_utf8($string) ? encode_utf8($string) : $string

    This is a bug, but it almost always does the right thing.


    Workaround:

    • If you have a decoded string (a string of Unicode code points), you can use the following:
      use Encode qw( encode_utf8 ); $dbh->do("SET NAMES utf8"); my $sth = ...; $sth->execute(encode_utf8($decoded));
    • If you have a string encoded using cp1252, you can use the following:
      use Encode qw( decode ); $dbh->do("SET NAMES utf8"); my $sth = ...; $sth->execute(decode('cp1252', $encoded));
    • If you have a string encoded using cp1252 you want to avoid any encoding and decoding on the Perl side, you can use the following:
      sub _d { my ($s) = @_; utf8::downgrade($_); $s } $dbh->do("SET NAMES cp1252"); my $sth = ...; $sth->execute(_d($encoded));

    Notes:

    • Passing mysql_enable_utf8=>1 to DBI->connect does $dbh->do("SET NAMES utf8"); for you. Later changes to mysql_enable_utf8 does not.
    • is_utf8 always returns true for strings returned by Encode::decode and Encode::decode_utf8.
    • is_utf8 always returns false for strings returned by Encode::encode, Encode::encode_utf8 and Encode::from_to.

      I don't use DBD::mysql these days but I do maintain DBD::ODBC and help maintain DBD::Oracle. I'm not aware of anything in DBI which "tampers" in any way with strings passed to the prepare or bind_param methods.

      I've looked at DBD::mysql code and I cannot see anything that explains your "it's as if you actually passed". The original poster said his inserts were with the do method and no bound params and as far as I can see if he'd done that with DBD::mysql (instead of postgres) the string would have ended up in the mysql client API mysql_stmt_prepare untouched. Are you saying bound parameters work differently in DBD::mysql?

      Other than some rather strange enbabling/disabling of SET NAMES in the code I can only find a few calls to sv_utf8_decode to decode UTF8 data received from mysql server and a test of:

      if (SvUTF8(str)) SvUTF8_on(result);

      in the quote method.

      I'm really interested in this from the point of another DBD maintainer.

        I've looked at DBD::mysql code and I cannot see anything that explains your "it's as if you actually passed".

        Does it use SvPV(sv) without checking separate handling based on SvUTF8(sv)? It might do by using the default char* typemap (e.g. void xsfunc(char* s) { ... }).

        As you can see, the following C function is equivalent to that code:

        use strict; use warnings; use Test::More; use Encode qw( is_utf8 encode_utf8 ); use Inline C => <<'__EOC__'; void candidate(SV* sv) { dXSARGS; STRLEN len; char* buf = SvPV(sv, len); SP[0] = sv_2mortal(newSVpvn(buf, len)); XSRETURN(1); } __EOC__ sub baseline { is_utf8($_[0]) ? encode_utf8($_[0]) : $_[0] } sub _u { my ($s) = @_; utf8::upgrade($s); $s } sub _d { my ($s) = @_; utf8::downgrade($s); $s } sub printable { sprintf("%v04X", $_[0]) } my @tests = ( [ '00-7F', "a" ], [ '80-FF,UTF8=0', _d(chr(0xE9)) ], [ '80-FF,UTF8=1', _u(chr(0xE9)) ], [ '>FF', chr(0x2660) ], ); plan tests => 0+@tests; for (@tests) { my ($test_name, $input) = @$_; my $got = candidate($input); my $expected = baseline($input); #is($got, $expected, $test_name); is(printable($got), printable($expected), $test_name); }
        1..4 ok 1 - 00-7F ok 2 - 80-FF,UTF8=0 ok 3 - 80-FF,UTF8=1 ok 4 - >FF

        I'm really interested in this from the point of another DBD maintainer.

        Using the same configuration for all tests, can you roundtrip all of the strings mentioned earlier (as verified by is or eq)?

        my @tests = ( [ '00-7F', "a" ], [ '80-FF,UTF8=0', _d(chr(0xE9)) ], [ '80-FF,UTF8=1', _u(chr(0xE9)) ], [ '>FF', chr(0x2660) ], ); plan tests => 0+@tests; my $dbh = ... connect and setup as you wish ... my $sth = $dbh->prepare('SELECT ?'); for (@tests) { my ($test_name, $input) = @$_; $sth->execute($input) or die; my $row = $sth->fetch() or die; $sth->finish() or die if $row; my $got = $row->[0]; is(printable($got), printable($input), $test_name); }
Re: DBD::Pg encodes Perlstring to UTF-8 bytes instead of WIN1252 regardless client encoding
by erix (Prior) on Feb 04, 2014 at 11:49 UTC