PerlMonks  

Re^3: DBD::Pg encodes Perlstring to UTF-8 bytes instead of WIN1252 regardless client encoding

by mje (Deacon)
on Jan 31, 2014 at 08:50 UTC (#1072808)


in reply to Re^2: DBD::Pg encodes Perlstring to UTF-8 bytes instead of WIN1252 regardless client encoding
in thread DBD::Pg encodes Perlstring to UTF-8 bytes instead of WIN1252 regardless client encoding

DBIx::Log4perl sees the data before it is sent to the database by DBD::Pg so you cannot rely on what you see in its output as DBD::Pg can change the data.

I took a casual glance at DBD::Pg code and all the UTF8 stuff seemed to be wrapped in pg_enable_utf8. Are you binding the data as parameters when it is inserted?

The trouble here is that there are a number of variables. Your database uses WIN1252 encoding. What is your postgres client charset set to and what is the encoding of the data you pass to DBD::Pg when it fails and do you have pg_enable_utf8 on?
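
For readers unsure what "binding the data as parameters" means, here is a minimal sketch using a DBI placeholder. The table `notes`, the column `body`, and the helper name `insert_note` are made up for illustration; only `prepare`, `execute`, and the `?` placeholder are standard DBI. It shows the call style only and makes no claim about how DBD::Pg encodes the bound value in this setup.

```perl
use strict;
use warnings;

# Hypothetical helper: inserts one value through a bound placeholder
# instead of concatenating it into the SQL text.
# $dbh is assumed to be a connected DBI database handle.
sub insert_note {
    my ($dbh, $text) = @_;
    my $sth = $dbh->prepare('INSERT INTO notes (body) VALUES (?)');
    return $sth->execute($text);
}
```

With concatenation the value becomes part of the SQL string handed to `do`; with a placeholder the SQL text and the value travel separately.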


Re^4: DBD::Pg encodes Perlstring to UTF-8 bytes instead of WIN1252 regardless client encoding
by Pickwick (Beadle) on Jan 31, 2014 at 09:23 UTC
    DBIx::Log4perl sees the data before it is sent to the database by DBD::Pg so you cannot rely on what you see in its output as DBD::Pg can change the data.

    In theory, yes, but in practice Log4perl was logging as expected one line before the SQL statement, and the output of DBIx makes sense for the problem I described. If I encode the data to WIN1252 before passing it to DBI, the logged output of DBIx changes as well, hence my strong guess that DBIx is logging what gets sent to the database.

    Are you binding the data as parameters when it is inserted?

    No, the application simply creates a SQL string by concatenating different UTF-8 Perlstrings together and passes that to "do", and the whole string (SQL command plus values) gets encoded to UTF-8 bytes.

    What is your postgres client charset set to

    It's automatically detected as WIN1252 when the problem occurs, which makes perfect sense on Windows and on the Linux system I tested. Another Ubuntu server automatically detects it as UTF-8 instead, because it has a UTF-8 locale. If I manually change it to UTF-8, everything works as expected: the database server properly converts the UTF-8 bytes to WIN1252 characters in the target database.

    and what is the encoding of the data you pass to DBD::Pg when it fails

    Valid UTF-8 Perlstrings with UTF-8 flag turned on.

    and do you have pg_enable_utf8 on?

    No, because from my understanding it is deprecated and only used in reading from the database, not writing to.
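
To make the difference between a Perl character string and its encoded bytes concrete, here is a small sketch using only the core Encode module. The sample text is an assumption chosen for illustration; `cp1252` is Encode's name for WIN1252. It shows what explicitly encoding the string before handing it to DBI would produce in each encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A Perl character string: U+00E4 (a with diaeresis) followed by
# U+20AC (euro sign). The second character is above 0xFF, so Perl
# stores the string internally with the UTF-8 flag on.
my $text = "\x{E4}\x{20AC}";

# The same two characters as bytes in two different encodings.
my $utf8_bytes   = encode('UTF-8',  $text);   # C3 A4 E2 82 AC
my $cp1252_bytes = encode('cp1252', $text);   # E4 80

printf "characters:   %d\n", length $text;
printf "UTF-8 bytes:  %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $utf8_bytes;
printf "cp1252 bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $cp1252_bytes;
```

The two byte sequences differ in both length and content, so the receiving side has to be told which one it is getting.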

      therefore my strong guess that DBIx is logging what gets sent to the database

      As the author of DBIx::Log4perl I can guarantee you that DBIx::Log4perl gets any SQL passed to do/prepare before DBI and before DBD::Pg. You seemed to be suggesting that DBD::Pg was changing the encoding on strings and I'm simply saying if it did you would not see this from a DBIx::Log4perl log.

      If you change the encoding of data passed to do/prepare I'd expect the log to change too so that doesn't tell us anything really.

      Looking at the DBD::Pg code, if you don't enable pg_enable_utf8 then it doesn't seem to change any data to or from the database.

      It seems like you are suggesting that when your client charset is 1252 and you pass UTF8 to DBD::Pg then the data isn't right in the database? I wouldn't expect it to be since postgres thinks it is 1252 but you sent utf8.

      If you have UTF8 encoded data then set your client charset to utf8. If you want utf8 back then set pg_enable_utf8 - I didn't see a deprecated warning anywhere and I can assure you it is used all over the DBD::Pg code.
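
A minimal sketch of that advice, assuming an already connected DBD::Pg handle. The helper name `use_utf8_on_the_wire` is made up for illustration; `SET client_encoding` is a standard PostgreSQL statement and `pg_enable_utf8` is the DBD::Pg attribute discussed in this thread. The exact behaviour of the attribute depends on the DBD::Pg version, so treat this as a starting point rather than a definitive fix.

```perl
use strict;
use warnings;

# Hypothetical helper; $dbh is assumed to be a connected DBD::Pg handle.
sub use_utf8_on_the_wire {
    my ($dbh) = @_;

    # Tell the server that this session sends and expects UTF-8.
    # The server then converts to and from the database encoding.
    $dbh->do(q{SET client_encoding TO 'UTF8'});

    # Ask DBD::Pg to mark strings coming back from the server as UTF-8.
    $dbh->{pg_enable_utf8} = 1;

    return $dbh;
}
```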

        As the author of DBIx::Log4perl I can guarantee you that DBIx::Log4perl gets any SQL passed to do/prepare before DBI and before DBD::Pg.

        In this case I wonder why DBIx logs UTF-8 bytes, whereas Log4perl "outside" of DBI/DBIx does not. There must be some interference there: something is being passed a UTF-8 Perlstring and encodes it to UTF-8 bytes.

        It seems like you are suggesting that when your client charset is 1252 and you pass UTF8 to DBD::Pg then the data isn't right in the database?

        It's wrong if the target database is WIN1252 as well; it works if the target is UTF-8. This is simply because in each case UTF-8 bytes get transferred and are stored 1:1 in the target, which results in UTF-8 bytes in a WIN1252 target and proper UTF-8 characters in a UTF-8 target.

        I wouldn't expect it to be since postgres thinks it is 1252 but you sent utf8.

        I don't send UTF-8, I give a valid UTF-8 Perlstring and DBD::Pg should handle the communication on its own. But it doesn't: it always sends UTF-8 bytes regardless of the client encoding, while the server uses the client encoding to interpret what it receives. If the client encoding is UTF-8, it matches the data sent and the server can properly convert to WIN1252 for a WIN1252 target database, but if the two differ I get garbage in the target, and that's what I don't understand: why does DBD::Pg always send UTF-8 bytes and not WIN1252 if the client encoding says so?

        According to the docs there should be automatic conversion depending on the client encoding. But it doesn't work this way.

        If you have UTF8 encoded data then set your client charset to utf8.

        But especially with UTF-8 strings on the client, DBD::Pg should be able to convert into any charset it likes. I don't understand why I'm forced to set the client encoding to match some internal representation DBD::Pg sends over the wire, and why this doesn't seem to be documented. The documentation says otherwise: that conversion takes place automatically between client and server, and I understand that to cover sending data as well.

        If you want utf8 back then set pg_enable_utf8 - I didn't see a deprecated warning anywhere and I can assure you it is used all over the DBD::Pg code.

        According to this source it will be deprecated in the future, but my problem is not with reading data anyway.
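
The mismatch described in this reply, UTF-8 bytes being treated as WIN1252, can be reproduced without a database. The following sketch uses only the core Encode module; the sample character is an assumption chosen for illustration, and the `decode('cp1252', ...)` call merely stands in for a reader that assumes WIN1252.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "\x{E4}";                      # one character: a with diaeresis

# What goes over the wire: the UTF-8 encoding of the string (two bytes).
my $sent = encode('UTF-8', $text);        # C3 A4

# What a reader sees if those bytes are taken to be WIN1252:
# two characters instead of one.
my $misread = decode('cp1252', $sent);    # U+00C3 U+00A4

# What a reader sees if the bytes are taken to be UTF-8: the original.
my $roundtrip = decode('UTF-8', $sent);

printf "sent %d byte(s)\n", length $sent;
printf "misread as WIN1252: %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $misread;
printf "read as UTF-8:      %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $roundtrip;
```

The bytes are identical in both cases; only the declared encoding decides whether they come back as one character or two.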
