Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Malformed UTF-8 character error after fetching data from Postgresql

by nihiliath (Initiate)
on Jul 17, 2014 at 15:01 UTC ( #1094072=perlquestion: print w/ replies, xml ) Need Help??
nihiliath has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Monks

I have this problem:

I have a script fetching some data from a Postgresql database.

The database server encoding is UTF-8, and the script is in cp1251 (windows-1251). After connecting to the database (using DBI) I set the database client encoding to WIN1251, so to tell the database that I want the output to be in this encoding. Then I execute the SQL select query and I fetch the data in a hash reference. When I try to dump the data into the console(or just use it in any other way) I get this warning / error:

Malformed UTF-8 character (unexpected non-continuation byte 0x2e, immediately after start byte 0xf2) in subroutine entry at /usr/local/lib/perl5/5.16/mach/Data/Dumper.pm line 205.

The reason that we use windows-1251 as code encoding is because it's legacy code and we have data in cyrrilic. The database is in UTF8 because we hope that we'll rewrite the code in the future in UTF8...

Can someone tell me why this is happening and how to avoid it?!

I'm using:

FreeBSD 9.2-STABLE #2 r265059 Perl v5.16.3 Postgresql 9.1.13 DBI v1.631 DBD::Pg v3.0.0

My locale settings:

LANG=bg_BG.CP1251 LC_CTYPE="bg_BG.CP1251" LC_COLLATE="bg_BG.CP1251" LC_TIME="bg_BG.CP1251" LC_NUMERIC=C LC_MONETARY="bg_BG.CP1251" LC_MESSAGES="bg_BG.CP1251" LC_ALL=

My database locale:

Name | Owner | Encoding | Collate | Ctype | ----------------+---------+----------+--------------+--------------+ database_name | yavor | UTF8 | C | C |

My script:

#!/usr/bin/perl use warnings; use strict; use Data::Dumper; use DBI; #connection settings my $s_db_name = 'database_name'; my $s_db_user = 'user'; my $s_db_auth = 'pass'; #hash ref for data fetching my $href_row = {}; #connection with transaction my $oref_dbh = DBI->connect('dbi:Pg:dbname=' . $s_db_name, $s_db_user, + $s_db_auth, {AutoCommit => 0, RaiseError => 1}); #getting the default client encoding my $oref_sth = $oref_dbh->prepare("SHOW client_encoding;"); $oref_sth->execute; $href_row = $oref_sth->fetchrow_hashref(); print "\n Default Client Encoding: " . Dumper($href_row); #UTF8 #setting client encoding to be windows-1251 $oref_sth = $oref_dbh->prepare("SET client_encoding = 'WIN1251';"); $oref_sth->execute; #getting the current client encoding $oref_sth = $oref_dbh->prepare("SHOW client_encoding;"); $oref_sth->execute; $href_row = $oref_sth->fetchrow_hashref(); print "\n Current Client Encoding: " . Dumper($href_row); #WIN1251 #preparing the SELECT query $oref_sth = $oref_dbh->prepare(" SELECT id, name_en, host, descr_bg, descr_en FROM application;"); #executing it $oref_sth->execute; #fetching the data $href_row = $oref_sth->fetchall_hashref('id'); #rollback the current transaction $oref_dbh->rollback(); #dumping the fetched data using Data::Dumper print "\n Data: " . Dumper($href_row);

Result from execution:

Default Client Encoding: $VAR1 = { 'client_encoding' => 'UTF8' }; Current Client Encoding: $VAR1 = { 'client_encoding' => 'WIN1251' }; Malformed UTF-8 character (unexpected non-continuation byte 0xee, imme +diately after start byte 0xd1) in subroutine entry at /usr/local/lib/ +perl5/5.16/mach/Data/Dumper.pm line 205. ... Malformed UTF-8 character (1 byte, need 4, after start byte 0xf2) in s +ubroutine entry at /usr/local/lib/perl5/5.16/mach/Data/Dumper.pm line + 205. ... Data: $VAR1 = { '<proper ouput>' => { 'descr_en' => '', 'descr_bg' => '<proper output in the console in cyrri +llic>', 'id' => <proper output>, 'name_en' => '<proper output>', 'host' => '<proper output>' },

Comment on Malformed UTF-8 character error after fetching data from Postgresql
Select or Download Code
Re: Malformed UTF-8 character error after fetching data from Postgresql
by stefbv (Deacon) on Jul 17, 2014 at 17:38 UTC

    The server and client encoding must be the same, there is no automatic conversion (as far as I know). Leave the database as it was until you rewrite the client code.

Re: Malformed UTF-8 character error after fetching data from Postgresql
by sundialsvc4 (Abbot) on Jul 17, 2014 at 18:00 UTC

    Very respectfully to you, stefby, “I’m not so sure that you’re correct on this.

    When I read section 23.3.3 of this PostGres doc page, it states that automatic conversion between client and server datasets is provided, and UTF8<->WIN is a supported combination.

    What I am suspicious of is that Perl is treating the data as UTF8, within Data::Dumper.   It should be easy to query the database directly to be sure that the characters were stored (translated) correctly, and that the received characters are in CP1251.   I suspect that Perl thinks that it’s dealing with UTF8.

      That is interesting, and I found another phrase in the DBD::Pg docs that sounds like I was mistaken:

      pg_enable_utf8 (integer)

      DBD::Pg specific attribute. The behavior of DBD::Pg with regards to this flag has changed as of version 3.0.0. ...

      "Note that the value of client_encoding is only checked on connection time. If you change the client_encoding to/from 'UTF8' after connecting, you can set pg_enable_utf8 to -1 to force DBD::Pg to read in the new client_encoding and act accordingly."

        My instinct here is that Data::Dumper is the one that is confused, not anything in the DBI Stack.

        10x. That seems to be the problem. When I connect to my database the client encoding is UTF8, and by default DBD::Pg set the internal Perl UTF8 flag to true. This cause problems when I change the client encoding after connecting. Perl is asuming that my fetched data should be stored as UTF8 and this is not working. When I set pg_enable_utf8 to 0 everything is fine. Better solution is to set the pg_enable_utf8 flag to -1 after changing the client_encoding as suggested in the DBD::Pg documentation.
Re: Malformed UTF-8 character error after fetching data from Postgresql
by graff (Chancellor) on Jul 19, 2014 at 02:47 UTC
    When I see a statement like this one (from in the OP), I immediately suspect serious problems ahead:

    The database server encoding is UTF-8, and the script is in cp1251 (windows-1251). After connecting to the database (using DBI) I set the database client encoding to WIN1251, so to tell the database that I want the output to be in this encoding.

    Regardless of what you might believe about the character content of text fields in the database, it seems imprudent to assume that every UTF-8 character stored in a given table is going to have a mapping to CP1251.

    If your code is working "perfectly", UTF-8 characters in the database that happen to be non-existent in CP1251 will all be rendered as "?" in the output from your perl script. Maybe that's ok for you, but I wouldn't like it.

    I hope the previous replies have been helpful. I know how tricky it can be to handle DB character encodings "the right way" - in fact, I'd be tempted (for expedience) to just take stuff from the DB as-is and do whatever needs to be done via the Encode module in order to turn it into something else.

    E.g., maybe stuff from the DB would just look like "raw bytes" when the perl script first sees it, so "decode" that into perl-internal UTF-8; then "encode" that into CP1251 for output or whatever else there is to do.

      Yeah there will be a problem if someone enter some data in the DB with characters that aren't supported in CP1251. It shouldn't normally happen... For now this solution solves my problem.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1094072]
Front-paged by Arunbear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (7)
As of 2014-12-27 23:00 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (177 votes), past polls