Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

I've now had a chance to look at this more as I was a bit surprised you had to do this.

The main issue is you are using verchar columns and not nvarchar columns and you did not tell us how the data got into the database in the first place. Let's assume the data was inserted into the database with DBD::ODBC in the first place and see what happens

use strict; use warnings; use DBI qw(:utils :sql_types); use 5.016; use Data::Dumper; use Encode; my $h = DBI->connect() or die $DBI::errstr; $h->{RaiseError} = 1; eval { $h->do(q/drop table chinese/); }; $h->do(q/create table chinese (n varchar(30), d varchar(12))/); my $str = "\x{20020}\x{24202}\x{31185}\x{23460}"; my $encoded = encode('UTF-8', $str); foreach my $s (split(//, $encoded)) { print sprintf("%x,", ord($s)); } say ""; say data_string_desc($str); my $s = $h->prepare(q/insert into chinese values(?,?)/); $s->execute($str, "111"); $s = $h->prepare(q/select cast(n as varbinary) from chinese/); $s->execute; my $r = $s->fetchall_arrayref; foreach my $s (split(//, $r->[0][0])) { print sprintf("%x,", ord($s)); } say ""; say data_string_desc($r->[0][0]); my $d = decode('UTF-8', $r->[0][0]); say data_string_desc($d); print Dumper(\$d);

which outputs:

f0,a0,80,a0,f0,a4,88,82,f0,b1,86,85,f0,a3,91,a0, UTF8 on, non-ASCII, 4 characters 16 bytes f0,a0,80,a0,f0,a4,88,82,f0,b1,86,85,f0,a3,91,a0, UTF8 off, non-ASCII, 16 characters 16 bytes UTF8 on, non-ASCII, 4 characters 16 bytes $VAR1 = \"\x{20020}\x{24202}\x{31185}\x{23460}";

When I look under the hood at what ODBC APIs DBD::ODBC is calling I see DBD::ODBC took the 4 chinese characters that are UTF-8 encoded in Perl (see first lines of output) and it converted them to UTF16 and bound them as SQL_WCHARs. Note, varchar columns are not really designed to store unicode data. If we convert what SQL Server thinks it has in the column to varbinary and read it back we miracoulously get the chinese string UTF-8 encoded back (the 3rd/4th lines of output). Unsurprisingly we can decode that UTF-8 back into a string in Perl (the last 2 lines of output). In other words, SQL Server does not know that is UTF-8 encoded data and length functions and collation will not work.

If we rewrite the select code to a simple "select * from chinese" we seem to get back rubbish. What has happened here is that DBD::ODBC bound what it believed to be char data as SQL_WCHARs, SQL Server passed back UTF16 encoded representation of the 16 bytes and DBD::ODBC decoded it to UTF-8 and hence it is now double encoded UTF-8.

If we had changed the code above to bind the input data as SQL_WCHARs SQL Server would have just put a load of question marks in the column as it cannot do what you want.

The correct way to do this with SQL Server is to make the column nvarchar and then everything will just work (except collations when characters don't fit into UCS-2 until SQL Server 2012 - see Inserting unicode characters > 0xFFFF (surrogate pairs) into MS SQL Server with Perl DBD::ODBC for why).


In reply to Re^3: Perl DBI can't display Chinese text? by mje
in thread Perl DBI can't display Chinese text? by Thai Heng

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (6)
As of 2024-03-29 00:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found