Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

I've now had a chance to look at this more as I was a bit surprised you had to do this.

The main issue is you are using verchar columns and not nvarchar columns and you did not tell us how the data got into the database in the first place. Let's assume the data was inserted into the database with DBD::ODBC in the first place and see what happens

use strict; use warnings; use DBI qw(:utils :sql_types); use 5.016; use Data::Dumper; use Encode; my $h = DBI->connect() or die $DBI::errstr; $h->{RaiseError} = 1; eval { $h->do(q/drop table chinese/); }; $h->do(q/create table chinese (n varchar(30), d varchar(12))/); my $str = "\x{20020}\x{24202}\x{31185}\x{23460}"; my $encoded = encode('UTF-8', $str); foreach my $s (split(//, $encoded)) { print sprintf("%x,", ord($s)); } say ""; say data_string_desc($str); my $s = $h->prepare(q/insert into chinese values(?,?)/); $s->execute($str, "111"); $s = $h->prepare(q/select cast(n as varbinary) from chinese/); $s->execute; my $r = $s->fetchall_arrayref; foreach my $s (split(//, $r->[0][0])) { print sprintf("%x,", ord($s)); } say ""; say data_string_desc($r->[0][0]); my $d = decode('UTF-8', $r->[0][0]); say data_string_desc($d); print Dumper(\$d);

which outputs:

f0,a0,80,a0,f0,a4,88,82,f0,b1,86,85,f0,a3,91,a0, UTF8 on, non-ASCII, 4 characters 16 bytes f0,a0,80,a0,f0,a4,88,82,f0,b1,86,85,f0,a3,91,a0, UTF8 off, non-ASCII, 16 characters 16 bytes UTF8 on, non-ASCII, 4 characters 16 bytes $VAR1 = \"\x{20020}\x{24202}\x{31185}\x{23460}";

When I look under the hood at what ODBC APIs DBD::ODBC is calling I see DBD::ODBC took the 4 chinese characters that are UTF-8 encoded in Perl (see first lines of output) and it converted them to UTF16 and bound them as SQL_WCHARs. Note, varchar columns are not really designed to store unicode data. If we convert what SQL Server thinks it has in the column to varbinary and read it back we miracoulously get the chinese string UTF-8 encoded back (the 3rd/4th lines of output). Unsurprisingly we can decode that UTF-8 back into a string in Perl (the last 2 lines of output). In other words, SQL Server does not know that is UTF-8 encoded data and length functions and collation will not work.

If we rewrite the select code to a simple "select * from chinese" we seem to get back rubbish. What has happened here is that DBD::ODBC bound what it believed to be char data as SQL_WCHARs, SQL Server passed back UTF16 encoded representation of the 16 bytes and DBD::ODBC decoded it to UTF-8 and hence it is now double encoded UTF-8.

If we had changed the code above to bind the input data as SQL_WCHARs SQL Server would have just put a load of question marks in the column as it cannot do what you want.

The correct way to do this with SQL Server is to make the column nvarchar and then everything will just work (except collations when characters don't fit into UCS-2 until SQL Server 2012 - see Inserting unicode characters > 0xFFFF (surrogate pairs) into MS SQL Server with Perl DBD::ODBC for why).

In reply to Re^3: Perl DBI can't display Chinese text? by mje
in thread Perl DBI can't display Chinese text? by Thai Heng

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and all is quiet...

    How do I use this? | Other CB clients
    Other Users?
    Others scrutinizing the Monastery: (3)
    As of 2018-06-23 23:52 GMT
    Find Nodes?
      Voting Booth?
      Should cpanminus be part of the standard Perl release?

      Results (126 votes). Check out past polls.