Re: how to change unicode string to byte array ?

You need to look at how your MySQL database table is defined, and how the text field contents are actually stored in it. Use some client other than a cgi script to investigate this, if the nature of the storage is not clear to you. For example:

#!/usr/bin/perl

use strict;
use DBI;
use Encode;

# set up connection parameters $dbname, $host, $user, $password, then:

my $dbh = DBI->connect( "DBI::mysql:database=$dbname;host=$host", $use
+r, $password );

my $sth = $dbh->prepare( "select item_name from table1" );
$sth->execute;
my ( $is_wide_utf8, $is_not_utf8, $is_ascii );
while ( my ( $item ) = $sth->fetchrow_array )
{
    my $nbytes = length( $item );
    my $nchars = eval "length( decode( 'utf8', $item, Encode::FB_CROAK
+ ))";
    if ( $@ ) {  # decode croaked -- text string is not utf8
        $is_not_utf8++;
        # figure out something else to try (maybe shiftjis?)
    }
    elsif ( $nchars == $nbytes ) {
        $is_ascii++;
    }
    else {
        $is_wide_utf8++;
    }
}
print "$is_ascii entries found to be ascii-only\n" if ( $is_ascii );
print "$is_wide_utf8 entries found to be utf8\n" if ( $is_wide_utf8);
print "$is_not_uft8 entries found to be non-utf8\n"  if ( $is_not_utf8
+ );
[download]

If the results show that your database contains only ascii characters, it probably means that the numeric character entities (like "ஸ") are being stored in the table (you should print some of the entries to confirm this, and you should probably look into fixing things so that you store actual utf8 characters instead).

Next, in your cgi script, you need to figure out what sort of data is coming from your clients -- numeric entities? actual utf8 characters?

As for doing table searches for unicode strings based on input from a cgi client, the best setup is:

The database table stores utf8 strings in the relevant text fields
The cgi script converts the client input to a utf8 string
The query is prepared using a placeholder in the search condition
The query is executed with the utf8 string as the parameter value for the placeholder

In other words, the cgi script might do something like this:

my $srchString = $cgi-param( ... );
$srchString =~ s/\&#(\d+);/chr($1)/ge;  # convert numeric entities to 
+characters

$srchString .= '%' unless $srchString =~ /%$/;  # add a wildcard for t
+he search

my $sth = $dbh->prepare( "select item_name,blah from table1 where item
+_name like ?"
$sth->execute( $srchString );
my $rows = $sth->fetchall_arrayref;

for my $row ( @$rows ) {
    my ( $item, $blah ) = @$row;
    $item = decode( 'utf8', $item );  # make sure perl knows this is u
+tf8 data
    # ... do other stuff...
}
[download]

Comment on Re: how to change unicode string to byte array ? Select or Download Code

Replies are listed 'Best First'.
Re^2: how to change unicode string to byte array ? by nythu (Initiate) on Sep 29, 2008 at 19:05 UTC
Thanks so much graff and others I now understand a little more about data types. may be my data is in wrong format in table. And i found that i also have some problems on communication between program and client as Burak advised me. For this encoding mismatch in communication, i used as "default-character-set=sjis" and " skip-character-set-client-handshake" in my.ini of mysql. Is this ok when i upload to some real hosting ? i convert like that and i used sjis to store data in database.Then, the query problem of "SELECT ... WHERE field1 LIKE "$input"."%"become OK since user input is also in sjis encoding. Anyway, i am really grateful for your good advises. i have some other related problem. this is to convert ascii character 'A' to sjis (japanese full-width alphanumeric character) 'Ａ' . But I am not sure that "the solution may be only known by the people who are working in japanese character" or such problem also exists in other unicode characters.	[reply]
Re^3: how to change unicode string to byte array ? by graff (Chancellor) on Sep 29, 2008 at 23:30 UTC
i have some other related problem. this is to convert ascii character 'A' to sjis (japanese full-width alphanumeric character) 'Ａ' . The easiest way might be to convert "plain ascii" characters into the unicode "fullwidth" characters first, then "Encode::encode()" the unicode string into shiftjis for output. The unicode range for the "fullwidth" ASCII characters begins at \x{FF01} for "!" (exclamation mark), and then proceeds through the normal ASCII character sequence up through \x{FF5E}. So one way to convert "plain ASCII" to "fullwidth ASCII" would be like this: `use Encode; # pick one of the following lines to uncomment # (depends on what sort of display terminal you use): #binmode STDOUT, ":utf8"; #binmode STDOUT, ":encoding(shiftjis)"; $_ = join( '', map { chr() } 0x21 .. 0x7e ) . "\n"; # this is a test +string print; s/([!-~])/chr(ord($1)+0xFEE0)/eg; # convert plain ASCII to CJK fullwi +dth ASCII # print; # uncomment this when you have chosen which binmode line to +use` [download] (update: adding missing colon in 2nd binmode line)	[reply] [d/l]


good chemistry is complicated, and a little bit messy -LW
	PerlMonks