Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask

how to change unicode string to byte array ?

by nythu (Initiate)
on Sep 28, 2008 at 17:04 UTC ( #714195=perlquestion: print w/replies, xml ) Need Help??
nythu has asked for the wisdom of the Perl Monks concerning the following question:

hello.. i am new to cgi perl and now i am writing web site that use mysql database on japanese font. I store the data in utf8 and the data in the mysql database table are as the following form. (this is just example) --------------- item_name --------------- &'#20013;&'#30000; , &'#12493;&'#12483;&'#12488:&'#12527:&'#12540:&'#12463: , &'#12469;&'#12540:&'#12496:&'#12540: , &'#65315;&'#65313:&'#65330; , ( i added apostrophe(') codes here for appearing codes here, real data does not include (')) and real readable data format is .. ----- item_name ----- 中田 , ネットワーク , サーバー , CAR (this is japanese full width alphanumeric character, not normal Ascii character) ............. My problem is that when i try to search this name by SQL statement , how can i change the user input data to this code array format(like "&'#20013;&'#30000;" ).(here also added apostrophe(') codes for appearing codes ) I have tried with some code. It was OK to convert one character. I tried like that. my $code = sprintf "%s", ord Encode::decode("sjis", $char); this works for single character. So I tried for split string into character array and to convert each character. But it was not ok.I tried for splitting unicode string into character array using "split()" like that. my @StrAry = split(//,$srcStr); But it was only ok for normal string and cannot divide unicode string. So, please let me know if there any good solution for that. I want to write for the sql statement like "SELECT * FROM table1 WHERE item_name LIKE 'ネットワ%' " and which should result ネットワーク I tried like this and it was oK "SELECT * FROM table1 WHERE item_name LIKE 'ネ%' " and which reults well. But i cannot change user input string to related code array. So, please help me if anyone know about unicode and mysql ,cgi perl. Thnaks in advance thu
  • Comment on how to change unicode string to byte array ?

Replies are listed 'Best First'.
Re: how to change unicode string to byte array ?
by moritz (Cardinal) on Sep 28, 2008 at 18:23 UTC
    Sorry, your description is too cluttered and not very good readable (did you read Writeup Formatting Tips?) to make sense to me.

    But there are a few things you should know: You can downgrade your string with Encode::encode_utf8 into a "byte string", the reverse function is decode_utf8 in the same module.

    What happens to the strings inside your database is subject to the DB configuration and the options to the DBD module you're using (DBD::mysql probably).

    See also encodings and Unicode, Encode, perluniintro and perlunifaq.

Re: how to change unicode string to byte array ?
by graff (Chancellor) on Sep 28, 2008 at 19:02 UTC
    You need to look at how your MySQL database table is defined, and how the text field contents are actually stored in it. Use some client other than a cgi script to investigate this, if the nature of the storage is not clear to you. For example:
    #!/usr/bin/perl use strict; use DBI; use Encode; # set up connection parameters $dbname, $host, $user, $password, then: my $dbh = DBI->connect( "DBI::mysql:database=$dbname;host=$host", $use +r, $password ); my $sth = $dbh->prepare( "select item_name from table1" ); $sth->execute; my ( $is_wide_utf8, $is_not_utf8, $is_ascii ); while ( my ( $item ) = $sth->fetchrow_array ) { my $nbytes = length( $item ); my $nchars = eval "length( decode( 'utf8', $item, Encode::FB_CROAK + ))"; if ( $@ ) { # decode croaked -- text string is not utf8 $is_not_utf8++; # figure out something else to try (maybe shiftjis?) } elsif ( $nchars == $nbytes ) { $is_ascii++; } else { $is_wide_utf8++; } } print "$is_ascii entries found to be ascii-only\n" if ( $is_ascii ); print "$is_wide_utf8 entries found to be utf8\n" if ( $is_wide_utf8); print "$is_not_uft8 entries found to be non-utf8\n" if ( $is_not_utf8 + );
    If the results show that your database contains only ascii characters, it probably means that the numeric character entities (like "ஸ") are being stored in the table (you should print some of the entries to confirm this, and you should probably look into fixing things so that you store actual utf8 characters instead).

    Next, in your cgi script, you need to figure out what sort of data is coming from your clients -- numeric entities? actual utf8 characters?

    As for doing table searches for unicode strings based on input from a cgi client, the best setup is:

    • The database table stores utf8 strings in the relevant text fields
    • The cgi script converts the client input to a utf8 string
    • The query is prepared using a placeholder in the search condition
    • The query is executed with the utf8 string as the parameter value for the placeholder
    In other words, the cgi script might do something like this:
    my $srchString = $cgi-param( ... ); $srchString =~ s/\&#(\d+);/chr($1)/ge; # convert numeric entities to +characters $srchString .= '%' unless $srchString =~ /%$/; # add a wildcard for t +he search my $sth = $dbh->prepare( "select item_name,blah from table1 where item +_name like ?" $sth->execute( $srchString ); my $rows = $sth->fetchall_arrayref; for my $row ( @$rows ) { my ( $item, $blah ) = @$row; $item = decode( 'utf8', $item ); # make sure perl knows this is u +tf8 data # ... do other stuff... }
      Thanks so much graff and others I now understand a little more about data types. may be my data is in wrong format in table. And i found that i also have some problems on communication between program and client as Burak advised me. For this encoding mismatch in communication, i used as "default-character-set=sjis" and " skip-character-set-client-handshake" in my.ini of mysql. Is this ok when i upload to some real hosting ? i convert like that and i used sjis to store data in database.Then, the query problem of "SELECT ... WHERE field1 LIKE "$input"."%"become OK since user input is also in sjis encoding. Anyway, i am really grateful for your good advises. i have some other related problem. this is to convert ascii character 'A' to sjis (japanese full-width alphanumeric character) 'A' . But I am not sure that "the solution may be only known by the people who are working in japanese character" or such problem also exists in other unicode characters.
        i have some other related problem. this is to convert ascii character 'A' to sjis (japanese full-width alphanumeric character) 'A' .

        The easiest way might be to convert "plain ascii" characters into the unicode "fullwidth" characters first, then "Encode::encode()" the unicode string into shiftjis for output.

        The unicode range for the "fullwidth" ASCII characters begins at \x{FF01} for "!" (exclamation mark), and then proceeds through the normal ASCII character sequence up through \x{FF5E}. So one way to convert "plain ASCII" to "fullwidth ASCII" would be like this:

        use Encode; # pick one of the following lines to uncomment # (depends on what sort of display terminal you use): #binmode STDOUT, ":utf8"; #binmode STDOUT, ":encoding(shiftjis)"; $_ = join( '', map { chr() } 0x21 .. 0x7e ) . "\n"; # this is a test +string print; s/([!-~])/chr(ord($1)+0xFEE0)/eg; # convert plain ASCII to CJK fullwi +dth ASCII # print; # uncomment this when you have chosen which binmode line to +use
        (update: adding missing colon in 2nd binmode line)
Re: how to change unicode string to byte array ?
by Burak (Chaplain) on Sep 28, 2008 at 17:48 UTC
    This no such byte array. They seem to be HTML entities, i.e.: html escaped versions of the characters. It looks like you have a bigger problem. It'll be better to fix the communication between your program (i.e.: server) and the client to use utf8 and also be sure that your program and the database talks in utf8 too...

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://714195]
Approved by Corion
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (3)
As of 2018-06-19 20:36 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (114 votes). Check out past polls.