dstrom has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I want to convert ASCII text that represents Chinese characters into the characters themselves, for example in UTF-8. The Chinese characters in my file are encoded as two hex bytes in GB, e.g., 中 is (D6 D0). I only have the ASCII text of the two hex values, "D6 D0", etc. How can this be converted to Chinese characters? I appreciate your help, thanks in advance.

Re: How do I convert a sequence of hexes (D0 D6) to Chinese characters (中)?
by kejohm (Hermit) on Jun 05, 2011 at 22:15 UTC

    You could use the hex() function to convert the hex string into its numeric value, then use the chr() function to get the character with that code point, e.g.:

    chr( hex( 'D0D6' ) );

    Update: Fixed missing quotes, thanks davido

      I'm not sure that's what the original poster needs. chr(0xD6D0) means Unicode code point U+D6D0 which is the character '훐'. Whereas the poster said the bytes represented by the ASCII string 'D6D0' are the character '中' in a 'GB' encoding. I'm not very knowledgeable about Asian encodings but I'll assume that the specific encoding is GB-2312.
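      To make the difference concrete, here is a minimal sketch that compares the two readings of the same hex digits. It assumes, as above, that 'GB' means GB-2312, and the variable names are only illustrative:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Reading 1: chr(hex(...)) treats the number as a Unicode code point.
my $as_code_point = chr( hex('D6D0') );

# Reading 2: pack + decode treats the hex digits as two GB2312-encoded bytes.
# (Assumption: the poster's 'GB' encoding is GB-2312.)
my $as_gb2312 = decode( 'gb2312', pack( 'H*', 'D6D0' ) );

# Print only the code point numbers, so no output encoding is involved.
printf "chr/hex:     U+%04X\n", ord($as_code_point);
printf "pack/decode: U+%04X\n", ord($as_gb2312);
```

      The first line should report U+D6D0 (the Hangul syllable) and the second U+4E2D, which is '中'.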

      So the things we need to do are:

      1. convert the ASCII hex string into bytes
      2. decode the bytes from GB-2312 to Perl's internal character representation
      3. convert to a suitable output encoding

      Here's a complete script which does all of that:

      #!/usr/bin/perl
      use strict;
      use warnings;

      use Encode qw(decode);

      my $ascii_hex = 'D6D0';  # continue for as many bytes as required

      my $bytes = pack('H*', $ascii_hex);
      my $character_string = decode('gb2312', $bytes);

      binmode(STDOUT, ':utf8');
      print $character_string, "\n";
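      The original question shows the hex values separated by spaces ("D6 D0"), so the input probably needs cleaning before pack() sees it. A possible sketch of the same three steps with that cleanup added follows; hex_text_to_string is a hypothetical helper name, the sample input is made up, and GB-2312 is still an assumption:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn hex text such as "D6 D0 CE C4" into a
# Perl character string, assuming the bytes are GB2312-encoded.
sub hex_text_to_string {
    my ($hex_text) = @_;

    # Drop the whitespace that separates the hex pairs.
    ( my $hex = $hex_text ) =~ s/\s+//g;

    # Refuse input that pack('H*', ...) would silently mangle.
    die "non-hex input\n"            if $hex =~ /[^0-9A-Fa-f]/;
    die "odd number of hex digits\n" if length($hex) % 2;

    my $bytes = pack( 'H*', $hex );       # step 1: hex text -> bytes
    return decode( 'gb2312', $bytes );    # step 2: bytes -> characters
}

# Step 3: encode explicitly for output (UTF-8 chosen here as an example).
my $string = hex_text_to_string('D6 D0 CE C4');    # assumed sample input
print encode( 'UTF-8', $string ), "\n";
```

      Encoding explicitly with encode() instead of setting a layer on STDOUT is just one option; either approach should write the same bytes.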

        Yes, you're right. I must admit that I do not deal with Unicode very often, and therefore am not very knowledgeable on the subject.

        I am very grateful for your help. However, when I run your program, I do not get the character '中', but rather the nonsensical "Σ╕". Any idea what is going wrong? Thanks. (I have the East Asian language pack installed, so it is not simply that.)
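        One possible explanation, offered as a guess rather than a diagnosis: the script writes correct UTF-8 bytes, but the console displays them using a legacy code page. The sketch below assumes code page 437 (an assumption, not something stated in the thread) and prints what the UTF-8 bytes of '中' turn into when misread that way:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The UTF-8 bytes for U+4E2D, i.e. what the script writes to STDOUT.
my $utf8_bytes = encode( 'UTF-8', "\x{4E2D}" );
printf "UTF-8 bytes:   %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, $utf8_bytes;

# Assumption: the console interprets those bytes as code page 437.
my $misread = decode( 'cp437', $utf8_bytes );
printf "read as cp437: %s\n",
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $misread;
```

        If the second line reports U+03A3 and U+2555 (Σ and ╕) among its code points, that would match the output described above. In that case the decoding is fine and only the display is at fault; writing the output to a file and opening it in a UTF-8-aware editor would be one way to check.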

      chr( hex( 0xD0D6 ) )

      ...should be written as chr( hex( '0xD0D6' ) ), shouldn't it?


        Yes, thanks for that. I've fixed my post.