http://www.perlmonks.org?node_id=772792

dhinesh has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a Perl program that reads an Excel (.xls) file and writes its contents to a text file. All was well until one of the .xls files had a column containing Traditional Chinese (Taiwan) characters. The generated text files contain junk in that column. I'm new to Perl and have been trying out various options from CPAN, but none of them seems to work (more likely, I'm not using them correctly). Please let me know where I'm going wrong. For your information, I am using the excel convertor code from CPAN to convert the Excel file to text.

Re: Handling Traditional Chinese Characters
by ikegami (Patriarch) on Jun 18, 2009 at 18:15 UTC

    Please let me know where I'm going wrong.

    You seem to have forgotten to give us something from which we could identify something wrong.

    My first test would be to use Devel::Peek's Dump to check if I get what I need from Excel. Please provide the Dump of a variable that should contain Chinese chars.
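A minimal sketch of the kind of check suggested here, assuming the cell value has already been read into a scalar. The variables `$decoded` and `$bytes` are hypothetical stand-ins for whatever the Excel reader returns; they are not taken from the original program.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);
use Encode qw(encode);

# Hypothetical stand-ins for a value read from a spreadsheet cell.
my $decoded = "\x{7E41}\x{9AD4}\x{4E2D}\x{6587}";   # four Chinese characters as a character string
my $bytes   = encode('UTF-8', $decoded);             # the same text as raw UTF-8 bytes

# Dump writes to STDERR. For the character string, the FLAGS line should
# include UTF8; for the byte string it should not.
Dump($decoded);
Dump($bytes);

printf "decoded: %d characters, bytes: %d bytes\n", length $decoded, length $bytes;
# prints "decoded: 4 characters, bytes: 12 bytes"
```

Comparing the two dumps with the dump of the real cell value shows whether the module hands back decoded characters, UTF-8 bytes, or something else entirely.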

Re: Handling Traditional Chinese Characters
by afoken (Chancellor) on Jun 18, 2009 at 18:25 UTC

    Are you using strict? Did you enable warnings? Did you tell perl how to write Unicode characters to the text file? Does your text file viewer know how (i.e. in which encoding) perl wrote the Unicode characters to the file? What "excel convertor code from CPAN" did you use? (Tell us the URL!) And by the way: Show us the code, wrapped in CODE-tags.
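One way to tell perl how to write Unicode characters to a text file is an encoding layer on the output handle. The following is only a sketch: `write_utf8_file` is a hypothetical helper, the file name is made up, and UTF-8 is assumed to be the encoding the text viewer expects.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: write each line to $path, encoded as UTF-8.
sub write_utf8_file {
    my ($path, @lines) = @_;
    open my $out, '>:encoding(UTF-8)', $path
        or die "Cannot open $path for writing: $!";
    print {$out} "$_\n" for @lines;
    close $out or die "Cannot close $path: $!";
    return;
}

# A temporary directory keeps the example self-contained.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/cells.txt";

# The argument is a character string, not bytes; the layer does the encoding.
write_utf8_file($path, "\x{7E41}\x{9AD4}\x{4E2D}\x{6587}");
```

Without the layer, perl would print a "Wide character in print" warning (if warnings are enabled), which is one reason the questions about strict and warnings matter.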

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Handling Traditional Chinese Characters
by derby (Abbot) on Jun 18, 2009 at 18:32 UTC

    Without any code to review, I would assume that the Chinese characters are in UCS-2 (UCS-2BE) in Excel and you're treating them like UTF-8 (or, yikes, ISO-8859-1) when outputting. Can you post *just* those snippets of the code that read the Excel cell values?
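If that guess is right, the fix would be to decode the raw cell bytes before writing them out. A small sketch, assuming the cell arrives as big-endian two-byte code units; `$raw_cell` is a hypothetical value, and `UTF-16BE` is used here because it matches UCS-2BE for characters in the Basic Multilingual Plane:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical raw cell value: two Chinese characters (U+4E2D, U+6587)
# stored as big-endian 16-bit code units.
my $raw_cell = "\x4E\x2D\x65\x87";

my $text = decode('UTF-16BE', $raw_cell);   # bytes -> Perl characters
my $utf8 = encode('UTF-8', $text);          # characters -> UTF-8 bytes for output

printf "%d bytes in, %d characters, %d UTF-8 bytes out\n",
    length $raw_cell, length $text, length $utf8;
# prints "4 bytes in, 2 characters, 6 UTF-8 bytes out"
```

Whether the real module returns such bytes is exactly what a Dump of the cell value would confirm.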

    -derby
Re: Handling Traditional Chinese Characters
by Polyglot (Chaplain) on Jun 18, 2009 at 23:48 UTC
    From experience I can tell you that not every CPAN module is capable of properly handling the Asian languages, especially Chinese, Japanese, and Korean (CJK). I would guess, in fact, that the majority of them are not compatible with these languages. I have frequently had to write my own code to deal with them because of this.

    Here are some tips on ways to deal with everything in UTF8:

    use Encode qw(encode decode);

    # Send UTF-8 to STDOUT (here, a CGI response).
    binmode STDOUT, ':utf8';
    print "Content-type: text/html; charset=utf-8\n\n";

    # Read and write files through UTF-8 encoding layers.
    open SOURCE, '<:encoding(utf8)', $sourcefile or die "Cannot open source! $!\n";
    open (TARGET, ">:encoding(utf8)", "$targetfile") or die "Cannot open target file! $!\n";

    # Declare the encoding in any HTML you generate.
    print TARGET <<HTML;
    <html lang="utf8">
    <head>
    <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf8">
    ...
    <form name="myform" method="POST" accept-encoding="UTF-8" accept-charset="utf-8" action="$thisprogram">
    ...
    HTML

    # Decode raw bytes into characters (only for input not read through an encoding layer).
    foreach $line (@source) {
        $line = decode("utf-8", $line);
    }

    Note that you do not need all of these at once. For example, if you already read the file through an :encoding(utf8) layer, the lines are already decoded and there is no need to decode each one again. In fact it is best not to: calling decode on text that is already decoded can garble it or die with "Cannot decode string with wide characters", so decode each piece of input exactly once.
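As a rough illustration of picking one decoding step rather than stacking them, the sketch below reads the same file two ways: through an encoding layer, and as raw bytes followed by a single decode. The sample file and variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempdir);

# Create a small sample file holding two Chinese characters as UTF-8 bytes.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/sample.txt";
open my $out, '>:raw', $path or die "Cannot write $path: $!";
print {$out} "\xE4\xB8\xAD\xE6\x96\x87\n";
close $out or die "Cannot close $path: $!";

# Option 1: let the I/O layer decode while reading.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
my $via_layer = <$in>;
close $in;
chomp $via_layer;

# Option 2: read raw bytes, then decode them exactly once.
open $in, '<:raw', $path or die "Cannot read $path: $!";
my $raw_line = <$in>;
close $in;
chomp $raw_line;
my $via_decode = decode('UTF-8', $raw_line);

print "same text either way\n" if $via_layer eq $via_decode;
```

Either option yields a two-character string; the point is to use one of them per input, not both.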

    Blessings,

    ~ Polyglot ~