PerlMonks  

Re^4: Parsing a .xlsx file with chinese characters

by Sithiris (Novice)
on Oct 03, 2011 at 21:32 UTC (#929419)


in reply to Re^3: Parsing a .xlsx file with chinese characters
in thread Parsing a .xlsx file with chinese characters

Thanks for trying it. From a quick Google search for xterm, I'm guessing you ran the script in a non-Windows environment? Could that have an effect on its success? I doubt it, considering Excel is a Windows-based program.


Re^5: Parsing a .xlsx file with chinese characters
by anneli (Pilgrim) on Oct 05, 2011 at 04:44 UTC

    You're right; I ran it on a Linux VM.

    If you're running this in the Windows terminal (cmd.exe or what have you), I'm inclined to think the problem isn't with the output from Excel::Spreadsheet, but that cmd doesn't display UTF-8 properly.

    What if you redirect the output of the script to a .html file, then try loading it in a browser? Make sure the encoding gets detected as UTF-8. If it displays correctly, it's just the terminal, and your data is fine. :)
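As a rough sketch of the browser check suggested above, the script could also write the HTML file itself and declare the encoding explicitly, so the browser does not have to guess. The file name `utf8_check.html` and the sample character are assumptions for illustration only; in the real script the cell values read from the spreadsheet would take the place of the sample string.

```perl
use strict;
use warnings;

# Hypothetical output file name, chosen only for this example.
my $file = 'utf8_check.html';

# The :encoding(UTF-8) layer turns Perl characters into UTF-8 bytes on output.
open my $fh, '>:encoding(UTF-8)', $file
    or die "Cannot open $file: $!";

# Declare the charset in the page so the browser does not need to detect it.
print {$fh} qq{<!DOCTYPE html>\n<html><head><meta charset="utf-8"></head><body>\n};

# Sample character U+73E0; real cell values would go here instead.
print {$fh} "<p>\x{73E0}</p>\n";

print {$fh} "</body></html>\n";
close $fh or die "Cannot close $file: $!";
```

If the character shows up correctly in the browser, the data coming out of the script is probably fine and the problem lies with whatever is displaying it.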

      My script already prints to a UTF-8 encoded text file. I opened that file in Word and it displayed without any errors, just with the wrong characters.

      What I am thinking is that it may be "deconstructing" the character: for example, instead of "\x{2013}" it is displaying "\xE2", "\x80", "\x93". If this is the case, would there be a way to force it?

        I think Word is probably half-responsible for the mangling here. If it's trying to display each byte, then it means it's not actually reading it as UTF-8, but in some other encoding!
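To make the byte-versus-character point concrete, here is a minimal sketch using the core Encode module. It takes the U+2013 character from the question, encodes it as UTF-8, and then decodes the same bytes two ways. The choice of cp1252 as the "wrong" encoding is an assumption; it is a common Windows default, but the application may be using a different one.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One character: U+2013 (EN DASH), as in the question above.
my $char = "\x{2013}";

# Encoding it as UTF-8 yields a three-byte sequence.
my $bytes = encode('UTF-8', $char);
my $hex   = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

# Reading those same bytes as cp1252 (assumed wrong encoding) gives
# one character per byte instead of one character overall.
my $misread     = decode('cp1252', $bytes);
my $misread_cps = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $misread;

# Reading them as UTF-8 recovers the original single character.
my $roundtrip = decode('UTF-8', $bytes);

print "UTF-8 bytes:    $hex\n";
print "read as cp1252: $misread_cps\n";
print "read as UTF-8:  ", sprintf('U+%04X', ord($roundtrip)), "\n";
```

The bytes in the file are the same in both cases; only the encoding the reader assumes is different.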

        I'll give an example using Windows. First, here's utf8.pl:

        # U+73E0 ("pearl")
        print "\xe7\x8f\xa0";

        Now, I execute that and redirect it to both utf8.html and utf8.txt.

        Chrome displays the character correctly, because it assumes UTF-8 by default. Notepad also appears smart enough to guess the encoding.

        On my system at least, opening the file with Word prompts me to select the encoding; and by default, it guesses UTF-8 and renders the character correctly. Note that if I pick "Windows (Default)" or "MS-DOS", I get garbage.

        So try messing with Word a bit; if you use the File -> Open menu (instead of just opening the file from Explorer directly), you can get additional conversion options (sometimes!).

        Anne
