 
PerlMonks  

Re: Mixed character encoding issues

by nikosv (Hermit)
on Jul 06, 2012 at 14:48 UTC


in reply to Mixed character encoding issues

I believe Excel stores data using cp1252

I don't think that's correct. Excel is Unicode-enabled by default. Try it out by entering a character that exists in Unicode but not in cp1252:

Download a free Japanese font (available here), install it, open a worksheet, choose Insert > Symbol, find the font and click on a letter. Save the file and open it again. The character should still be there in its original form.

Since MS has a twisted notion of Unicode, I assume the Excel file is saved as UTF-16, which is what MS calls Unicode (while UTF-8 is considered multi-byte).

My hunch is that some sort of double encoding is happening, so I would suggest converting the data from UTF-16 to UTF-8.
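
To make "double encoding" concrete, here is a minimal sketch using only the core Encode module. The sample character is my own choice, not data from the original question. It shows what happens when bytes that are already UTF-8 get encoded a second time, and how decoding one layer undoes it.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One character, U+00E9 (e with acute accent). Sample data, chosen for illustration.
my $text = "\x{E9}";

# Correct: encode once. UTF-8 for U+00E9 is the two bytes C3 A9.
my $once = encode('UTF-8', $text);

# Bug: encode the already-encoded bytes again. Perl treats each byte as a
# Latin-1 character (U+00C3, U+00A9) and encodes those, giving four bytes.
my $twice = encode('UTF-8', $once);

# Print both byte sequences as hex.
printf "once:  %s\n", join ' ', map { sprintf '%02X', ord } split //, $once;   # C3 A9
printf "twice: %s\n", join ' ', map { sprintf '%02X', ord } split //, $twice;  # C3 83 C2 A9

# Repair: decoding one layer gives back the correctly encoded bytes.
my $repaired = decode('UTF-8', $twice);
print "repaired matches single encoding\n" if $repaired eq $once;
```

If output shows two odd characters where one accented character should be, this pattern is a likely suspect.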


Re^2: Mixed character encoding issues
by ddaupert (Initiate) on Jul 06, 2012 at 21:44 UTC
    My hunch is that some sort of double encoding is happening, so I would suggest converting the data from UTF-16 to UTF-8.

    I did have that thought, and made some attempts to decode from utf16 using Encode.pm, but was unsuccessful. I tried using variations on this convention:

    $string = decode("utf16", $octets);

    but I get this error:

    UTF-16:Unrecognised BOM 3230 at C:/Dwimperl/perl/lib/Encode.pm line 176

    Can you give me a nudge as to how to decode from UTF16?

    I appreciate your help very much. The information regarding the font is interesting. I am certain the language at issue is not Japanese, but the principle should be the same. I will try to find out through our deployment group what languages are in play. That info has been difficult to come by so far.

    /dennis

      The BOM (byte order mark) is a byte sequence prepended to the file contents that identifies the document as UTF-16 and indicates its byte order.

      I am under the impression that MS documents (Excel, Word) contain this sequence not at the beginning of the file but somewhere later, because they reserve the first few bytes for their header information. (They even write a BOM for UTF-8 too, which sucks of course.)

      Because the document and BOM can be UTF-16 LE or BE, Encode needs to know which kind it is. Check this Stack Overflow answer by brian d foy.

      However, to simplify the process and save yourself the trouble with Excel, I would suggest that you open the Excel file and export the data as a UTF-8 text or CSV file. Then you can use your Perl parser/module of choice to get to the contents.
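
      A rough sketch of the difference, using only the core Encode module. The helper name decode_utf16_guess and the sample string are made up for illustration. Plain "UTF-16" relies on a leading BOM (FF FE or FE FF) to pick the byte order, while "UTF-16LE" and "UTF-16BE" name it explicitly. The number in the error above appears to be the first two octets in hex; 32 30 would be the ASCII characters "2" and "0", which hints that the octets did not start with a BOM at all.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: use the BOM when there is one, otherwise fall back
# to a byte order chosen by the caller (little-endian unless told otherwise).
sub decode_utf16_guess {
    my ($octets, $default) = @_;
    $default ||= 'UTF-16LE';
    # FF FE marks little-endian, FE FF marks big-endian.
    return decode('UTF-16', $octets) if $octets =~ /\A(?:\xFF\xFE|\xFE\xFF)/;
    return decode($default, $octets);
}

# Sample data standing in for octets read from a file.
my $text = "caf\x{E9}";
my $le   = encode('UTF-16LE', $text);   # no BOM
my $be   = encode('UTF-16BE', $text);   # no BOM

my $from_le_bom = decode_utf16_guess("\xFF\xFE" . $le);   # BOM decides: LE
my $from_be_bom = decode_utf16_guess("\xFE\xFF" . $be);   # BOM decides: BE
my $from_le     = decode_utf16_guess($le);                # no BOM, default LE
my $from_be     = decode_utf16_guess($be, 'UTF-16BE');    # no BOM, caller says BE

# Once decoded, the characters can be written out as UTF-8.
my $utf8 = encode('UTF-8', $from_le);

print "all four decodes agree\n"
    if  $from_le_bom eq $text && $from_be_bom eq $text
    &&  $from_le     eq $text && $from_be     eq $text;
```

      The fallback byte order is a guess. If the data turns out not to be UTF-16 at all, no choice of byte order will rescue it.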

        nikosv: I took your suggestion to export from excel as a csv file, and that did help a great deal. I want to thank you for that. It gave me a solid place to move from. By reading the csv file and writing it immediately out as a first step, I was able to verify the characters were like for like. Then I stacked on the additional steps of importing into the DB, etc., and watched what happened.
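
        The read-then-write check can be sketched roughly like this, assuming the export is UTF-8. The file names and the sample row are invented for the example, and a temporary file stands in for the real export.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Read a whole file as raw bytes, with no decoding.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    local $/;
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

my $dir = tempdir(CLEANUP => 1);
my ($in, $out) = ("$dir/export.csv", "$dir/copy.csv");   # hypothetical file names

# Stand-in for the file exported from Excel: a header and one UTF-8 row.
open my $seed, '>:encoding(UTF-8)', $in or die "open $in: $!";
print {$seed} "id,name\n1,Zo\x{EB} \x{263A}\n";
close $seed or die "close $in: $!";

# Decode on the way in, encode on the way out, change nothing in between.
open my $src, '<:encoding(UTF-8)', $in  or die "open $in: $!";
open my $dst, '>:encoding(UTF-8)', $out or die "open $out: $!";
print {$dst} $_ while <$src>;
close $src;
close $dst or die "close $out: $!";

# If the layers are right, the two files are identical byte for byte.
my $same = slurp_raw($in) eq slurp_raw($out);
print $same ? "round trip preserved the bytes\n" : "round trip changed the bytes\n";
```

        Starting from a copy that is known to be identical makes it easier to tell which later step introduces the mangling.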

        It was also quite useful to understand the bit about double encoding. I noticed as I made code changes that the data became more or less mangled. I found something quite amazing by watching this behavior. When creating my tables in SQLite, I used DBI more or less directly; I had been using the sqlite_unicode connection setting, but found I needed NOT to set sqlite_unicode => 1 in the connection statement. But contrary to that, when running queries through DBIx::Class machinery, I DID need to set sqlite_unicode => 1 in the DB connection. Once all that was sorted out, all data were read, input into the DB, read back out of the DB, and written out to the final files while preserving proper encoding.
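
        For anyone landing here later, a small sketch of the sqlite_unicode => 1 case with plain DBI. It assumes DBD::SQLite is installed and uses an in-memory database, with a table and value invented for the example. With the flag on, the driver takes character strings and handles UTF-8 itself, so encoding by hand before the insert would be one way to end up double encoded. That is only one possible reading of what happened above.

```perl
use strict;
use warnings;
use DBI;

# In-memory database so the sketch leaves no file behind.
my $dbh = DBI->connect(
    'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, sqlite_unicode => 1 },
);

$dbh->do('CREATE TABLE people (name TEXT)');   # hypothetical table

# A character string (not pre-encoded bytes): e-acute plus U+263A.
my $name = "caf\x{E9} \x{263A}";
$dbh->do('INSERT INTO people (name) VALUES (?)', undef, $name);

# With sqlite_unicode on, the value comes back as a character string.
my ($back) = $dbh->selectrow_array('SELECT name FROM people');

print $back eq $name ? "round trip ok\n" : "round trip mismatch\n";
$dbh->disconnect;
```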

        Many, many thanks.

        /dennis
