Mixed character encoding issues

by ddaupert (Initiate)
on Jul 05, 2012 at 22:20 UTC
ddaupert has asked for the wisdom of the Perl Monks concerning the following question:

I've had no dealings with encoding previously, and am running into a bit of complexity. This may be a bit long, so let me first summarize.

SUMMARY: There are multiple steps the data passes through where encoding may come into play. The input step differs between production and testing, but all further steps are the same. Production will import data from an MSSQL database, whereas testing will read data from Excel spreadsheets. For now, I am focused on the testing process. (Yes, it would be better to also import test data from MSSQL, but we are constrained.)

OUTLINE of the testing source-to-destination steps:

  1. Input data using Excel spreadsheets (.xls)
  2. Store data in SQLite Database
  3. Query SQLite and process the data
  4. Write output to csv files

The input spreadsheet I have contains user data where names and addresses are from some European countries (I do not know which, nor what languages are in play.) I am running into garbled characters from the very first step. I am not sure how to inspect the data at each step to compare data along the way.
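
One way to make the data visible at each step is to dump the code points and Perl's internal flags for a suspect cell at every boundary (parser output, database fetch, pre-CSV write). A minimal sketch using only core modules; the sample value is made up:

use strict;
use warnings;
use Devel::Peek qw(Dump);

# stand-in for whatever value the current step handed back
my $cell = "Toxv\x91rd";

# one hex code point per character; anything above ff means the
# string has already been decoded into characters
printf "code points: %vx\n", $cell;

# Dump shows the UTF8 flag and the raw internal bytes, i.e. whether
# Perl currently treats this scalar as octets or as characters
Dump($cell);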

I believe Excel stores data using cp1252, so I have tried to convert the data to utf8. The code below gives the error "Cannot decode string with wide characters", which makes me think there may be mixed-encoded data in my input; that is, maybe some columns are in a different encoding than other columns.

package DB::SQLite::Tables;

use warnings;
use strict;
use Carp;
use Encode;
use Spreadsheet::ParseExcel::Simple;

sub load_test_people_tbl {
    my ( $the, $path, $readfile, $sqlite_schema, $do_load ) = @_;

    my $data_obj = Spreadsheet::ParseExcel::Simple->read("$path/$readfile");
    my @fields;
    my $row_count_ppl = 0;

    foreach my $sheet ( $data_obj->sheets ) {
      ROW:
        while ( $sheet->has_data ) {
            my @row = $sheet->next_row;
            $row_count_ppl++;

            # get header from input file; the character class replaces
            # the first whitespace, '+', '|', or '-' in each header
            # with an underscore
            if ( $row_count_ppl == 1 ) {
                @fields = @row;
                map { $_ =~ s{[\s+|-]}{_}xms } @fields;
            }

            # Gives error "Cannot decode string with wide characters"
            map { $_ = decode( "utf8", $_ ) } @row;

            # Uses DBIx::Class
            if ( $do_load == 1 ) {
                $sqlite_schema->populate( 'Person', [ \@fields, \@row ] );
            }
        }    # end ROW
    }
}    # end load_test_people_tbl

1;
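
For steps 2 and 4 of the outline above, two settings matter once the cells are decoded correctly: DBD::SQLite's sqlite_unicode flag, so character data survives the round trip through the database, and an explicit :encoding layer on the CSV output handle. A minimal sketch; the schema class, connect arguments, and file names are made up:

use strict;
use warnings;
use Text::CSV;

# hypothetical DBIx::Class schema; sqlite_unicode makes DBD::SQLite
# decode/encode UTF-8 automatically on fetch and store
my $sqlite_schema = DB::SQLite::Schema->connect(
    'dbi:SQLite:dbname=test_people.db',
    undef, undef,
    { sqlite_unicode => 1 },
);

# write the output CSV through an explicit UTF-8 layer
my @row = ( "Toxv\x{E6}rd", 'Copenhagen' );    # decoded character strings
my $csv = Text::CSV->new( { binary => 1, eol => "\n" } );
open my $out, '>:encoding(UTF-8)', 'people_out.csv'
    or die "people_out.csv: $!";
$csv->print( $out, \@row );
close $out;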

Example input/output character mangling:

Some name in input xls spreadsheet: Toxv‘rd

later....

Same name in output csv spreadsheet: Toxvâ€˜rd

QUESTIONS:

  • Did I give enough information to chew on? If not, what else do I need to post?
  • Am I on the right track, that I should decode the input spreadsheet data to utf8?
  • Is there a way to determine whether the "mixed-encoding" theory is correct? That is, how do I know if the data elements in any given row are of the same encoding for all columns? (Or maybe I am not decoding the data in the right way?)
  • When I go into sqlite3's shell and query, I see garbled characters even when the data are correct. Is there a way to display them correctly in a DOS window?

/dennis

Re: Mixed character encoding issues
by zentara (Archbishop) on Jul 05, 2012 at 23:03 UTC
    I don't have Excel to test your script with, and I'm just beginning to get a handle on Unicode myself, but a similar problem was recently resolved at Re^3: Perl TK character disappearing. You can't decode your cp1252 input as utf-8; you can decode it as cp1252 and then save it as utf8.

    If I were able to run the code, this is the first thing I would try. I like the utf8::all module.

    # this will make all filehandles and STDOUT prints be utf8
    use utf8::all;    # a very simple module

    # get your excel rows and decode them

    # Gives error "Cannot decode string with wide characters"
    # map { $_ = decode("utf8", $_) } @row;

    # decode the cp1252 stuff properly
    map { $_ = decode("cp1252", $_) } @row;

    # prints to the csv file will be utf8

    The Windows-1252 characters from \x{0080} through \x{009f} might also be useful to you.
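
    For instance, the byte 0x91 from that block decodes under cp1252 to U+2018, the left single quotation mark that appears in "Toxv‘rd" above. A quick check with Encode:

    use strict;
    use warnings;
    use Encode qw(decode encode);

    # 0x91 sits in the \x{0080}-\x{009f} block; under cp1252 it is
    # the left single quotation mark, U+2018
    my $char = decode( 'cp1252', "\x91" );
    printf "U+%04X\n", ord $char;           # U+2018

    # round trip: the same character encodes back to the byte 0x91
    my $octet = encode( 'cp1252', $char );
    printf "0x%02X\n", ord $octet;          # 0x91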


    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh

      I appreciate the help, zentara. I did install utf8::all and used it in the packages where I open/close files. Alas, it made no difference in the output; I got the same character mangling as before. But I am sure the utf8::all module will come in handy. Thanks for mentioning it.

      Regarding the decode suggestion, I based my usage on the Encode documentation:

      ...to convert ISO-8859-1 data into a string in Perl's internal format:

      $string = decode("iso-8859-1", $octets);

      This was just a 'Hail Mary' attempt; I am not at all convinced my input string is octets, and when I precede this function with its mate, encode, to translate the string into octets first, the output gets even more mangled.
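
      One way to settle the octets question is to inspect the scalar itself: any code point above 0xFF cannot be a raw octet, so a string containing one has already been decoded, and decoding it again is exactly what raises the wide-character error. A small sketch; the sample values are made up:

      use strict;
      use warnings;

      sub looks_decoded {
          my ($s) = @_;
          # a code point above 0xFF cannot be a raw octet, so the
          # string must already hold decoded characters
          return $s =~ /[^\x00-\xFF]/ ? 1 : 0;
      }

      # utf8::is_utf8 reports Perl's internal flag; suggestive, but
      # not proof on its own that the data was decoded correctly
      for my $s ( "Toxv\x91rd", "Toxv\x{2018}rd" ) {
          printf "wide=%d utf8_flag=%d\n",
              looks_decoded($s), utf8::is_utf8($s) ? 1 : 0;
      }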

      When I get off work today, I will look into the other links you mentioned.

      /dennis

Re: Mixed character encoding issues
by nikosv (Hermit) on Jul 06, 2012 at 14:48 UTC
    I believe Excel stores data using cp1252

    I don't think that's correct. Excel is Unicode-enabled by default. Try it out by entering a character from the Unicode domain:

    Download a free Japanese font (available here), install it, open a worksheet, do Insert > Symbol, find the font, and click on a letter; save the file, then open it again. The character should still be there in its original representation.

    Since MS has a twisted notion of Unicode, I expect that the Excel file is saved as UTF-16, which is what MS considers "Unicode" (while UTF-8 is considered "multi-byte").

    My hunch is that you are doing some sort of double encoding, so I would suggest decoding the characters from UTF-16 to UTF-8.

      My hunch is that you are doing some sort of double encoding, so I would suggest decoding the characters from UTF-16 to UTF-8

      I did have that thought, and made some attempts to decode from utf16 using Encode.pm, but was unsuccessful. I tried using variations on this convention:

      $string = decode("utf16", $octets);

      but I get this error:

      UTF-16:Unrecognised BOM 3230 at C:/Dwimperl/perl/lib/Encode.pm line 176

      Can you give me a nudge as to how to decode from UTF16?

      I appreciate your help very much. The information regarding the font is interesting. I am certain the language at issue is not Japanese, but the principle should be the same. I will try to find out through our deployment group what languages are in play. That info has been difficult to come by so far.

      /dennis

        The BOM is a byte sequence that identifies the document as UTF-16 and is prepended to the file contents.

        I am under the impression that MS documents (Excel, Word) put this sequence not at the beginning of the file but somewhere later, because they reserve the first few bytes for their own header information. (They even write a BOM for UTF-8 too, which sucks, of course.)

        Because the document and its BOM can be UTF-16 LE or BE, Encode needs to know which kind it is. Check this Stackoverflow answer by brian d foy.
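
        In code the distinction looks something like the sketch below: read raw octets, check the first two bytes, and pick the matching variant (the file name is made up). Plain "UTF-16" tells Encode to look for a BOM itself; the LE/BE variants expect the BOM to be gone already.

        use strict;
        use warnings;
        use Encode qw(decode);

        # read the raw octets first; 'export.txt' is a hypothetical file
        open my $fh, '<:raw', 'export.txt' or die "export.txt: $!";
        my $octets = do { local $/; <$fh> };
        close $fh;

        my $text;
        if ( $octets =~ /\A\xFF\xFE/ ) {       # little-endian BOM
            $text = decode( 'UTF-16LE', substr $octets, 2 );
        }
        elsif ( $octets =~ /\A\xFE\xFF/ ) {    # big-endian BOM
            $text = decode( 'UTF-16BE', substr $octets, 2 );
        }
        else {
            # no BOM at all: the data is probably not UTF-16 in the
            # first place, which is what "Unrecognised BOM 3230" hints
            # at; 0x32 0x30 are simply the ASCII digits "20"
            $text = decode( 'cp1252', $octets );
        }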

        However, to simplify the process and save yourself the trouble with Excel, I would suggest that you open the Excel file and export the data as a UTF-8 text or CSV file. Then you can use your Perl parser/module of choice to get at the contents.
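
        If you take that route, the one thing to get right is the encoding layer on the filehandle. Excel's "Unicode Text" export is typically UTF-16LE and tab-separated, while a plain "CSV (Comma delimited)" export uses the system codepage. A sketch with Text::CSV; the file name is made up:

        use strict;
        use warnings;
        use Text::CSV;

        binmode STDOUT, ':encoding(UTF-8)';

        # "Unicode Text" exports are tab-separated
        my $csv = Text::CSV->new( { binary => 1, sep_char => "\t" } )
            or die Text::CSV->error_diag;

        # match the layer to what Excel wrote: UTF-16LE here, or
        # cp1252 for a plain "CSV (Comma delimited)" export
        open my $in, '<:encoding(UTF-16LE)', 'people_export.txt'
            or die "people_export.txt: $!";

        while ( my $row = $csv->getline($in) ) {
            $row->[0] =~ s/\A\x{FEFF}//;     # drop a leading BOM character
            print join( '|', @$row ), "\n";  # decoded character strings
        }
        close $in;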
