http://www.perlmonks.org?node_id=980169

ddaupert has asked for the wisdom of the Perl Monks concerning the following question:

I've had no dealings with encoding previously, and am running into a bit of complexity. This may be a bit long, so let me first summarize.

SUMMARY: There are multiple steps the data passes through where encoding may come into play. The input step differs between production and testing, but all further steps are the same. Production will import data from an MSSQL database, whereas testing will read data from Excel spreadsheets. For now, I am focused on the testing process. (Yes, it would be better to also import test data from MSSQL, but we are constrained.)

OUTLINE of the testing source-to-destination steps:

  1. Input data using Excel spreadsheets (.xls)
  2. Store data in SQLite Database
  3. Query SQLite and process the data
  4. Write output to csv files

The input spreadsheet I have contains user data where the names and addresses come from some European countries (I do not know which, nor what languages are in play). I am running into garbled characters from the very first step, and I am not sure how to inspect the data at each step to compare it along the way.
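
For what it is worth, something along these lines is how I am planning to poke at the values at each step (Devel::Peek is a core module; show_string and the sample value are just my own invention):

use strict;
use warnings;
use Devel::Peek qw(Dump);

# Print each character of a value as a hex code point, plus Perl's internal
# view of it (UTF8 flag, raw PV bytes), so the same cell can be compared
# after the Excel read, after the SQLite round-trip, and before the CSV write.
sub show_string {
    my ( $label, $str ) = @_;
    printf "%s: %s\n", $label,
        join ' ', map { sprintf 'U+%04X', ord $_ } split //, $str;
    Dump($str);    # FLAGS show whether the UTF8 flag is on
}

show_string( 'example', "Toxv\x{2018}rd" );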

I believe Excel stores data using cp1252, so I have tried to convert the data to utf8. The code below gives the error "Cannot decode string with wide characters", which makes me think there may be mixed-encoded data in my input; that is, maybe some columns are in a different encoding than other columns.

package DB::SQLite::Tables;

use warnings;
use strict;
use Carp;
use Encode;
use Spreadsheet::ParseExcel::Simple;

sub load_test_people_tbl {
    my ( $the, $path, $readfile, $sqlite_schema, $do_load ) = @_;

    my $data_obj = Spreadsheet::ParseExcel::Simple->read("$path/$readfile");

    my @fields;
    my $row_count_ppl = 0;

    foreach my $sheet ( $data_obj->sheets ) {
        ROW: while ( $sheet->has_data ) {
            my @row = $sheet->next_row;
            $row_count_ppl++;

            # get header from input file
            if ( $row_count_ppl == 1 ) {
                @fields = @row;
                map { $_ =~ s{[\s+|-]}{_}xms } @fields;
            }

            # Gives error "Cannot decode string with wide characters"
            map { $_ = decode( "utf8", $_ ) } @row;

            # Uses DBIx::Class
            if ( $do_load == 1 ) {
                $sqlite_schema->populate( 'Person', [ \@fields, \@row, ] );
            }
        }    # end ROW
    }
}    # end load_test_people_tbl
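
From what I can tell (and I may well be misreading the docs), that error means the string handed to decode() already contains characters above 0xFF, i.e. it has already been decoded somewhere upstream. A minimal reproduction of the message, with a made-up value:

use strict;
use warnings;
use Encode qw(decode);

# A character string, not raw octets: it holds a code point above 0xFF.
my $already_decoded = "Toxv\x{2018}rd";
my $text = decode( "utf8", $already_decoded );    # dies: Cannot decode string with wide characters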

Example input/output character mangling:

Some name in input xls spreadsheet: Toxv‘rd

later....

Same name in output csv spreadsheet: Toxv‘rd

QUESTIONS:

/dennis

Re: Mixed character encoding issues
by zentara (Archbishop) on Jul 05, 2012 at 23:03 UTC
    I don't have Excel to test your script with, and I'm just beginning to get a handle on Unicode myself, but a similar problem was recently resolved at Re^3: Perl TK character disappearing. You can't decode your cp1252 input as utf8; you need to decode it as cp1252 and then write it out as utf8.

    If I were able to run the code, this is the first thing I would try. I like the utf8::all module.

    # this will make all filehandles and STDOUT prints be utf8
    use utf8::all;    # a very simple module
    use Encode qw(decode);

    # get your excel rows and decode them

    # Gives error "Cannot decode string with wide characters"
    # map { $_ = decode("utf8", $_) } @row;

    # decode the cp1252 stuff properly
    map { $_ = decode( "cp1252", $_ ) } @row;

    # prints to the csv file will be utf8

    Windows-1252 characters from \x{0080} thru \x{009f} might also be useful to you.
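
    Just to illustrate that range (a quick sketch, not your actual data): the byte 0x91 is a control character in latin-1 but a left single quote in cp1252, which is exactly the kind of character showing up in your output.

    use Encode qw(decode);

    # one byte, two very different characters depending on the encoding
    my $byte = "\x91";
    printf "latin-1: U+%04X\n", ord decode( "iso-8859-1", $byte );    # U+0091, a control character
    printf "cp1252:  U+%04X\n", ord decode( "cp1252",     $byte );    # U+2018, a left single quote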


    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh

      I appreciate the help, zentara. I did install utf8::all and used it in the packages where I open/close files. Alas, it made no difference as far as output. I got the same character mangling as before. But I am sure the utf8::all module will come in handy. Thanks for mentioning it.

      Regarding the decode suggestion, I based my usage on the Encode documentation:

      ...to convert ISO-8859-1 data into a string in Perl's internal format:

      $string = decode("iso-8859-1", $octets);

      This was just a 'Hail Mary' attempt; I am not at all convinced my input string is octets, and when I precede this call with its encode() counterpart to turn the string into octets first, the output comes out even more mangled.
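
      One thing I may try first is checking whether the values coming back from Spreadsheet::ParseExcel::Simple are still octets at all before calling decode on them. This is just my own rough test, nothing from the docs:

      use strict;
      use warnings;

      # If a value already holds code points above 0xFF it has already been
      # decoded somewhere upstream, and decode() would die on it.
      sub looks_decoded {
          my ($str) = @_;
          return $str =~ /[^\x00-\xFF]/ ? 1 : 0;
      }

      print looks_decoded("Toxv\x91rd"),     "\n";    # 0 - still octets (or latin-1/cp1252)
      print looks_decoded("Toxv\x{2018}rd"), "\n";    # 1 - already a character string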

      When I get off work today, I will look into the other links you mentioned.

      /dennis

Re: Mixed character encoding issues
by nikosv (Deacon) on Jul 06, 2012 at 14:48 UTC
    I believe Excel stores data using cp1252

    I don't think that's correct. Excel is Unicode-enabled by default. Try it out by entering a character that is only available in Unicode:

    Download a free Japanese font (one is available here), install it, open a worksheet, do Insert > Symbol, find the font and click on a letter, save the file, and then open it again. The character should still be there in its original representation.

    Since MS has a twisted notion of Unicode, I suspect the Excel file is saved as UTF-16, which is what MS considers "Unicode" (while UTF-8 is what it considers "multi-byte").

    My hunch is that you are doing some sort of double encoding, so I would suggest decoding the characters from UTF16 to UTF8.

      My hunch is that you are doing some sort of double encoding, so I would suggest decoding the characters from UTF16 to UTF8.

      I did have that thought, and made some attempts to decode from utf16 using Encode.pm, but was unsuccessful. I tried using variations on this convention:

      $string = decode("utf16", $octets);

      but I get this error:

      UTF-16:Unrecognised BOM 3230 at C:/Dwimperl/perl/lib/Encode.pm line 176

      Can you give me a nudge as to how to decode from UTF16?

      I appreciate your help very much. The information regarding the font is interesting. I am certain the language at issue is not Japanese, but the principle should be the same. I will try to find out through our deployment group what languages are in play. That info has been difficult to come by so far.

      /dennis

        The BOM is a byte sequence that identifies the document as UTF16 and is prepended to the file contents.

        I am under the impression that MS documents (Excel, Word) put this sequence not at the beginning of the file but somewhere later, because they reserve the first few bytes for their own header information. (They even write a BOM for UTF-8, which sucks of course.)

        Because the document and its BOM can be UTF-16 LE or BE, Encode needs to know which kind it is. Check this Stack Overflow answer by brian d foy.
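
        Roughly like this (untested, and the byte strings are made up for the example): the plain "UTF-16" name insists on a BOM, while "UTF-16LE"/"UTF-16BE" take the byte order from the name.

        use Encode qw(decode);

        my $with_bom = "\xFF\xFE\x41\x00";    # BOM (FF FE) followed by "A", little-endian
        my $no_bom   = "\x41\x00";            # "A" little-endian, no BOM

        print decode( "UTF-16",   $with_bom ), "\n";    # ok: the BOM says little-endian
        print decode( "UTF-16LE", $no_bom ),   "\n";    # ok: byte order given explicitly
        # decode("UTF-16", $no_bom) is what produces the "Unrecognised BOM" error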

        However, to simplify the process and save yourself the trouble with Excel, I would suggest that you open the Excel file and export the data as a UTF-8 text or CSV file. Then you can use your Perl parser/module of choice to get at the contents.
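
        If you go that route, reading the exported file back is just a matter of opening it with the right layer. A bare-bones sketch with Text::CSV (the filename and the module choice are of course just placeholders):

        use strict;
        use warnings;
        use Text::CSV;

        my $csv = Text::CSV->new( { binary => 1, auto_diag => 1 } );

        # "people.csv" stands in for whatever you exported from Excel as UTF-8 CSV
        open my $fh, '<:encoding(UTF-8)', 'people.csv' or die "people.csv: $!";
        while ( my $row = $csv->getline($fh) ) {
            print join( '|', @$row ), "\n";    # values arrive as decoded character strings
        }
        close $fh;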