davis has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to extract data from Excel spreadsheets. I'm pretty sure most of the data in these spreadsheets is encoded as ISO 8859-1. Because this data is going into an XML file, and from there into a MySQL database, I'm trying to do The Right Thing and coerce the data into Unicode as early as possible.

Here's the problem:
Some of the data appears to be in a different encoding. As a specific example, it sometimes contains a "—" (that's a em-dash character, hopefully) which gets extracted as ^S (CTRL-S, try typing that in an xterm :) ). To my knowledge, this isn't in ISO 8859-1.

At the moment, I'm writing character-specific code that processes each character on a case-by-case basis. E.g. I'm converting Excel's rendition of em dash to "--". Is this the best way of doing it? should I be decodeing from a different character set?

Update: It looks like the em dash character is in Windows-1252, a superset of ISO 8859-1... perhaps that's the encoding I'm seeing... but using "cp1252" as the argument to decode doesn't seem to fix the problem.

Kids, you tried your hardest, and you failed miserably. The lesson is: Never try.
  • Comment on What to do when converting Excel-supplied data to Unicode

Replies are listed 'Best First'.
Re: What to do when converting Excel-supplied data to Unicode
by ForgotPasswordAgain (Priest) on May 23, 2006 at 10:56 UTC
      I'd just discovered that after posting my node, unfortunately, even after trying to decode from cp1252 the un-handy CTRL-Ss are still there. Thanks anyway.

      Kids, you tried your hardest, and you failed miserably. The lesson is: Never try.
Re: What to do when converting Excel-supplied data to Unicode
by bpphillips (Friar) on May 23, 2006 at 13:48 UTC
    I went through this problem recently. Hopefully the solution I hacked together will work for you as well. Hopefully you're using Spreadsheet::ParseExcel to read the Excel file or this advice will do you no good! :-)

    Spreadsheet::ParseExcel allows you to specify a "Formatter" class when you parse an excel file which gives you some flexibility to decode or convert or do whatever other things you want to do as it reads in the file.

    I couldn't get any of the Spreadsheet::ParseExcel::Fmt* modules to work how I wanted (I wanted the data to end up as UTF-8 data) so I wrote my own thin subclass of the Spreadsheet::ParseExcel::FmtDefault module:
    package My::Excel::FmtUTF8; use strict; use base 'Spreadsheet::ParseExcel::FmtDefault'; use Encode qw(decode); # the super-class isn't very friendly to sub-classing, so we have to o +verride this to make # sure it's blessed into the right class sub new { my $class = shift; return bless {}, $class; } # the only other method we need to override... sub TextFmt { my ($self,$data,$encoding) = @_; # Spreadsheet::ParseExcel will pass in the encoding to us! # or, it passes nothing in if it's iso-8859-1 $encoding ||= 'iso-8859-1'; # we perform the decoding in a "fatal" manner so that if it fai +ls, # we'll just pass the data back as-is my $decoded = eval { decode($encoding,$data,1) } || $data; return $decoded; }
    Then, when you parse the file, you do something like this:
    my $parser = Spreadsheet::ParseExcel->new(); my $data = $parser->Parse( $excel_file_name, My::Excel::FmtUTF8->new() );
    That seemed to work for all the non-ASCII data that I had to deal with. Hopefully it will work for you too! :-)

    -- Brian
      That's an interesting way of doing it, and I prefer the fact the data is fixed closer to the source.... I do use my Parsing module (which does use Spreadsheet::ParseExcel) for some stuff other than extraction to XML, so this is probably worth doing. I think I'll integrate my daft special case fixes with this Formatter class and call it Correct. It's probably also worth looking at putting in some code to display dates as "dd mmm yyyy" or "YYYY-MM-DD" too....

      Kids, you tried your hardest, and you failed miserably. The lesson is: Never try.
        I'd be interested to know if my specific solution makes the "special case" code unnecessary. IIRC, the data that is outside of ISO-8859-1 is encoded by Excel as UTF-16 (Big Endian I think) which would mean that everything would be valid UTF-8 data coming out -- without having to hardcode a list of search-and-replace operations for specific characters.

        -- Brian

        P.S. - By the way, my specific experience was with Excel 2003. I'm guessing older versions of the file format may have handled things differently.
      Hi, I tried using your formatter class but I do not see any difference in my garbled output. However, if I print the data inside the overriding TextFmt function, it looks correct. I read the data as in the following:

      my $formatter = My::Excel::FmtUTF8->new(); my $parser = Spreadsheet::ParseExcel->new(); my $excel = $parser->Parse($file, $formatter); my $sheet = $excel->{Worksheet}->[0]; ... my $cell = $sheet->{Cells}[$row][$col]; my $cellVal = $cell->{Val};

      Does that look correct? Why does my $cellVal end up different than $decoded in TextFmt()?

      Is there any other (easy/perl) way to read Excel files that have Unicode data? It is confusing to me that I can write Unicode data to an excel file using WriteExcel, but I cannot read it.


        Curious. When do you print $decoded inside the sub? Just before the return?


Re: What to do when converting Excel-supplied data to Unicode
by davis (Vicar) on May 23, 2006 at 13:51 UTC

    Well, it's solved well enough for me. I treat the text as CodePage 1252, but before decoding it I fix some special cases such as the dash, a weird apostrophe-type character, and some others. The resulting XML is well-formed, the data in MySQL looks good, and the produced XHTML-strict is valid and compliant. I'm happy with that.

    Kids, you tried your hardest, and you failed miserably. The lesson is: Never try.
      # ------------------------------------------------------ # convert an excel file into a hash ref of hash ref of hash refs # ------------------------------------------------------ sub doReadXlsFileToHsr2 { my $self = shift; my $xls_file = shift; my $ret = 1; my $msg = "open the xls_file: $xls_file"; $objLogger->doLogDebugMsg($msg); my $formatter = Spreadsheet::ParseExcel::FmtJapan->new(); my $objXlsParser = 'Spreadsheet::ParseExcel'->new(); # my $objWorkbook = $objXlsParser->Parse( $xls_file , $formatte +r ); my $objWorkbook = $objXlsParser->Parse($xls_file); my $hsr2 = {}; # this is the data hash ref of hash refs # check if we are using Excel2007 open xml format if (!defined $objWorkbook) { # works too my $objConverter = () ; my $objConverter = Text::Iconv->new("utf-8", "utf-8"); # my $objConverter = (); $objWorkbook = Spreadsheet::XLSX->new($xls_file, $objConverter); # exit the whole application if there is no excel defined if (!defined $objWorkbook) { $msg = "cannot parse \$xls_file $xls_file $! $objXlsParser->erro +r()"; $objLogger->doLogErrorMsg("$msg"); return ($ret, $msg, {}); } } #eof if not $objWorkbook foreach my $worksheet (@{$objWorkbook->{Worksheet}}) { my $hsWorkSheet = {}; my $WorkSheetName = $worksheet->{'Name'}; next unless $WorkSheetName =~ m/^.*_issues$/g; $objLogger->doLogDebugMsg("foreach my worksheet: " . $WorkSheetNam +e) if ($module_trace == 1); my $RowMin = $worksheet->{'MinRow'}; my $RowMax = $worksheet->{'MaxRow'}; # my ( $RowMin, $RowMax) = $worksheet->row_range(); # my ( $MinCol, $MaxCold ) = $worksheet->col_range(); my $row_num = 0; for my $row ($RowMin .. $RowMax) { my $hsRow = {}; my $MinCol = $worksheet->{'MinCol'}; my $MaxCol = $worksheet->{'MaxCol'}; #debug print "MinCol::$MinCol , MaxCol::$MaxCol \n" ; my $col_num = 0; #print "row_num:: $row_num \n" ; for my $col ($MinCol .. $MaxCol) { # debug print "col_num:: $col_num \n" ; my $cell = $worksheet->{'Cells'}[$row][$col]; my $obj_header = $worksheet->{'Cells'}[0][$col]; my $header = $obj_header->unformatted(); my $token = ''; # to represent NULL in the sql unless (defined($cell)) { $token = 'NULL'; } else { # this one seems to return the value ONLY if # it is formateed properly with Ctrl + 1 # $token = $cell->Value(); # this one seems to return the value as it has been typed in +to ... $token = $cell->unformatted(); # this is must have !!! $token = decode('utf8', $token); # $token = $cell->{'Val'} ; my $encoding = $cell->encoding(); # debug print "token is :: " . $token . "\n" ; # debug print "encoding is :: " . $encoding . "\n" ; # debug print "is_utf8 " . is_utf8 ( $token ) ; # p($token); # and this one of those wtf moments ?! $token =~ s/\&gt;/\>/g; $token =~ s/\&lt;/\</g; $token =~ s/\&amp;/\&/g; } $hsRow->{$header} = $token; $col_num++; } #eof for col $hsWorkSheet->{"$row_num"} = $hsRow; $row_num++; # debug sleep 3 ; } #eof foreach row $hsr2->{"$WorkSheetName"} = $hsWorkSheet; # p($hsWorkSheet ); } $ret = 0; $msg = 'xls file parse OK'; return ($ret, $msg, $hsr2); }