http://www.perlmonks.org?node_id=551157


in reply to What to do when converting Excel-supplied data to Unicode

I went through this problem recently. Hopefully the solution I hacked together will work for you as well. Hopefully you're using Spreadsheet::ParseExcel to read the Excel file or this advice will do you no good! :-)

Spreadsheet::ParseExcel allows you to specify a "Formatter" class when you parse an excel file which gives you some flexibility to decode or convert or do whatever other things you want to do as it reads in the file.

I couldn't get any of the Spreadsheet::ParseExcel::Fmt* modules to work how I wanted (I wanted the data to end up as UTF-8 data) so I wrote my own thin subclass of the Spreadsheet::ParseExcel::FmtDefault module:
package My::Excel::FmtUTF8; use strict; use base 'Spreadsheet::ParseExcel::FmtDefault'; use Encode qw(decode); # the super-class isn't very friendly to sub-classing, so we have to o +verride this to make # sure it's blessed into the right class sub new { my $class = shift; return bless {}, $class; } # the only other method we need to override... sub TextFmt { my ($self,$data,$encoding) = @_; # Spreadsheet::ParseExcel will pass in the encoding to us! # or, it passes nothing in if it's iso-8859-1 $encoding ||= 'iso-8859-1'; # we perform the decoding in a "fatal" manner so that if it fai +ls, # we'll just pass the data back as-is my $decoded = eval { decode($encoding,$data,1) } || $data; return $decoded; }
Then, when you parse the file, you do something like this:
my $parser = Spreadsheet::ParseExcel->new(); my $data = $parser->Parse( $excel_file_name, My::Excel::FmtUTF8->new() );
That seemed to work for all the non-ASCII data that I had to deal with. Hopefully it will work for you too! :-)

-- Brian

Replies are listed 'Best First'.
Re^2: What to do when converting Excel-supplied data to Unicode
by davis (Vicar) on May 23, 2006 at 13:56 UTC
    That's an interesting way of doing it, and I prefer the fact the data is fixed closer to the source.... I do use my Parsing module (which does use Spreadsheet::ParseExcel) for some stuff other than extraction to XML, so this is probably worth doing. I think I'll integrate my daft special case fixes with this Formatter class and call it Correct. It's probably also worth looking at putting in some code to display dates as "dd mmm yyyy" or "YYYY-MM-DD" too....
    cheers!

    davis
    Kids, you tried your hardest, and you failed miserably. The lesson is: Never try.
      I'd be interested to know if my specific solution makes the "special case" code unnecessary. IIRC, the data that is outside of ISO-8859-1 is encoded by Excel as UTF-16 (Big Endian I think) which would mean that everything would be valid UTF-8 data coming out -- without having to hardcode a list of search-and-replace operations for specific characters.

      -- Brian

      P.S. - By the way, my specific experience was with Excel 2003. I'm guessing older versions of the file format may have handled things differently.
Re^2: What to do when converting Excel-supplied data to Unicode
by ITFinanceGuy (Initiate) on Feb 20, 2009 at 22:53 UTC
    Hi, I tried using your formatter class but I do not see any difference in my garbled output. However, if I print the data inside the overriding TextFmt function, it looks correct. I read the data as in the following:

    my $formatter = My::Excel::FmtUTF8->new(); my $parser = Spreadsheet::ParseExcel->new(); my $excel = $parser->Parse($file, $formatter); my $sheet = $excel->{Worksheet}->[0]; ... my $cell = $sheet->{Cells}[$row][$col]; my $cellVal = $cell->{Val};

    Does that look correct? Why does my $cellVal end up different than $decoded in TextFmt()?

    Is there any other (easy/perl) way to read Excel files that have Unicode data? It is confusing to me that I can write Unicode data to an excel file using WriteExcel, but I cannot read it.

    Regards,
    ITFinanceGuy

      Curious. When do you print $decoded inside the sub? Just before the return?

      davis

        Yes. I just figured it out. I was reading the data using $cell->{Val} which apparently contains the unformatted cell value. I changed to using $cell->value() and that worked (almost..).

        I still cannot read Chinese characters from an Excel file. I am guessing the encoding I pass to the decode() function in TextFmt() is wrong, but I do not know how to find the encoding of Excel cells. I tried 'UTF16-LE', 'UTF16-BE', and 'UTF-8' but none worked. Leaving the TextFmt function as it is posted here does not work. Anyone have an idea as to what I might be doing wrong?

        Yes. I figured out part of the problem. Apparently $cell->{Val} contains the unformatted value of a cell, whereas $cell->value() has the formatted value.

        But I am still having trouble extracting the right data. My output looks garbled. I tried converting from UTF16-LE, UTF16-BE, UTF-8, and nothing seems to work. I even tried downloading several utilities from the web with no luck converting Unicode data. The only thing that does work is if I save the file as a different format from within Excel itself, but I am looking for a batch way to do this.

        Does anyone know of a command line utility that will convert .xls files containing Unicode data into XML? Or a library that works? Thanks in advance!

        Nevermind. I determined that I should be using visual basic + the Microsoft Office Excel reference library to convert the files and then read them with Perl. Following suggestions from:
        http://stackoverflow.com/questions/174446/how-to-automate-converting-excel-xls-files-to-excel-xml-fromat