What to do when converting Excel-supplied data to Unicode
by davis (Vicar) on May 23, 2006 at 10:44 UTC
davis has asked for the wisdom of the Perl Monks concerning the following question:
I'm trying to extract data from Excel spreadsheets. I'm pretty sure most of the data in these spreadsheets is encoded as ISO 8859-1. Because this data is going into an XML file, and from there into a MySQL database, I'm trying to do The Right Thing and coerce the data into Unicode as early as possible.
Here's the problem:
At the moment, I'm writing character-specific code that processes each character on a case-by-case basis. For example, I'm converting Excel's rendition of the em dash to "--". Is this the best way of doing it? Should I be decoding from a different character set?
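To make the case-by-case approach concrete, here is a minimal sketch of what such code might look like. It assumes the cells arrive as raw Windows-1252 bytes, so that the em dash is the single byte 0x97; the %ascii_for table and the asciify() name are hypothetical, made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical lookup table: raw Windows-1252 bytes and the ASCII
# stand-ins chosen for them. Every new character needs a new entry.
my %ascii_for = (
    "\x97" => '--',   # em dash
    "\x96" => '-',    # en dash
    "\x93" => '"',    # left double quotation mark
    "\x94" => '"',    # right double quotation mark
);

# Replace each listed byte with its stand-in; all other bytes pass through.
sub asciify {
    my ($bytes) = @_;
    $bytes =~ s/([\x93\x94\x96\x97])/$ascii_for{$1}/g;
    return $bytes;
}

print asciify("wait\x97what"), "\n";   # prints "wait--what"
```

The weakness shows in the table itself: it only covers the characters someone has already tripped over, and it throws away the original punctuation instead of carrying it through as Unicode.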
Update: It looks like the em dash character is in Windows-1252, a superset of ISO 8859-1... perhaps that's the encoding I'm seeing... but using "cp1252" as the argument to decode doesn't seem to fix the problem.
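For reference, this is roughly what decoding from Windows-1252 looks like with the core Encode module. It is only a sketch under one assumption: that the value really is a string of raw Windows-1252 bytes, with 0x97 for the em dash. The sample string is invented; if the data is already decoded, or is in some other encoding, this will not give the right result.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: raw Windows-1252 bytes, where 0x97 is the em dash.
my $raw = "wait\x97what";

# Bytes in, Perl character string out.
my $text = decode('cp1252', $raw);

# Character string in, UTF-8 bytes out, ready to write to the XML file.
my $utf8 = encode('UTF-8', $text);

# Show the code point of the fifth character.
printf "U+%04X\n", ord(substr($text, 4, 1));   # prints "U+2014"
```

If the printed code point is something other than U+2014 for a cell known to hold an em dash, dumping the raw bytes of that cell (for example with unpack "H*") would show what is actually coming out of the spreadsheet.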
Kids, you tried your hardest, and you failed miserably. The lesson is: Never try.