I've had no dealings with character encoding previously, and am running
into a bit of complexity. This may be a bit long, so let me summarize first.
SUMMARY: There are multiple steps the data pass through where encoding may come
into play. The input step differs between production and testing, but all
further steps are the same. Production will import data from an MSSQL database,
whereas testing will input data from Excel spreadsheets. For now, I am
focused on the testing process. (Yes, it would be better
to also import the test data from MSSQL, but we are constrained.)
OUTLINE of the testing source-to-destination steps:
- Input data using Excel spreadsheets (.xls)
- Store data in SQLite Database
- Query SQLite and process the data
- Write output to csv files
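For reference, here is roughly what I understand the last step (writing the CSV) should look like, boiled down to a self-contained snippet. The file name and sample values are made up; the point is the ':encoding(UTF-8)' layer on the output handle, which encodes Perl's internal character strings to UTF-8 bytes on the way out:

```perl
use strict;
use warnings;

# Simplified sketch of the CSV-writing step (file name and sample
# data are made up). The ':encoding(UTF-8)' layer encodes Perl's
# internal characters to UTF-8 bytes as they are written, so the
# file itself ends up as valid UTF-8.
my @row = ( "Toxv\x{E6}rd", "K\x{F8}benhavn" );    # U+00E6 ae, U+00F8 o-slash

open my $out, '>:encoding(UTF-8)', 'people_out.csv'
    or die "Cannot open people_out.csv: $!";
print {$out} join( ',', @row ), "\n";
close $out or die "Cannot close people_out.csv: $!";
```

This assumes the data are already proper Perl character strings by this point; if they are still raw bytes, the layer will encode the wrong thing.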
The input spreadsheet I have contains user data where the names and addresses
come from some European countries (I do not know which, nor which languages
are in play). I am running into garbled characters from the very first step,
and I am not sure how to inspect the data at each step to compare it along the way.
I believe Excel stores data using cp1252, so I have tried to convert the data
to UTF-8. The code below gives the error "Cannot decode string with wide characters",
which makes me think there may be mixed-encoded data in my input; that is, maybe some
columns are in a different encoding than other columns.
package DB::SQLite::Tables;
use warnings;
use strict;
use Carp;
use Encode;
use Spreadsheet::ParseExcel::Simple;

sub load_test_people_tbl {
    my ( $the, $path, $readfile, $sqlite_schema, $do_load ) = @_;

    my $data_obj
        = Spreadsheet::ParseExcel::Simple->read("$path/$readfile");

    foreach my $sheet ( $data_obj->sheets ) {
        my @fields;
        my $row_count_ppl = 0;
      ROW:
        while ( $sheet->has_data ) {
            my @row = $sheet->next_row;
            $row_count_ppl++;

            # First row is the header from the input file; normalize
            # the column names (whitespace and hyphens to underscores)
            if ( $row_count_ppl == 1 ) {
                @fields = @row;
                s{[\s-]+}{_}g for @fields;
            }

            # Gives error "Cannot decode string with wide characters"
            $_ = decode( 'utf8', $_ ) for @row;

            # Uses DBIx::Class
            if ( $do_load == 1 ) {
                $sqlite_schema->populate( 'Person', [ \@fields, \@row ] );
            }
        }    # end ROW
    }
    return;
}    # end load_test_people_tbl
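From what I have read so far, "Cannot decode string with wide characters" means the string already contains characters above 0xFF, i.e. it was decoded somewhere upstream, and decoding it a second time is the mistake. Here is a small self-contained guard I am experimenting with (the helper name `maybe_decode` is my own, not from any module):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Guard against double-decoding (maybe_decode is my own name).
# A string containing any character above 0xFF is already decoded
# text, so pass it through; otherwise treat it as bytes in the
# given encoding and decode it.
sub maybe_decode {
    my ( $cell, $enc ) = @_;
    return $cell if $cell =~ /[^\x00-\xFF]/;    # wide chars: already decoded
    return decode( $enc, $cell );
}

my $bytes   = "Caf\xC3\xA9";                  # raw UTF-8 bytes, 5 "characters"
my $decoded = maybe_decode( $bytes, 'UTF-8' );# now 4 characters, Caf + e-acute
```

I do not know yet whether this is the idiomatic fix, but it at least avoids the fatal error while I track down where the first decode happens.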
Example input/output character mangling:
Some name in input xls spreadsheet: Toxv‘rd
later....
Same name in output csv spreadsheet: Toxv‘rd
QUESTIONS:
- Did I give enough information to chew on? If not, what else do I need to post?
- Am I on the right track, that I should decode the input spreadsheet data to UTF-8?
- Is there a way to determine whether the "mixed-encoding" theory is correct? That is,
how do I know whether the data elements in any given row share the same encoding across
all columns? (Or maybe I am not decoding the data in the right way?)
- When I go into sqlite3's shell and query, I see garbled characters even when the
data are correct. Is there a way to display them correctly in a DOS window?
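On the mixed-encoding question, one heuristic I have seen suggested (so treat this as a sketch, not a definitive answer): bytes that decode cleanly under strict UTF-8 checking are almost certainly UTF-8, because valid multi-byte sequences rarely occur by accident in cp1252 text; anything that fails the strict decode can be tried as cp1252 instead. Pure-ASCII cells pass either way, so those cannot be distinguished:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Per-cell classification sketch (my own helper name). If the bytes
# decode strictly as UTF-8, call them UTF-8; otherwise fall back to
# cp1252, which maps every possible byte and therefore never fails.
# Note: pure ASCII is reported as UTF-8, since it is valid in both.
sub guess_cell_encoding {
    my ($bytes) = @_;
    my $ok = eval { decode( 'UTF-8', $bytes, Encode::FB_CROAK ); 1 };
    return $ok ? 'UTF-8' : 'cp1252';
}
```

As for the sqlite3 display question, I have read that `chcp 65001` switches the Windows console code page to UTF-8, which (with a console font that has the glyphs, e.g. Lucida Console) may fix the display; I have not verified this myself.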
/dennis