Processing spreadsheet with some cells in ASCII, other cells in UTF-8

Amphiaraus has asked for the wisdom of the Perl Monks concerning the following question:

Re: Processing spreadsheet with some cells in ASCII, other cells in UTF-8 by 1nickt (Canon) on Sep 01, 2015 at 20:55 UTC
I think you are confused about encoding. It can be pretty confusing. See perlunitut, "Unicode and Strings" in Modern Perl, The Perl Unicode Cookbook ...

As you know, if you try to print a "wide" unicode character, Perl gives you a warning:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/ say /;

say "Ferrari 308 \x{1F44D}";
__END__
```

```
$ perl 1140714.pl
Wide character in say at 1140714.pl line 6.
Ferrari 308 👍
$
```

You can fix this as stevieb pointed out below, with `binmode`:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/ say /;

binmode STDOUT, ':utf8';
say "Ferrari 308 \x{1F44D}";
__END__
```

```
$ perl 1140714.pl
Ferrari 308 👍
$
```

If you want to use the unicode characters in your Perl code, you can't just expect Perl to know what they are:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/ say /;

binmode STDOUT, ':utf8';
say "Ferrari 308 👍";
__END__
```

```
$ perl 1140714.pl
Ferrari 308 ðŸ‘
```

... fix that by `use`ing `utf8`:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/ say /;

binmode STDOUT, ':utf8';
use utf8;
say "Ferrari 308 👍";
__END__
```

```
$ perl 1140714.pl
Ferrari 308 👍
$
```

If you are going to read in data that might have unicode characters, eg:

```
$ cat 1140714.txt
Lotus Élan 👍
任意のスーパーカー
$
```

... you can't expect Perl to know what you're giving it:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/ say /;

binmode STDOUT, ':utf8';
use utf8;

open my $in, '<', '1140714.txt' or die "open: $!\n";
print for (<$in>);
__END__
```

```
$ perl 1140714.pl
Lotus Ã‰lan ðŸ‘
ä»»æ„ã®ã‚¹ãƒ¼ãƒ‘ãƒ¼
```

... you can fix that by using an I/O layer in your `open`:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/ say /;

binmode STDOUT, ':utf8';
use utf8;

open my $in, '< :utf8', '1140714.txt' or die "open: $!\n";
print for (<$in>);
__END__
```

```
$ perl 1140714.pl
Lotus Élan 👍
任意のスーパーカー
```

If you print your unicode data to a filehandle you'll get the wide-character warning again:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/ say /;

binmode STDOUT, ':utf8';
use utf8;

open my $in, '< :utf8', '1140714.txt' or die "open: $!\n";
open my $out, '>', '1140714.out' or die "open: $!\n";
print $out $_ for (<$in>);
close $out or die "close: $!\n";
__END__
```

```
$ perl 1140714.pl
Wide character in print at 1140714.pl line 11, <$in> line 2.
Wide character in print at 1140714.pl line 11, <$in> line 2.
$
```

... fix it with an I/O layer:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/ say /;

binmode STDOUT, ':utf8';
use utf8;

open my $in, '< :utf8', '1140714.txt' or die "open: $!\n";
open my $out, '> :utf8', '1140714.out' or die "open: $!\n";
print $out $_ for (<$in>);
close $out or die "close: $!\n";
__END__
```

```
$ perl 1140714.pl
$ cat 1140714.out
Lotus Élan 👍
任意のスーパーカー
$
```

No encoding needed at all.

Hope this helps!

Update: Added examples and links

The way forward always starts with a minimal test.
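For readers wondering what the I/O layers in this reply do behind the scenes, here is a minimal sketch using the core `Encode` module. It is not part of the original post, and the byte string is a made-up stand-in for data read from a file with no layer applied. Roughly speaking, an input layer decodes bytes into characters and an output layer encodes characters back into bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 bytes for "Élan", as they would arrive from a filehandle
# opened without any I/O layer (É is the two bytes 0xC3 0x89).
my $bytes = "\xC3\x89lan";

# Decoding turns the bytes into a string of characters.
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# Encoding goes the other way; an output layer does this on print.
my $round_trip = encode('UTF-8', $chars);
print "round trip ok\n" if $round_trip eq "\xC3\x89lan";
```

The mojibake shown earlier in the thread is what you typically see when the undecoded bytes are treated as if each byte were one character.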
Re: Processing spreadsheet with some cells in ASCII, other cells in UTF-8 by tangent (Parson) on Sep 01, 2015 at 21:04 UTC
> These garbled strings are accompanied by the error message "Wide character in print"

How are you opening the .ini file for output - try this if not already:

```perl
open ( my $fh, ">:encoding(utf8)", "file.ini" ) or die "file.ini: $!";
```
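A small self-contained sketch of that suggestion, writing through an encoding layer and reading the result back. The temporary file stands in for the `.ini` file and the sample text is invented for illustration. It uses `:encoding(UTF-8)`, the strict variant, rather than the lax `utf8` spelling; that is a choice made for this sketch and not something the thread prescribes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file standing in for the .ini file.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# "Élan" plus U+1F44D, held as characters (not bytes).
my $text = "name=\x{C9}lan \x{1F44D}";

# Write through an encoding layer: characters in, UTF-8 bytes on disk.
open(my $out, '>:encoding(UTF-8)', $path) or die "$path: $!";
print {$out} "$text\n";
close $out or die "close: $!";

# Read back through the same layer: UTF-8 bytes on disk, characters out.
open(my $in, '<:encoding(UTF-8)', $path) or die "$path: $!";
chomp(my $read_back = <$in>);
close $in;

print "round trip ", ($read_back eq $text ? "ok" : "FAILED"), "\n";
```

With the layer in place, the `print` to the file should not raise "Wide character in print".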
Re: Processing spreadsheet with some cells in ASCII, other cells in UTF-8 by stevieb (Canon) on Sep 01, 2015 at 20:57 UTC
This is my first ever attempt at non-ascii processing, so I'll let the more experienced Monks criticize if this is the wrong approach, or if there's a better one.

After a very quick dig online, I found that setting `binmode` on all the file handles can fix the issue:

Input file:

```
$ cat in.txt
}cýæu}]…‘¦å
hello ›ÇÁ
```

Code:

```perl
use warnings;
use strict;

open my $fh, '<', 'in.txt' or die $!;
open my $wfh, '>', 'out.txt' or die $!;

binmode $fh, ":utf8";
binmode $wfh, ":utf8";
binmode STDOUT, ":utf8";

while (<$fh>){
    chomp;
    print $wfh "file: $_\n";
    print "stdout: $_\n";
}
```

Output:

```
# output file
file: }cýæu}]…‘¦å
file: hello ›ÇÁ

# stdout
stdout: }cýæu}*]…‘¦å
stdout: hello ›ÇÁ
```
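One possible alternative to calling `binmode` on every handle is the `open` pragma, which sets default layers for `open` calls in its lexical scope and, with `:std`, for the standard handles too. This is a sketch, not something proposed in the thread; the scratch file and sample text are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Default layers for every open() in this lexical scope, plus the
# standard handles (:std), instead of one binmode call per handle.
use open qw(:std :encoding(UTF-8));

# Hypothetical scratch file, used only for this illustration.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

open my $wfh, '>', $path or die $!;   # picks up :encoding(UTF-8) implicitly
print {$wfh} "hello \x{1F44D}\n";
close $wfh or die $!;

open my $fh, '<', $path or die $!;    # likewise decoded on read
chomp(my $line = <$fh>);
close $fh;

print "stdout: $line\n";              # STDOUT is covered by :std
```

The pragma is lexically scoped, so handles opened in other files or modules are not affected and still need their own layers.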
Re^2: Processing spreadsheet with some cells in ASCII, other cells in UTF-8 by Amphiaraus (Beadle) on Sep 02, 2015 at 19:31 UTC
The problem in our team's Perl script was fixed simply by adding a one-line change at the top of the script:

```perl
binmode STDOUT, ':utf8'; #This handles the multiple encoding for language menus in the Perl IO layers.
```

This one-line change fixed the problem: the Perl script can now process the contents of Excel spreadsheet cells, no matter what type of character encoding was used in them. No additional low-level function calls to decode or encode were needed.

This change was taken from one of the replies to my original question. Thanks for your help. The various web pages linked in the replies, which discussed character encoding, were also very helpful.
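A likely reason a single UTF-8 setting copes with "mixed" cells is that ASCII is a subset of UTF-8, so a pure-ASCII cell is already valid UTF-8. The sketch below illustrates this with the core `Encode` module; the cell contents are invented, and it assumes the non-ASCII cells really are UTF-8. Cells in some other encoding, such as Latin-1, would need separate handling.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical cell contents as raw bytes: one pure ASCII, one UTF-8.
my @cells = ("Ferrari 308", "Lotus \xC3\x89lan");

# One decoder handles both cells. FB_CROAK makes malformed input fatal;
# LEAVE_SRC keeps decode() from modifying the original byte strings.
my @decoded = map {
    decode('UTF-8', $_, Encode::FB_CROAK | Encode::LEAVE_SRC)
} @cells;

for my $i (0 .. $#cells) {
    printf "%2d bytes -> %2d characters\n",
        length($cells[$i]), length($decoded[$i]);
}
```

The ASCII cell comes through unchanged, while the UTF-8 cell ends up one character shorter than its byte count because the two bytes of É collapse into a single character.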

