PerlMonks  

Processing spreadsheet with some cells in ASCII, other cells in UTF-8

by Amphiaraus (Beadle)
on Sep 01, 2015 at 20:43 UTC

Amphiaraus has asked for the wisdom of the Perl Monks concerning the following question:

I am writing a Perl program which takes input from a column in an Excel spreadsheet listing names of European cars. This column is named "MenuList". The names of European cars often contain letters with symbols above them that are not seen in English, and which must be encoded in UTF-8 (Examples: , , ). Other cells in the same column lack these foreign symbols and are encoded in ASCII. I am writing out the contents of this column to an *.ini file.

I am finding that with the code shown below, the car names in ASCII are being written out to the *.ini file without problems, but the car names with foreign symbols - encoded in UTF-8 - are garbled in the *.ini file. These garbled strings are accompanied by the error message "Wide character in print"

Is there a way to read input from an Excel spreadsheet with mixed encoding (some cells in UTF-8, other cells in ASCII) and write the contents of these cells to an *.ini file, with no garbled output in the *.ini file from the cells that contained UTF-8 encoding?

chomp $menuList[$i];
$decodedMenuList = decode("utf8", $menuList[$i]);
$cfg_baselicense->newval($partNumList[$i], "MenuSelection", $decodedMenuList);
$cfg_baselicense->RewriteConfig();

Replies are listed 'Best First'.
Re: Processing spreadsheet with some cells in ASCII, other cells in UTF-8
by 1nickt (Canon) on Sep 01, 2015 at 20:55 UTC

    I think you are confused about encoding. It can be pretty confusing. See perlunitut, "Unicode and Strings" in Modern Perl, The Perl Unicode Cookbook ...

    As you know, if you try to print a "wide" unicode character, Perl gives you a warning:


    
    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature qw/ say /;
    
    say "Ferrari 308 \x{1F44D}";
    
    __END__
    

    
    $ perl 1140714.pl
    Wide character in say at 1140714.pl line 6.
    Ferrari 308 👍
    $
    


    You can fix this as stevieb pointed out below, with binmode:
    
    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature qw/ say /;
    binmode STDOUT, ':utf8';
    
    say "Ferrari 308 \x{1F44D}";
    
    __END__
    

    $ perl 1140714.pl
    Ferrari 308 👍
    $
    

    If you want to use the unicode characters in your Perl code, you can't just expect Perl to know what they are:
    
    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature qw/ say /;
    binmode STDOUT, ':utf8';
    
    say "Ferrari 308 👍";
    
    __END__
    

    
    $ perl 1140714.pl
    Ferrari 308 Ÿ‘
    

    ... fix that with use utf8, which tells Perl that the source file itself is encoded in UTF-8:
    
    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature qw/ say /;
    binmode STDOUT, ':utf8';
    use utf8;
    
    say "Ferrari 308 👍";
    
    __END__
    

    
    $ perl 1140714.pl
    Ferrari 308 👍
    $
    

    If you are going to read in data that might have unicode characters, eg:
    
    $ cat 1140714.txt
    Lotus Élan 👍
    任意のスーパーカー
    $
    

    ... you can't expect Perl to know what you're giving it:
    
    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature qw/ say /;
    binmode STDOUT, ':utf8';
    use utf8;
    
    open my $in, '<', '1140714.txt' or die "open: $!\n";
    
    print for (<$in>);
    
    __END__
    

    
    $ perl 1140714.pl
    Lotus ‰lan Ÿ‘
    任„の‚ƒƒ‘ƒ
    

    . . . you can fix that by using an I/O layer in your open:
    
    $ cat 1140714.pl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature qw/ say /;
    binmode STDOUT, ':utf8';
    use utf8;
    
    open my $in, '< :utf8', '1140714.txt' or die "open: $!\n";
    
    print for (<$in>);
    
    __END__
    

    
    $ perl 1140714.pl
    Lotus Élan 👍
    任意のスーパーカー
    

    If you print your unicode data to a filehandle you'll get the wide-character warning again:
    
    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature qw/ say /;
    binmode STDOUT, ':utf8';
    use utf8;
    
    open my $in,  '< :utf8', '1140714.txt' or die "open: $!\n";
    open my $out, '>',       '1140714.out' or die "open: $!\n";
    
    print $out $_ for (<$in>);
    
    close $out or die "close: $!\n";
    
    __END__
    

    
    $ perl 1140714.pl
    Wide character in print at 1140714.pl line 11, <$in> line 2.
    Wide character in print at 1140714.pl line 11, <$in> line 2.
    $
    

    . . . fix it with an I/O layer:
    
    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature qw/ say /;
    binmode STDOUT, ':utf8';
    use utf8;
    
    open my $in,  '< :utf8', '1140714.txt' or die "open: $!\n";
    open my $out, '> :utf8', '1140714.out' or die "open: $!\n";
    
    print $out $_ for (<$in>);
    
    close $out or die "close: $!\n";
    
    __END__
    

    
    $ perl 1140714.pl
    $ cat 1140714.out
    Lotus Élan 👍
    任意のスーパーカー
    $
    

    No explicit encode() or decode() calls are needed at all; the I/O layers do the conversion.
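
    For comparison, here is a minimal sketch of what the layers do on your behalf, written out by hand with the core Encode module. The byte string is an illustrative assumption (the UTF-8 bytes for "Élan"), not data from the original spreadsheet.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for "Élan", as they would arrive from a handle with no layer.
my $bytes = "\xC3\x89lan";

# decode() turns bytes into Perl characters; an input layer does this for you.
my $chars = decode('UTF-8', $bytes);

# encode() turns characters back into bytes; an output layer does this for you.
my $round_trip = encode('UTF-8', $chars);

printf "bytes: %d, characters: %d\n", length $bytes, length $chars;
print "round trip matches\n" if $round_trip eq $bytes;
```

    The decoded string is one unit shorter than the byte string because the two bytes of "É" become a single character. Printing $chars to a handle without a layer is what triggers "Wide character" warnings once a character above 255 is involved.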

    Hope this helps!

    Update: Added examples and links

    The way forward always starts with a minimal test.
Re: Processing spreadsheet with some cells in ASCII, other cells in UTF-8
by tangent (Parson) on Sep 01, 2015 at 21:04 UTC
    These garbled strings are accompanied by the error message "Wide character in print"
    How are you opening the .ini file for output - try this if not already:
    open ( my $fh, ">:encoding(utf8)", "file.ini" ) or die "file.ini: $!";
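
    A minimal sketch of that idea applied to an .ini-style file, assuming the file is written through a plain filehandle. The part numbers, car names, and file name are hypothetical stand-ins for the spreadsheet data, and the strict encoding name UTF-8 is used rather than utf8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the string literals below contain non-ASCII characters

# Hypothetical data standing in for the part numbers and MenuList cells.
my %menu_for = (
    'PN-001' => 'Citroën DS',
    'PN-002' => 'Škoda Octavia',
    'PN-003' => 'Ferrari 308',
);

my $ini_file = 'menu_sketch.ini';    # hypothetical output path

# The encoding layer converts characters to UTF-8 bytes on the way out.
open my $out, '>:encoding(UTF-8)', $ini_file or die "$ini_file: $!";
for my $part (sort keys %menu_for) {
    print {$out} "[$part]\nMenuSelection=$menu_for{$part}\n\n";
}
close $out or die "close $ini_file: $!";

# Read the file back through the same layer to confirm nothing was garbled.
open my $in, '<:encoding(UTF-8)', $ini_file or die "$ini_file: $!";
my $content = do { local $/; <$in> };
close $in or die "close $ini_file: $!";

print "ASCII and non-ASCII names both survived\n"
    if $content =~ /Citroën DS/ && $content =~ /Ferrari 308/;
```

    ASCII-only names pass through the layer unchanged, so the mixed column needs no per-cell special casing.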
Re: Processing spreadsheet with some cells in ASCII, other cells in UTF-8
by stevieb (Canon) on Sep 01, 2015 at 20:57 UTC

    This is my first ever attempt at non-ascii processing, so I'll let the more experienced Monks criticize if this is the wrong approach, or if there's a better one.

    After a very quick dig online, I found that setting binmode on all the file handles can fix the issue:

    Input file:

    $ cat in.txt
    }cu}*]
    hello ›

    Code:

    use warnings;
    use strict;

    open my $fh, '<', 'in.txt' or die $!;
    open my $wfh, '>', 'out.txt' or die $!;

    binmode $fh, ":utf8";
    binmode $wfh, ":utf8";
    binmode STDOUT, ":utf8";

    while (<$fh>){
        chomp;
        print $wfh "file: $_\n";
        print "stdout: $_\n";
    }

    Output:

    # output file
    file: }cu}*]
    file: hello ›

    # stdout
    stdout: }cu}*]
    stdout: hello ›

      The problem in our team's Perl script was fixed simply by adding a one-line change at the top of the script:

      binmode STDOUT, ':utf8'; #This handles the multiple encoding for language menus in the Perl IO layers.

      That one-line change fixed the problem: the Perl script can now process the contents of the Excel spreadsheet cells, no matter which character encoding was used in them.

      No additional low-level function calls, to decode or encode, were needed.

      This change was taken from one of the replies to my original question. Thanks for your help.

      The various web pages found in the replies, which discussed character encoding, were also very helpful.
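
      Note that binmode STDOUT covers only standard output; a file written through its own handle needs its own layer. As a hedged sketch of one way to set UTF-8 for the standard streams and for newly opened handles in one place, the open pragma can be used. The file name and car name below are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # source literals below are UTF-8
use open qw(:std :encoding(UTF-8));    # default layer for STDIN/STDOUT/STDERR and new handles

my $file = 'open_pragma_sketch.txt';   # hypothetical scratch file
my $name = 'Škoda';                    # 5 characters, 6 bytes in UTF-8

# No layer is named here; the open pragma supplies :encoding(UTF-8).
open my $out, '>', $file or die "$file: $!";
print {$out} $name;
close $out or die "close $file: $!";

my $size = -s $file;                   # size on disk, in bytes
print "wrote $name (", length($name), " characters, $size bytes)\n";
```

      The pragma is lexically scoped, so it applies only to open calls in the file or block where it appears, and an open that names its own layer is not overridden.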
