http://www.perlmonks.org?node_id=1012713

MorayJ has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm kind of struggling with the whole encoding thing, as I have a CSV file that contains, when viewed in MS Outlook, the character(s) £ representing the £ sign.

I suppose what I don't know is what character set this is, and what character set Outlook is using.

I am trying to use the file with the following code:

    #!/usr/bin/perl
    use encoding 'utf8';
    use Text::CSV;
    use strict;
    use warnings;

    my @survey;
    my $csv = Text::CSV->new ( { binary => 1 } )
        or die "Cannot use CSV: " . Text::CSV->error_diag ();

    open my $fh, "<:encoding(utf8)", "../export.csv" or die &error_alert;
    while ( my $row = $csv->getline( $fh ) ) {
        my $feedback = $row->[1];
        push @survey, $feedback;
    }
    close $fh;

    open (my $output, ">:encoding(utf8)", "../surveydata.csv") or die &error_alert;
    shift @survey;
    foreach my $item (@survey) {
        my @data = split /\t/, $item;
        my $lastitem = pop @data;
        chomp $lastitem;
        $lastitem =~ s/"//g;
        foreach my $col (@data) {
            $col =~ s/"//g;
            print $output "\"$col\"\,";
        }
        print $output "\"$lastitem\"\n";
    };
    close $output;

Am I right in thinking that I need to find the correct character set for the import? That is, if the CSV file contains the odd character, it is probably not UTF-8 and needs to be read in with the right character set.

Thanks for looking at this

MorayJ
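
One way to test the guess in the question is to check whether the file's raw bytes decode as strict UTF-8. What follows is a minimal sketch, not something posted in the thread: looks_like_utf8 is a hypothetical helper name, the sample byte strings are made up for illustration, and the only dependency is the core Encode module.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $bytes is well-formed UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying the string it is checking.
    my $ok = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# "\xC2\xA3" is the UTF-8 encoding of the pound sign; a lone "\xA3" is how
# single-byte encodings such as Latin-1 or Windows-1252 store it.
for my $sample ("\xC2\xA3" . "5", "\xA3" . "5") {
    printf "%-10s %s\n",
        join(' ', map { sprintf '%02X', ord } split //, $sample),
        looks_like_utf8($sample) ? 'valid UTF-8' : 'not UTF-8';
}
```

If the real file fails this kind of check, opening it with "<:encoding(utf8)" is probably the wrong choice and a single-byte encoding is worth trying instead.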

Re: Using encoding
by Kenosis (Priest) on Jan 10, 2013 at 18:31 UTC

    Add binmode $output; right after opening the CSV file for writing. For example:

        use strict;
        use warnings;

        my $string = '£';
        open my $output, '>:encoding(utf8)', 'out.txt' or die $!;
        binmode $output;
        print $output $string;
        close $output;

    Contents of out.txt:

    £

    Contents of out.txt without binmode $output;:

    Â£
      It's more correct to write:
          use strict;
          use warnings;
          use utf8;

          my $string = '£';
          open my $output, '>:encoding(utf8)', 'out.txt' or die $!;
          #binmode $output, ":encoding(utf8)";
          print $output $string;
          close $output;
      (with or without 'binmode $output, ":encoding(utf8)"' but without 'binmode $output;')
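
The difference between the two variants can be reproduced without writing any files. This is a hedged sketch rather than anything posted in the thread: it assumes the script was saved as UTF-8, so that a literal '£' in the source is the two bytes C2 A3, and it uses Encode's encode() as a stand-in for what the :encoding(utf8) output layer does. hex_bytes is a hypothetical helper.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: space-separated hex dump of a byte string.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# Without "use utf8", perl sees the literal as two separate bytes.
my $as_bytes = "\xC2\xA3";
# With "use utf8", perl decodes the literal into one character, U+00A3.
my $as_char = decode('UTF-8', $as_bytes);

# Encoding the byte string treats each byte as its own character,
# so the pound sign ends up encoded twice.
my $double = encode('UTF-8', $as_bytes);
my $single = encode('UTF-8', $as_char);

print "without use utf8: ", hex_bytes($double), "\n";   # C3 82 C2 A3
print "with use utf8:    ", hex_bytes($single), "\n";   # C2 A3
```

A bare binmode $output appears to work in the first example only because it drops the encoding layer, so the two source bytes are written out untouched.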
Re: Using encoding
by MorayJ (Beadle) on Jan 14, 2013 at 00:01 UTC

    Hi, thanks for the answers. I'm actually thinking that I got the question wrong now though.

    I think the input is text. The original input came with a £ sign, but this is now ASCII text, possibly extended. But I can't work out what the '£' sign has been translated into.

    It's appearing as ¶œ in Notepad.

    I have tried:

        my $character = ord("¶œ");
        $lastitem =~ s/$character/Pounds/g;

    This still isn't getting it. I think I must be approaching this totally wrong. The text seems to be consistently representing the pound symbol with a character, or a number of characters, and I don't know how to isolate that.

    What tools should I be looking at?

    Thanks for your help - sorry for making a meal out of the question.

      ord only looks at the first character of a string, so $character gets assigned 194. You are then replacing "194" (as a string) with "Pounds", which is not what you want.
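
To make that concrete, here is a small sketch (not from the thread). It assumes the script has no "use utf8", so the pasted characters reach perl as the raw UTF-8 bytes C2 B6 C5 93, which are spelled out explicitly below; the sample text is invented.

```perl
use strict;
use warnings;

# Assumption: without "use utf8", perl sees the pasted characters as the
# raw UTF-8 bytes C2 B6 C5 93, written out explicitly here.
my $funny = "\xC2\xB6\xC5\x93";
my $text  = "Cost: " . $funny . "25";

# ord() reports only the first element of the string.
my $first = ord($funny);                        # 194, i.e. 0xC2

# The number interpolates as the digits "194", which the text lacks.
(my $wrong = $text) =~ s/$first/Pounds/g;       # leaves the text unchanged

# Matching the sequence itself (quoted with \Q...\E) does the replacement.
(my $right = $text) =~ s/\Q$funny\E/Pounds /g;  # "Cost: Pounds 25"

print "ord: $first\n";
print "fixed: $right\n";
```

Matching the byte sequence directly works, but it only helps once you know which bytes are actually in the file.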
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      In Windows, the translation is from Unicode (UTF-16) to ANSI, according to your system's "Language for non-Unicode programs" setting. So the pound sign will be converted to bytes according to that code page.
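
As a rough illustration of that conversion going the other way, the sketch below decodes ANSI bytes and re-encodes them as UTF-8 with the core Encode module. It is not from the thread: to_utf8 is a hypothetical helper, and Windows-1252 (cp1252) is only an assumption about the code page, since the real one depends on the system setting mentioned above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: re-encode a byte string from $from to UTF-8.
sub to_utf8 {
    my ($bytes, $from) = @_;
    return encode('UTF-8', decode($from, $bytes));
}

# Assumption: the data was written in Windows-1252, where the pound sign
# is the single byte A3.
my $ansi_line = "Total: \xA3" . "25";
my $utf8_line = to_utf8($ansi_line, 'cp1252');

# Show the bytes before and after the conversion.
for my $line ($ansi_line, $utf8_line) {
    print join(' ', map { sprintf '%02X', ord } split //, $line), "\n";
}
```

For a whole file, the same effect can be had by opening the input with "<:encoding(cp1252)" and the output with ">:encoding(UTF-8)".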

        OK, I think that makes sense. So ord is not what I'm after.

        What's the best way to find 'funny' characters in a text file, and to translate them into meaningful characters in a text/unicode file?

        I'm assuming that it's me that's making this difficult and it's probably quite straightforward.
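
One possible starting point for finding the 'funny' characters is to read the file as raw bytes and list every run of non-ASCII bytes in hex, together with its line number. This is only a sketch under stated assumptions: find_non_ascii is a hypothetical helper, and the demonstration runs on an in-memory sample rather than the real export.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [line_number, hex_string] pairs, one for
# each run of non-ASCII bytes read from the given filehandle.
sub find_non_ascii {
    my ($fh) = @_;
    my @hits;
    while (my $line = <$fh>) {
        my $line_no = $.;
        while ($line =~ /([^\x00-\x7F]+)/g) {
            my $run = $1;
            my $hex = join ' ', map { sprintf '%02X', ord } split //, $run;
            push @hits, [ $line_no, $hex ];
        }
    }
    return @hits;
}

# Demonstration on an in-memory "file". For a real file one would use
#   open my $fh, '<:raw', '../export.csv' or die $!;
my $sample = "item,price\nwidget,\xC2\xA3" . "25\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
printf "line %d: %s\n", @$_ for find_non_ascii($fh);   # line 2: C2 A3
close $fh;
```

The hex values point at the encoding: C2 A3 is a pound sign in UTF-8, a lone A3 is a pound sign in Latin-1 or Windows-1252, and C3 82 C2 A3 is a pound sign that has been UTF-8 encoded twice.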