http://www.perlmonks.org?node_id=11104403

slugger415 has asked for the wisdom of the Perl Monks concerning the following question:

Hello esteemed monks, I am using Text::CSV to parse an array of text strings (pipe delimited) and want to use UTF-8 encoding to read the strings. In the doc at https://metacpan.org/pod/Text::CSV#new I see this instruction:

On parsing (both for "getline" and "parse"), if the source is marked being UTF8, then all fields that are marked binary will also be marked UTF8.

I have set my 'new' instance to binary, and it mostly works, except some accented characters are showing up in the resulting web page as black diamond question marks, e.g. conexi�n. (Japanese and other language characters look fine.) Is there something else I need to set? If I don't use Text::CSV and just 'split' the strings, those characters look fine, and correct.

my $csv = Text::CSV->new ({ binary => 1, sep_char => "|" });
foreach my $row (@sorted_urls){
    $csv->parse($row);
    # processing
}

Thank you.

Replies are listed 'Best First'.
Re: Text::CSV encoding parse()
by haukex (Archbishop) on Aug 13, 2019 at 18:05 UTC
    some accented characters are showing up in the resulting web page as black diamond question marks

    Are you sure you've also set your output filehandles to the correct encoding, and have specified that encoding in the HTML? Please provide a Short, Self-Contained, Correct Example.

    To debug the input end of the process, see my suggestions at Re: Parsing Problems (updated).
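
    To make the question about output filehandles concrete, here is a minimal sketch of one way to keep the encoding consistent end to end: decode the raw bytes into Perl characters before calling parse(), and let an :encoding(UTF-8) layer on STDOUT turn characters back into bytes. It assumes the rows arrive as undecoded UTF-8 bytes, which the original post does not confirm. The helper name parse_utf8_row and the sample row are hypothetical.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);
    use Text::CSV;

    # Encode every character printed to STDOUT as UTF-8 bytes.
    binmode STDOUT, ':encoding(UTF-8)';

    my $csv = Text::CSV->new({ binary => 1, sep_char => '|' })
        or die 'Cannot use Text::CSV: ' . Text::CSV->error_diag;

    # parse_utf8_row is a hypothetical helper: decode the raw bytes first,
    # then parse, so every returned field is a Perl character string.
    sub parse_utf8_row {
        my ($parser, $bytes) = @_;
        my $line = decode('UTF-8', $bytes);
        $parser->parse($line) or die 'Parse failed: ' . $parser->error_diag;
        return $parser->fields;
    }

    # Hypothetical sample row given as raw UTF-8 bytes.
    my @fields = parse_utf8_row($csv, "conexi\xC3\xB3n|espa\xC3\xB1ol");
    print join(', ', @fields), "\n";
    ```

    If the strings are already decoded where they are read (for example by opening the file with an :encoding(UTF-8) layer), the decode() call should be dropped, since decoding twice would be a bug of its own.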

      Hi, yes I'm using the CGI module and have it properly set:

      print $q->header(-charset    => 'utf-8');

      And as mentioned, if I don't use Text::CSV the characters display correctly.

        Hi, yes I'm using the CGI module and have it properly set: print $q->header(-charset => 'utf-8'); And as mentioned, if I don't use Text::CSV the characters display correctly.

        Ok, but I'm sorry, there still isn't enough information to answer your question - have another look at my reply above, plus the links therein.

        That means that you are declaring to the browser that your output is UTF-8. Is it actually UTF-8?
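
        One way to check whether the bytes really are UTF-8 is a strict decode with the core Encode module, which croaks on malformed input when given FB_CROAK. This is only a sketch: is_valid_utf8 is a hypothetical helper name, and the two sample strings are made up to show the word from the question in both encodings.

        ```perl
        use strict;
        use warnings;
        use Encode qw(decode FB_CROAK LEAVE_SRC);

        # is_valid_utf8 is a hypothetical helper, not part of Encode.
        # Returns 1 if the byte string is well-formed UTF-8, 0 otherwise.
        sub is_valid_utf8 {
            my ($bytes) = @_;
            # FB_CROAK makes decode() die on malformed bytes; eval traps that.
            return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
        }

        # "conexión" as UTF-8 bytes (0xC3 0xB3) and as Latin-1 bytes (0xF3).
        print is_valid_utf8("conexi\xC3\xB3n") ? "valid\n" : "invalid\n";
        print is_valid_utf8("conexi\xF3n")     ? "valid\n" : "invalid\n";
        ```

        The first string passes and the second fails. A lone 0xF3 byte in a page declared as utf-8 is the kind of malformed sequence a browser renders as the black diamond question mark.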