Re^4: Text::CSV encoding parse()

Hello, ok here's as short and succinct a sample as I can create.

use Text::CSV;
use CGI;

my($row) = "search/¿Cuales son las partes de una cadena de conexión??scope|ids_jdbc_011.htm|0|1|1|0";

my $csv = Text::CSV->new ({ binary => 1, sep_char => "|" });
my $q = CGI->new;

# print the HTML header and start html
print $q->header;
print $q->start_html;

 # first, print $row as is
 print $q->p("ROW: $row");

 # next, parse with $csv
 $csv->parse($row);
 my @els = $csv->fields;

 # print the first field
 # this displays the black diamond ? symbol for ¿ and ó
 print $q->p("CSV Parse, field 0:",$els[0]);
 
 # split instead
 my(@splits) = split('\|',$row);
 
 # print the first element in @splits.
 # As noted, this one displays properly in the browser.
 print $q->p("split 0:", $splits[0]);
 
print $q->end_html;
 
exit;

thanks

======================

Umm, update: when I actually ran the above on my HTTP server, I got the opposite results, but with weird errors:

ROW: search/Â¿Cuales son las partes de una cadena de conexiÃ³n??scope|ids_jdbc_011.htm|0|1|1|0

CSV Parse, field 0: search/¿Cuales son las partes de una cadena de conexión??scope

split 0: search/Â¿Cuales son las partes de una cadena de conexiÃ³n??scope

Paint me confused.

In the real script, $row is coming from a @sorted_array from an SQL query. This is getting confusing so maybe I should withdraw my question.


Re^5: Text::CSV encoding parse() by haukex (Archbishop) on Aug 14, 2019 at 19:55 UTC
I don't see any mention of any encoding in this code, which is not good. And earlier you said: "I'm using the CGI module and have it properly set: `print $q->header(-charset => 'utf-8');`" so I doubt this code is representative. You need to:

- Use a Perl version >= 5.12 and say `use feature 'unicode_strings';` or `use 5.012;` (or higher).
- If you have any non-ASCII characters in your Perl script, save it as UTF-8 and add the `use utf8;` directive at the top.
- Make sure your data is coming from the database properly encoded. As I linked to above, you can check this via Devel::Peek. If you need that output to go to the browser, see this.
- Make sure you are doing `binmode STDOUT, ':encoding(UTF-8)';` or `use open qw/:std :utf8/;`.
- Make sure you are telling your browser what encoding you are sending it.

Text::CSV is not the problem:

    use warnings;
    use strict;
    use Devel::Peek;
    use Text::CSV;

    my $str = "\N{U+20AC}|\N{U+20AC}";
    Dump($str);  # ... UTF8 "\x{20ac}|\x{20ac}" ...

    my ($s1,$s2) = split /\|/, $str;
    Dump($s1);   # ... UTF8 "\x{20ac}" ...
    Dump($s2);   # ... UTF8 "\x{20ac}" ...

    my $csv = Text::CSV->new ({ binary => 1, sep_char => "|" });
    $csv->parse($str);
    my ($c1,$c2) = $csv->fields;
    Dump($c1);   # ... UTF8 "\x{20ac}" ...
    Dump($c2);   # ... UTF8 "\x{20ac}" ...
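To illustrate the point about data arriving improperly encoded: the `Â¿` and `Ã³` in the output posted earlier in this thread are what typically shows up when UTF-8 bytes are treated as Latin-1 characters. The following is only a sketch of that effect, not a diagnosis of the actual script. The variable names are made up for illustration, and it assumes the Encode module that ships with Perl.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes that UTF-8 uses for U+00BF (inverted question mark).
my $bytes = "\xC2\xBF";

# Decoded as UTF-8, the two bytes become a single character.
my $right = decode('UTF-8', $bytes);

# Decoded as Latin-1, each byte becomes its own character:
# U+00C2 (A with circumflex) followed by U+00BF.
my $wrong = decode('ISO-8859-1', $bytes);

printf "as UTF-8:   %d character(s)\n", length $right;
printf "as Latin-1: %d character(s)\n", length $wrong;
```

If the two lengths differ for a string that came from the database, that is a hint the bytes are being interpreted with the wrong encoding somewhere along the way.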
Re^6: Text::CSV encoding parse() by slugger415 (Monk) on Aug 14, 2019 at 21:41 UTC
Hello haukex, thanks a million for your advice, but I have to confess this is way beyond my understanding or abilities in many ways, so I think I'm going to have to live with it. FWIW every time encoding problems come up I get lost in the weeds. (I'm not a developer, just a Perl hack, don't grok hexdump or Devel::Peek etc.) All I know in this case is I can print my @sorted_array rows to a flat file (opened with Notepad++) or to a web page using CGI, and those characters look fine. It's only when I use Text::CSV that something goes haywire. Anyway I appreciate your patience and help, sorry for the trouble.
Re^7: Text::CSV encoding parse() by haukex (Archbishop) on Aug 15, 2019 at 06:43 UTC
Try taking the following code and replacing `my $data` with your code that fetches the data from the database (as short as possible), and post both your code and the output here.

Update: Note that PerlMonks does not handle Unicode inside of `<code>` tags well, so tell us if you've got any Unicode in there.

    #!/usr/bin/env perl
    use warnings;
    use 5.012;
    use utf8;                # Perl script file is encoded as UTF-8
    use open qw/:utf8 :std/; # reopen STDIN/OUT/ERR as UTF-8
    use Text::CSV;
    use CGI qw/escapeHTML/;
    use CGI::Carp qw/fatalsToBrowser warningsToBrowser/; # for debug ONLY!
    use Data::Dumper;
    $Data::Dumper::Useqq=1;

    my $cgi = CGI->new;
    print $cgi->header(-charset=>'UTF-8');
    print $cgi->start_html(-title=>'Example', -encoding=>'UTF-8');
    warningsToBrowser(1);

    my $data = "Euro symbol: € | I \N{U+2764}\N{U+FE0F} \N{U+1F42A}";

    print $cgi->pre(escapeHTML(
        Dumper( $data ) # debugging
        ."UTF-8 flag is ".( utf8::is_utf8( $data )?'on':'off' ) ));

    print $cgi->p(escapeHTML( $data ));

    my $csv = Text::CSV->new ({ binary => 1, sep_char => "|" });
    $csv->parse($data);
    my ($c1,$c2) = $csv->fields;
    print $cgi->p(escapeHTML( "After Text::CSV: ".$c1." | ".$c2 ));

    print $cgi->end_html;

(Note: It is better to use `utf8::is_utf8()` only for debugging.)

The output you should see in the browser from the above:

    $VAR1 = "Euro symbol: \x{20ac} | I \x{2764}\x{fe0f} \x{1f42a}";
    UTF-8 flag is on

    Euro symbol: € | I ❤️ 🐪

    After Text::CSV: Euro symbol: € | I ❤️ 🐪
Re^8: Text::CSV encoding parse() by Tux (Canon) on Aug 15, 2019 at 07:53 UTC
Re^9: Text::CSV encoding parse() by haukex (Archbishop) on Aug 15, 2019 at 08:26 UTC
Re^8: Text::CSV encoding parse() by slugger415 (Monk) on Aug 16, 2019 at 18:35 UTC
Re^9: Text::CSV encoding parse() by haukex (Archbishop) on Aug 17, 2019 at 06:47 UTC
Some notes below your chosen depth have not been shown here
Re^7: Text::CSV encoding parse() by jcb (Parson) on Aug 14, 2019 at 23:16 UTC
I am sorry, but you will need to learn to use hex dumps. They are the only reliable way to solve this type of encoding problem, where the "bytes on the wire" are what matters. Have you tried using Wireshark to observe the actual network traffic? It offers a convenient hex dump view of each packet.
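For what it's worth, a hex dump does not need any special tool. Here is a minimal sketch in plain Perl, assuming the core Encode module; the variable names are made up for illustration. It shows the same character, U+00BF (the inverted question mark from the sample row), as bytes in two different encodings.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00BF (inverted question mark) as a Perl character string.
my $char = "\N{U+00BF}";

# Encode the character to bytes, then render those bytes as hex digits.
my $utf8_hex   = unpack 'H*', encode('UTF-8', $char);
my $latin1_hex = unpack 'H*', encode('ISO-8859-1', $char);

print "UTF-8:   $utf8_hex\n";     # prints "UTF-8:   c2bf"
print "Latin-1: $latin1_hex\n";   # prints "Latin-1: bf"
```

Running `unpack 'H*', ...` on a row as it comes back from the database would show in the same way whether the bytes for that character are `c2bf`, `bf`, or something else.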
Re^5: Text::CSV encoding parse() by choroba (Cardinal) on Aug 14, 2019 at 19:35 UTC
In what encoding have you saved the source code? The recommended practice is to use UTF-8 and tell Perl that your source code contains non-ASCII UTF-8 characters (i.e. `use utf8;`).

`map{substr$_->[0],$_->[1]\|\|0,1}[\\|\|{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`
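A small sketch of what `use utf8;` changes, assuming this file itself is saved as UTF-8 (the variable names are made up for illustration). The same literal is measured once with the pragma in effect and once without it.

```perl
use strict;
use warnings;

my ($with, $without);

{
    use utf8;    # the literal below is read as characters
    $with = length "conexión";
}
{
    no utf8;     # the literal below is read as raw bytes
    $without = length "conexión";
}

print "with use utf8:    $with\n";
print "without use utf8: $without\n";
```

With a UTF-8 source file, the first length is 8 and the second is 9, because the `ó` occupies two bytes that Perl counts separately unless it has been told the source is UTF-8.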
Re^6: Text::CSV encoding parse() by slugger415 (Monk) on Aug 14, 2019 at 20:50 UTC
Hi Choroba, my editor Notepad++ is set to UTF-8.