PerlMonks  

Re^7: Text::CSV encoding parse()

by haukex (Chancellor)
on Aug 15, 2019 at 06:43 UTC ( #11104502=note )


in reply to Re^6: Text::CSV encoding parse()
in thread Text::CSV encoding parse()

Try taking the following code and replacing my $data with your code that fetches the data from the database (as short as possible), and post both your code and the output here. <update> Note that PerlMonks does not handle Unicode inside of <code> tags well, so tell us if you've got any Unicode in there. </update>

#!/usr/bin/env perl
use warnings;
use 5.012;
use utf8; # Perl script file is encoded as UTF-8
use open qw/:utf8 :std/; # reopen STDIN/OUT/ERR as UTF-8
use Text::CSV;
use CGI qw/escapeHTML/;
use CGI::Carp qw/fatalsToBrowser warningsToBrowser/; # for debug ONLY!
use Data::Dumper;
$Data::Dumper::Useqq=1;

my $cgi = CGI->new;
print $cgi->header(-charset=>'UTF-8');
print $cgi->start_html(-title=>'Example', -encoding=>'UTF-8');
warningsToBrowser(1);

my $data = "Euro symbol: € | I \N{U+2764}\N{U+FE0F} \N{U+1F42A}";

print $cgi->pre(escapeHTML( Dumper( $data ) # debugging
	."UTF-8 flag is ".( utf8::is_utf8( $data )?'on':'off' ) ));

print $cgi->p(escapeHTML( $data ));

my $csv = Text::CSV->new ({ binary => 1, sep_char => "|" });
$csv->parse($data);
my ($c1,$c2) = $csv->fields;
print $cgi->p(escapeHTML( "After Text::CSV: ".$c1." | ".$c2 ));

print $cgi->end_html;

(Note: It is better to use utf8::is_utf8() only for debugging.) The output you should see in the browser from the above:

$VAR1 = "Euro symbol: \x{20ac} | I \x{2764}\x{fe0f} \x{1f42a}";
UTF-8 flag is on

Euro symbol: € | I ❤️ 🐪

After Text::CSV: Euro symbol: € | I ❤️ 🐪
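
As a side note on why utf8::is_utf8() is best kept for debugging, here is a minimal sketch (the variable names are made up for illustration). It assumes nothing beyond core Perl and shows that the flag only describes how perl stores a string internally, not which characters it contains:

```perl
use strict;
use warnings;

my $native   = "conexi\xf3n";   # "conexión"; stored internally as one byte per character
my $upgraded = $native;
utf8::upgrade($upgraded);       # same characters, internal storage switched to UTF-8

printf "flag: %s vs %s\n",
    (utf8::is_utf8($native)   ? 'on' : 'off'),
    (utf8::is_utf8($upgraded) ? 'on' : 'off');
printf "strings are %s\n", $native eq $upgraded ? 'equal' : 'different';
```

The two strings compare equal even though their flags differ, so program logic should not branch on the flag.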

Replies are listed 'Best First'.
Re^8: Text::CSV encoding parse()
by Tux (Abbot) on Aug 15, 2019 at 07:53 UTC

    Using the perl internals through Data::Peek's DPeek, you can see in both versions whether UTF-8 is in effect, without the fragile use of utf8 function calls. It also shows the importance of use utf8 in your example code.

    $ perl -MData::Peek -wE'my $data = "Euro symbol: € | I \N{U+2764}\N{U+FE0F} \N{U+1F42A}"; DPeek $data'
    PV("Euro symbol: \303\242\302\202\302\254 | I \342\235\244\357\270\217 \360\237\220\252"\0) [UTF8 "Euro symbol: \x{e2}\x{82}\x{ac} | I \x{2764}\x{fe0f} \x{1f42a}"]
    $ perl -Mutf8 -MData::Peek -wE'my $data = "Euro symbol: € | I \N{U+2764}\N{U+FE0F} \N{U+1F42A}"; DPeek $data'
    PV("Euro symbol: \342\202\254 | I \342\235\244\357\270\217 \360\237\220\252"\0) [UTF8 "Euro symbol: \x{20ac} | I \x{2764}\x{fe0f} \x{1f42a}"]
    $ perl -Mutf8 -MData::Peek -wE'my $data = "Euro symbol: \xe2\x82\xac | I \xe2\x9d\xa4\xef\xb8\x8f \xf0\x9f\x90\xaa"; DPeek $data'
    PV("Euro symbol: \342\202\254 | I \342\235\244\357\270\217 \360\237\220\252"\0)

    Enjoy, Have FUN! H.Merijn
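
For readers without Data::Peek installed, the effect of use utf8 on the Euro sign can be approximated with the core Encode module. This is only a sketch under that assumption: it spells out the three source bytes by hand instead of relying on how the script file itself is encoded.

```perl
use strict;
use warnings;
use Encode qw/decode/;

# The three bytes a UTF-8 encoded source file contains for the Euro sign.
# Without "use utf8", perl treats a literal € as these three separate characters.
my $bytes = "\xe2\x82\xac";

# Decoding them yields the single character U+20AC, which is what
# a literal € means under "use utf8".
my $chars = decode('UTF-8', $bytes);

printf "without decoding: length %d\n", length $bytes;
printf "after decoding:   length %d, U+%04X\n", length $chars, ord $chars;
```

This mirrors the difference between the first two DPeek outputs above: \x{e2}\x{82}\x{ac} versus \x{20ac}.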
      without the fragile use of utf8 function calls

      Although you're certainly right that several of the functions from utf8:: should be used with extreme caution (or not at all), AFAIK using is_utf8 to check on the flag for debugging (only!) should be fine. I was just trying to provide a slightly "nicer" debugging output because slugger415 said "I'm not a developer, just a Perl hack, don't grok hexdump or Devel::Peek etc.".

      slugger415: I just wanted to add that I wasn't necessarily suggesting you should try to understand the output, it's also very useful information for us to help you debug.

Re^8: Text::CSV encoding parse()
by slugger415 (Monk) on Aug 16, 2019 at 18:35 UTC

    Hello, thank you for the test code. Yes I see the same in the browser with your sample code. I've substituted my own string for yours and here's what I see now in the browser:

    $VAR1 = "/search/\x{bf}Cuales son las partes de una cadena de conexi\x{f3}n??scope=SSGU8G_12.1.0|/com.ibm.jdbc_pg.doc/ids_jdbc_011.htm|0|1|1|0\n";
    UTF-8 flag is on

    /search/¿Cuales son las partes de una cadena de conexión??scope=SSGU8G_12.1.0|/com.ibm.jdbc_pg.doc/ids_jdbc_011.htm|0|1|1|0

    After Text::CSV: /search/¿Cuales son las partes de una cadena de conexión??scope=SSGU8G_12.1.0 | /com.ibm.jdbc_pg.doc/ids_jdbc_011.htm

    BTW I'm assigning $data by reading a stdout output file that is returned from the db via the tool I'm using for the SQL query.

    my($data);
    # "or" instead of "||": "$file || die" would only test $file, not the result of open
    open my $fh, "<:encoding(utf8)", $file or die("cannot open $file file\n");
    while(<$fh>){
        if($_ =~ /Cuales/){
            $data = $_;
            print $_;
        }
    }
    close($fh);

    I've also tried opening the file without the :encoding and get the same result.

    open(R, "$resultsFile") || die("cannot open results file $resultsFile for reading.\n");

    Hope this is helpful.

      I've substituted my own string for yours and here's what I see now in the browser:
      $VAR1 = "/search/\x{bf}Cuales son las partes de una cadena de conexi\x{f3}n??scope=SSGU8G_12.1.0|/com.ibm.jdbc_pg.doc/ids_jdbc_011.htm|0|1|1|0\n";
      UTF-8 flag is on
      
      /search/¿Cuales son las partes de una cadena de conexión??scope=SSGU8G_12.1.0|/com.ibm.jdbc_pg.doc/ids_jdbc_011.htm|0|1|1|0
      
      After Text::CSV: /search/¿Cuales son las partes de una cadena de conexión??scope=SSGU8G_12.1.0 | /com.ibm.jdbc_pg.doc/ids_jdbc_011.htm
      

      \x{bf} or U+00BF is the inverted question mark and U+00F3 is “ó”, so AFAICT that looks correct, doesn't it?

      BTW I'm assigning $data by reading a stdout output file that is returned from the db via the tool I'm using for the SQL query.

      Ok, so do I understand correctly that you've tested the code you showed within the script I posted?

      (BTW, why not use DBI to connect directly to the database instead of going through a file?)

      I've also tried opening the file without the :encoding and get the same result.

      Well, if you've got the use open qw/:std :utf8/; that I suggested at the top of the script, that also sets utf8 as the default encoding layer. As long as you're not getting warnings such as "Malformed UTF-8 character: \xbf" or "utf8 "\xBF" does not map to Unicode" (check the web server's error logs and/or run the script from the command line), then it would seem your file is probably encoded as UTF-8, and it would seem that everything is correct...
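
If checking the logs for those warnings is inconvenient, one possible way to test whether a chunk of raw bytes is well-formed UTF-8 is a strict decode with the core Encode module. The helper name looks_like_utf8 and the sample strings are hypothetical, and the bytes would have to be read without any decoding layer (for example via a :raw handle):

```perl
use strict;
use warnings;
use Encode qw/decode/;

# Returns true if the given byte string is well-formed UTF-8.
# (Hypothetical helper, not part of any module.)
sub looks_like_utf8 {
    my ($bytes) = @_;   # a copy: decode() with a CHECK value may modify its argument
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my $utf8_bytes   = "conexi\xc3\xb3n";  # "conexión" encoded as UTF-8
my $latin1_bytes = "conexi\xf3n";      # "conexión" encoded as Latin-1

printf "UTF-8 bytes:   %s\n", looks_like_utf8($utf8_bytes)   ? "valid" : "not valid";
printf "Latin-1 bytes: %s\n", looks_like_utf8($latin1_bytes) ? "valid" : "not valid";
```

A "not valid" result would suggest the file is in some other encoding, such as Latin-1, and needs a different :encoding(...) layer.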

      So are you still seeing the same problem as before, or is it working now?

        \x{bf} or U+00BF is the inverted question mark and U+00F3 is “ó”, so AFAICT that looks correct, doesn't it?

        Yes it looks correct in my browser.

        Ok, so do I understand correctly that you've tested the code you showed within the script I posted?

        Yes.

        (BTW, why not use DBI to connect directly to the database instead of going through a file?)

        Working on that! The tool I'm currently using doesn't allow it, so I'm running a system command to execute the query. Researching direct DBI access but there are access issues within my org...

        So are you still seeing the same problem as before, or is it working now?

        Sadly problem still exists, yes. To summarize:

        foreach my $row (@sorted_urls){
            $csv->parse($row);
            my @els = $csv->fields;          # using Text::CSV
            my(@splits) = split('\|',$row);  # using split on same $row
            print $q->p($els[0]);            # prints wrong
            print $q->p($splits[0]);         # prints right
        }

        Honestly this is not worth wasting a lot of time on, though I really appreciate your help... thank you.
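
To narrow down a difference like the one summarized above, it might help to compare the two results field by field and dump only the ones that differ. The following is just a debugging sketch: the sample row is a shortened, hypothetical stand-in for the real data, and the Text::CSV options are the ones used earlier in the thread.

```perl
use strict;
use warnings;
use Text::CSV;
use Data::Dumper;
$Data::Dumper::Useqq = 1;   # show non-ASCII characters as escapes

# Hypothetical sample row, modelled on the string posted above.
my $row = "/search/\x{bf}Cuales son las partes?scope=X|/doc/page.htm|0|1|1|0";

my $csv = Text::CSV->new({ binary => 1, sep_char => "|" })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;
$csv->parse($row) or die "parse failed: " . $csv->error_diag;

my @els    = $csv->fields;        # via Text::CSV
my @splits = split /\|/, $row;    # via split on the same $row

for my $i (0 .. $#splits) {
    my $same = defined $els[$i] && $els[$i] eq $splits[$i];
    printf "field %d: %s\n", $i, $same ? "same" : "DIFFERENT";
    print Dumper($els[$i], $splits[$i]) unless $same;
}
```

If every field reports "same" here but the browser output still differs, the problem is more likely in how the values are printed than in the parsing itself.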
