Re^2: Why Doesn't Text::CSV_XS Print Valid UTF-8 Text When Used With the open Pragma?

by Jim (Curate)
on Oct 02, 2011 at 21:11 UTC


in reply to Re: Why Doesn't Text::CSV_XS Print Valid UTF-8 Text When Used With the open Pragma?
in thread Why Doesn't Text::CSV_XS Print Valid UTF-8 Text When Used With the open Pragma?

Thank you very much, Tux! Thankfully, one can still get kind and gracious help with Perl problems from Perl experts on PerlMonks.

Upgrading to a bleadperl version of IO isn't practical for me in my environment. So for now, I'll simply not use the open pragma to set default I/O layers and, instead, open files and set I/O layers explicitly in my Perl programs, like this…

#!perl
use strict;
use warnings;
use autodie qw( open close );
use English qw( -no_match_vars );
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({
    always_quote => 1,
    binary       => 1,
    eol          => $INPUT_RECORD_SEPARATOR,
});

binmode STDOUT, ':encoding(UTF-8)';

for my $file (@ARGV) {
    open my $fh, '<:encoding(UTF-8)', $file;
    while (my $fields = $csv->getline($fh)) {
        $csv->print(\*STDOUT, $fields);
    }
    $csv->eof() or $csv->error_diag();
    close $fh;
}

exit 0;

Now if I could just figure out how best to handle UTF-8 CSV files that have byte order marks in them. ;-) Text::CSV_XS alone chokes on them. I'm currently doing this…

use File::BOM qw( open_bom );

open my $input_fh, '<:via(File::BOM)', $input_file;
open my $output_fh, '>:encoding(UTF-8):via(File::BOM)', $output_file;

Is this The Right Way?
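
For comparison, a leading BOM can also be skipped by hand on input using only core Perl. This is a minimal sketch, not a replacement for File::BOM: it assumes the input is always UTF-8 (no UTF-16 or UTF-32 detection), and the helper name open_utf8_skip_bom is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: open a UTF-8 file for reading and skip a leading BOM.
# Assumes the file is UTF-8; other encodings are not detected.
sub open_utf8_skip_bom {
    my ($path) = @_;

    # Start in raw mode so the first three bytes can be inspected as bytes.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $got = read $fh, my $head, 3;
    die "Cannot read $path: $!" unless defined $got;

    # Rewind unless those bytes are the UTF-8 encoding of U+FEFF.
    seek $fh, 0, 0 unless $head eq "\xEF\xBB\xBF";

    # From here on, decode everything that is read as UTF-8.
    binmode $fh, ':encoding(UTF-8)';
    return $fh;
}
```

The returned handle can be passed to $csv->getline in place of the handle opened with '<:via(File::BOM)'. It does nothing for output, so writing a BOM would still need File::BOM or an explicit print of "\x{FEFF}".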


Re^3: Why Doesn't Text::CSV_XS Print Valid UTF-8 Text When Used With the open Pragma?
by Tux (Monsignor) on Oct 03, 2011 at 06:37 UTC

    Let me try to simplify that a bit ...

    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new ({
        auto_diag    => 1, # Let Text::CSV_XS do the analysis
        always_quote => 1,
        binary       => 1,
        eol          => $INPUT_RECORD_SEPARATOR,
        });

    binmode STDOUT, ':encoding(UTF-8)';

    for my $file (@ARGV) {
        open my $fh, '<:encoding(UTF-8)', $file;
        while (my $fields = $csv->getline ($fh)) {
            $csv->print (*STDOUT, $fields); # no need for a reference
            }
        # due to auto_diag, no need for error checking here
        close $fh;
        }

    If this script is meant to sanitize CSV data, I'd advise using TWO csv objects: one for parsing, which does not pass the always_quote and eol attributes, and one for output. The advantage is that all legal line endings are then parsed correctly and automatically, even if they are mixed within one file.
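
The two-object arrangement might look roughly like the sketch below. The names $in_csv, $out_csv and sanitize_csv are illustrative only, and "\n" is an assumed choice for the output line ending.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Parser: eol is left unset, so "\n", "\r\n" and "\r" are all accepted.
my $in_csv = Text::CSV_XS->new({
    binary    => 1,
    auto_diag => 1,
});

# Writer: fixed output conventions (the "\n" line ending is an assumption).
my $out_csv = Text::CSV_XS->new({
    binary       => 1,
    auto_diag    => 1,
    always_quote => 1,
    eol          => "\n",
});

# Hypothetical helper: copy every record from $in_fh to $out_fh.
sub sanitize_csv {
    my ($in_fh, $out_fh) = @_;
    while (my $fields = $in_csv->getline($in_fh)) {
        $out_csv->print($out_fh, $fields);
    }
    return;
}
```

Called as sanitize_csv($fh, *STDOUT) inside the loop over @ARGV, this would replace the single-object getline/print loop.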

    I have no neat solution to the BOM problem other than what you already use.


    Enjoy, Have FUN! H.Merijn

      Thank you, again, Tux. I genuinely appreciate the tips. I'll brush up on auto_diag.

      The BOM is a nuisance, especially in CSV files. In one of my real programs that uses Text::CSV_XS (what I posted here is a reduction that simply demonstrates a specific problem I was having), I'm stymied by the confluence of byte order marks in UTF-8 files that force me to use File::BOM and, unfortunately, some malformed UTF-8 text in the data that kills CSV parsing with this unforgiving error message:

      utf8 "\xEC" does not map to Unicode at C:/strawberry/perl/lib/Encode.pm line 176.

      I don't know how to tell Text::CSV_XS or File::BOM to tell Encode to lighten up already about one or two bogus characters! :-(
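
One possible workaround, sketched here under the assumption that the file is small enough to slurp, is to clean the bytes before Text::CSV_XS ever sees them. Encode's decode with FB_DEFAULT substitutes U+FFFD for malformed sequences instead of dying, and the BOM can be dropped in the same pass. The helper name clean_utf8_bytes is hypothetical, and this bypasses File::BOM rather than configuring it.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: take the raw bytes of a file and return well-formed
# UTF-8 bytes with malformed sequences replaced and any leading BOM removed.
sub clean_utf8_bytes {
    my ($raw) = @_;

    # FB_DEFAULT replaces each malformed sequence with U+FFFD and carries on.
    my $text = decode('UTF-8', $raw, Encode::FB_DEFAULT);

    # Drop a leading byte order mark, if there is one.
    $text =~ s/\A\x{FEFF}//;

    return encode('UTF-8', $text);
}
```

The cleaned bytes can then be read through an in-memory handle, for example open my $fh, '<:encoding(UTF-8)', \$clean, and passed to $csv->getline as usual. The bogus characters show up in the data as U+FFFD, so they remain easy to find afterwards.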
