http://www.perlmonks.org?node_id=929103

Jim has asked for the wisdom of the Perl Monks concerning the following question:

This script…

#!perl

use strict;
use warnings;
use open qw( :encoding(UTF-8) :std );

use Text::CSV_XS;

my $csv = Text::CSV_XS->new({
    always_quote => 1,
    binary       => 1,
    eol          => $/,
});

while (my $fields = $csv->getline(*ARGV)) {
    $csv->print(\*STDOUT, $fields);
}
$csv->eof() or $csv->error_diag();

exit 0;

…produces this output…

"BLACK SPADE SUIT","BLACK HEART SUIT","BLACK DIAMOND SUIT","BLACK CLUB SUIT"
"♠","♥","♦","♣"

…from this UTF-8 input…

"BLACK SPADE SUIT","BLACK HEART SUIT","BLACK DIAMOND SUIT","BLACK CLUB SUIT"
"♠","♥","♦","♣"

Why?

If I remove the use open statement, the output is identical to the input. This is exactly the reverse of the behavior I expect.

I'm using Strawberry Perl 5.12 for Windows and Text::CSV_XS version 0.85.

(I don't know the answer to the question I'm asking despite having dutifully read all the relevant documentation. Just pointing me toward a perldoc page won't help me. Thanks.)

Replies are listed 'Best First'.
Re: Why Doesn't Text::CSV_XS Print Valid UTF-8 Text When Used With the open Pragma?
by Tux (Canon) on Oct 02, 2011 at 07:40 UTC

    This is a known bug, and it has just recently been fixed. There is nothing that Text::CSV_XS can do about it. It should "just work" (TM) if you can upgrade IO to the version that includes this patch.
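
    A quick way to see which IO you currently have installed (just a check; I can't say off-hand which release number carries the fix):

        use IO;
        print "IO version: ", IO->VERSION, "\n";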


    Enjoy, Have FUN! H.Merijn

      Thank you very much, Tux! Thankfully, one can still get kind and gracious help with Perl problems from Perl experts on PerlMonks.

      Upgrading to a bleadperl version of IO isn't practical for me in my environment. So for now, I'll simply not use the open pragma to set default I/O layers and, instead, open files and set I/O layers explicitly in my Perl programs, like this…

      #!perl

      use strict;
      use warnings;
      use autodie qw( open close );
      use English qw( -no_match_vars );

      use Text::CSV_XS;

      my $csv = Text::CSV_XS->new({
          always_quote => 1,
          binary       => 1,
          eol          => $INPUT_RECORD_SEPARATOR,
      });

      binmode STDOUT, ':encoding(UTF-8)';

      for my $file (@ARGV) {
          open my $fh, '<:encoding(UTF-8)', $file;
          while (my $fields = $csv->getline($fh)) {
              $csv->print(\*STDOUT, $fields);
          }
          $csv->eof() or $csv->error_diag();
          close $fh;
      }

      exit 0;

      Now if I could just figure out how best to handle UTF-8 CSV files that have byte order marks in them. ;-) Text::CSV_XS alone chokes on them. I'm currently doing this…

      use File::BOM qw( open_bom );

      open my $input_fh,  '<:via(File::BOM)',                 $input_file;
      open my $output_fh, '>:encoding(UTF-8):via(File::BOM)', $output_file;

      Is this The Right Way?

        Let me try to simplify that a bit ...

        use English qw( -no_match_vars );   # for $INPUT_RECORD_SEPARATOR
        use Text::CSV_XS;

        my $csv = Text::CSV_XS->new ({
            auto_diag    => 1,   # Let Text::CSV_XS do the analysis
            always_quote => 1,
            binary       => 1,
            eol          => $INPUT_RECORD_SEPARATOR,
            });

        binmode STDOUT, ':encoding(UTF-8)';

        for my $file (@ARGV) {
            open my $fh, '<:encoding(UTF-8)', $file;
            while (my $fields = $csv->getline ($fh)) {
                $csv->print (*STDOUT, $fields);   # no need for a reference
                }
            # due to auto_diag, no need for error checking here
            close $fh;
            }

        If this script is to sanitize CSV data, I'd advise TWO csv objects: one for parsing, which does not pass the always_quote and eol attributes, and one for output. The advantage is that all legal line endings are parsed automatically, even when mixed.
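
        Just a sketch of that two-object split, assuming the same usage as above (UTF-8 CSV files named in @ARGV, sanitized copy written to STDOUT):

            use strict;
            use warnings;
            use Text::CSV_XS;

            # Parser: no eol attribute, so the line endings the module
            # recognizes are accepted automatically, even when mixed
            # within one file.
            my $in  = Text::CSV_XS->new ({ auto_diag => 1, binary => 1 });

            # Writer: quotes every field and terminates each record with "\n".
            my $out = Text::CSV_XS->new ({ auto_diag => 1, binary => 1,
                                           always_quote => 1, eol => "\n" });

            binmode STDOUT, ':encoding(UTF-8)';

            for my $file (@ARGV) {
                open my $fh, '<:encoding(UTF-8)', $file or die "$file: $!";
                while (my $fields = $in->getline ($fh)) {
                    $out->print (*STDOUT, $fields);
                }
                close $fh;
            }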

        I have no neater approach to the BOM problem than what you already use.


        Enjoy, Have FUN! H.Merijn
Re: Why Doesn't Text::CSV_XS Print Valid UTF-8 Text When Used With the open Pragma? ("XS")
by tye (Sage) on Oct 02, 2011 at 06:15 UTC

    You expect an XS module opening a file to fall into the category of "Any two-argument open(), readpipe() (aka qx//) and similar operators found within the lexical scope of this pragma will use the declared defaults" ?

    I don't.

    - tye        

      I expect there to be an easy way in Perl to use built-in default idioms, to assert that my input and output are in the UTF-8 character encoding form of Unicode, and to use CPAN modules, all at the same time, and without having to know what an "XS module" is.

      Specifically, I want to process many CSV files that I feed to the Perl program via @ARGV. I want to use the CPAN module Text::CSV_XS to parse the CSV records. I don't want to open and close files explicitly; I want Perl to open and close them for me implicitly. I want to continue to use Perl's built-in idioms that permit me to avoid needless extra programming, just as I always have.

        Your unexpected output looks like the ISO-8859-1 rendering of the SPADE characters. If you write the output to a text file and view it in your browser with UTF-8 encoding, you will probably see the SPADE.
        use Encode qw( encode );   # encode() is used below

        print qq("BLACK SPADE SUIT","BLACK HEART SUIT","BLACK DIAMOND SUIT","BLACK CLUB SUIT",\n);

        # decimal Unicode code points for the characters above
        my @ary = ("&#9824;", "&#9829;", "&#9830;", "&#9827;");

        foreach my $target (@ary) {
            $target =~ s/\&#(.*);/$1/;
            print '"' . encode('utf8', chr($target)) . '",';
        }
        print "\n";
        I mean, this is a terminal problem, isn't it?

      You expect an XS module opening a file ..

      But it isn't opening a file; it's reading from a filehandle. Sure, ARGV is magic, but CSV_XS isn't doing the opening.

        If Tux is correct and this has been "fixed", then I think the documentation for open.pm should be corrected. I certainly don't see how the offered code qualifies for:

        "Any two-argument open(), readpipe() (aka qx//) and similar operators found within the lexical scope of this pragma"

        I haven't dived into the guts (well, I have dived into guts related to open.pm but not recently and not in relation to this specific case), but it appears that the only thing within the lexical scope of the pragma is the passing of a file handle to an XS module. That XS module reads from the handle and the reading from the handle triggers "magic" (as you put it) that causes a file to be opened.

        The opening is not done by code within the lexical scope of the pragma. Perhaps the documentation should say that it impacts 'open' within the temporal scope of the pragma? I doubt it actually does that, though (that wouldn't match my memory of the guts the last time I dived into them).
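
        A tiny sketch of that lexical scoping (it just reopens the script's own file, $0, so there is something to open both inside and outside the pragma's scope):

            use strict;
            use warnings;

            {
                use open qw( :encoding(UTF-8) );
                # This open() is compiled inside the pragma's lexical scope,
                # so its layer list includes an encoding(...) layer.
                open my $inside, '<', $0 or die $!;
                print join(' ', PerlIO::get_layers($inside)), "\n";
                close $inside;
            }

            # The same open() outside the scope gets no encoding layer, and
            # neither would a read that happens somewhere else entirely,
            # such as the implicit open of the next @ARGV file during
            # readline.
            open my $outside, '<', $0 or die $!;
            print join(' ', PerlIO::get_layers($outside)), "\n";
            close $outside;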

        But if it isn't temporal scope, then I'm hard pressed to explain how it could actually work in this case. Perhaps somebody will explain it. I don't plan to spend time investigating this particular mystery.

        I doubt the original poster's expressed desire for ignorance will lead to success when dealing with UTF-8 streams. Unfortunately, UTF-8 was defined in a way and supported by Unix (and Perl) in ways that make handling it correctly very often require significant diving into a lot of details.

        - tye        

Re: Why Doesn't Text::CSV_XS Print Valid UTF-8 Text When Used With the open Pragma?
by BrowserUk (Patriarch) on Oct 02, 2011 at 06:19 UTC

    If you add -C to your command line and drop the open pragma, it might work.
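
    Presumably with an explicit option list, since per perlrun a bare -C is conditional on a UTF-8 locale; something like:

        perl -CSD script.pl input.csv > output.csv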


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I tried it. It doesn't work.

        Really? It should:

        -C [number/list]

            The -C flag controls some of the Perl Unicode features. As of 5.8.1,
            the -C can be followed either by a number or a list of option letters.
            The letters, their numeric values, and effects are as follows; listing
            the letters is equal to summing the numbers.

                I     1   STDIN is assumed to be in UTF-8
                O     2   STDOUT will be in UTF-8
                E     4   STDERR will be in UTF-8
                S     7   I + O + E
                i     8   UTF-8 is the default PerlIO layer for input streams
                o    16   UTF-8 is the default PerlIO layer for output streams
                D    24   i + o
                A    32   the @ARGV elements are expected to be strings encoded
                          in UTF-8
                L    64   normally the "IOEioA" are unconditional, the L makes
                          them conditional on the locale environment variables
                          (the LC_ALL, LC_TYPE, and LANG, in the order of
                          decreasing precedence) -- if the variables indicate
                          UTF-8, then the selected "IOEioA" are in effect
                a   256   Set ${^UTF8CACHE} to -1, to run the UTF-8 caching
                          code in debugging mode.

            For example, -COE and -C6 will both turn on UTF-8-ness on both STDOUT
            and STDERR. Repeating letters is just redundant, not cumulative nor
            toggling.

            The io options mean that any subsequent open() (or similar I/O
            operations) will have the :utf8 PerlIO layer implicitly applied to
            them, in other words, UTF-8 is expected from any input stream, and
            UTF-8 is produced to any output stream. This is just the default,
            with explicit layers in open() and with binmode() one can manipulate
            streams as usual.

            -C on its own (not followed by any number or option list), or the
            empty string "" for the PERL_UNICODE environment variable, has the
            same effect as -CSDL. In other words, the standard I/O handles and
            the default open() layer are UTF-8-fied but only if the locale
            environment variables indicate a UTF-8 locale. This behaviour follows
            the implicit (and problematic) UTF-8 behaviour of Perl 5.8.0.

        Maybe you should perlbug the error.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Why Doesn't Text::CSV_XS Print Valid UTF-8 Text When Used With the open Pragma?
by Anonymous Monk on Oct 02, 2011 at 08:23 UTC

    It's one or more perl bugs :/

    Magic diamond doesn't respect open pragma

    open pragma is broken

    Both, or something else.
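
    If it is the magic diamond issue, this stripped-down sketch of mine (no Text::CSV_XS at all; assumes a UTF-8 input file named on the command line) should show the same double-encoded output as the original script:

        #!perl

        use strict;
        use warnings;
        use open qw( :encoding(UTF-8) :std );

        # The pragma is supposed to make the magic ARGV handle decode UTF-8
        # input and STDOUT encode UTF-8 output.  If the implicit open of the
        # @ARGV files ignores the pragma, the raw input bytes get encoded a
        # second time on the way out.
        while (my $line = <ARGV>) {
            print $line;
        }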