PerlMonks  

Re: Why Doesn't Text::CSV_XS Print Valid UTF-8 Text When Used With the open Pragma? ("XS")

by tye (Cardinal)
on Oct 02, 2011 at 06:15 UTC (#929105)


in reply to Why Doesn't Text::CSV_XS Print Valid UTF-8 Text When Used With the open Pragma?

You expect an XS module opening a file to fall into the category of "Any two-argument open(), readpipe() (aka qx//) and similar operators found within the lexical scope of this pragma will use the declared defaults" ?

I don't.

- tye        


Re^2: Why Doesn't Text::CSV_XS Print Valid UTF-8 Text When Used With the open Pragma? ("XS")
by Jim (Curate) on Oct 02, 2011 at 06:45 UTC

    I expect there to be an easy way in Perl to use built-in default idioms, to assert that my input and output are in the UTF-8 character encoding form of Unicode, and to use CPAN modules, all at the same time, and without having to know what an "XS module" is.

    Specifically, I want to process many CSV files that I feed to the Perl program via @ARGV. I want to use the CPAN module Text::CSV_XS to parse the CSV records. I don't want to open and close files explicitly; I want Perl to open and close them for me implicitly. I want to continue to use Perl's built-in idioms that permit me to avoid needless extra programming, just as I always have.

      Your unexpected output looks like the SPADE characters rendered as ISO-8859-1. If you save the output to a text file and view it in your browser with UTF-8 encoding, you will probably see the SPADE.
      use Encode qw(encode);

      print qq("BLACK SPADE SUIT","BLACK HEART SUIT","BLACK DIAMOND SUIT","BLACK CLUB SUIT",\n);
      # decimal Unicode character references for the names above
      my @ary = ("&#9824;", "&#9829;", "&#9830;", "&#9827;");
      foreach my $target (@ary) {
          # strip the "&#" and ";" to leave the decimal code point
          $target =~ s/\&#(.*);/$1/;
          print '"' . encode('utf8', chr($target)) . '",';
      }
      print "\n";
      I mean, this is a terminal problem, isn't it?

        I mean, this is a terminal problem, isn't it?

        No. It looks similar, but no. The problem, in a nutshell: if you use warn "$ARGV $_ " for PerlIO::get_layers(*ARGV), you can see that ARGV doesn't get the utf8 IO layer; only STDIN gets it.

        $ perl ... utf8wobom.csv >bad
        utf8wobom.csv unix at ...
        utf8wobom.csv crlf at ...
        $ perl ... < utf8wobom.csv >good
        - unix at ...
        - crlf at ...
        - encoding(utf-8-strict) at ...
        - utf8 at ...

        In my non-utf terminal it shows

        $ ls -loanh good bad
        -rw-rw-rw- 1 0 115 2011-10-02 01:31 bad
        -rw-rw-rw- 1 0 103 2011-10-02 01:31 good
        $ diff good bad
        2c2
        < "Γ","Γ","Γ֪","Γ"
        ---
        > "├┬┬","├┬┬","├┬┬","├┬┬"
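The layer check described above can be tried in isolation. The following is a minimal sketch using only core Perl, not the original poster's program: the helper name `layers_of` and the temporary sample file are assumptions made for illustration. It compares a handle opened without a decoding layer to one opened with `:encoding(UTF-8)`.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not from the thread): list the PerlIO layers on a handle.
sub layers_of {
    my ($fh) = @_;
    return PerlIO::get_layers($fh);
}

# Write a one-line sample file containing BLACK SPADE SUIT as raw UTF-8 bytes.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':raw');
print {$tmp} qq("\xE2\x99\xA0"\n);
close $tmp or die "close: $!";

# Opened with no explicit layer: the line comes back as undecoded bytes.
open(my $raw, '<', $path) or die "open $path: $!";
my @raw_layers = layers_of($raw);
my $raw_line   = <$raw>;
close $raw;

# Opened with an explicit decoding layer: the three bytes become one character.
open(my $dec, '<:encoding(UTF-8)', $path) or die "open $path: $!";
my @dec_layers = layers_of($dec);
my $dec_line   = <$dec>;
close $dec;

print "raw layers:     ", join(' ', @raw_layers), "\n";
print "decoded layers: ", join(' ', @dec_layers), "\n";
print "raw length:     ", length($raw_line), "\n";
print "decoded length: ", length($dec_line), "\n";
```

If ARGV shows only the layers of the first handle, as in the "bad" run above, the text is reaching the CSV parser as bytes rather than as decoded characters.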
Re^2: Why Doesn't Text::CSV_XS Print Valid UTF-8 Text When Used With the open Pragma? ("XS")
by Anonymous Monk on Oct 02, 2011 at 07:49 UTC

    You expect an XS module opening a file ..

    But it isn't opening a file; it's reading from a filehandle. Sure, ARGV is magic, but CSV_XS isn't doing the opening.

      If Tux is correct and this has been "fixed", then I think the documentation for open.pm should be corrected. I certainly don't see how the offered code qualifies for:

      "Any two-argument open(), readpipe() (aka qx//) and similar operators found within the lexical scope of this pragma"

      I haven't dived into the guts (well, I have dived into guts related to open.pm but not recently and not in relation to this specific case), but it appears that the only thing within the lexical scope of the pragma is the passing of a file handle to an XS module. That XS module reads from the handle and the reading from the handle triggers "magic" (as you put it) that causes a file to be opened.

      The opening is not done by code within the lexical scope of the pragma. Perhaps the documentation should say that it impacts 'open' within the temporal scope of the pragma? I doubt it actually does that, though (that wouldn't match my memory of the guts the last time I dived into them).

      But if it isn't temporal scope, then I'm hard pressed to explain how it could actually work in this case. Perhaps somebody will explain it. I don't plan to spend time investigating this particular mystery.
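To make the lexical-scope point concrete, here is a minimal sketch of the documented behavior only. It does not explain how the ARGV magic interacts with the pragma. The sub names `open_outside` and `open_inside` are hypothetical, and the example assumes no PERLIO or PERL_UNICODE settings in the environment that would add default layers.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Compiled OUTSIDE the pragma's lexical scope: no default layers are added.
sub open_outside {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "open $path: $!";
    return $fh;
}

{
    use open IN => ':encoding(UTF-8)';

    # Compiled INSIDE the pragma's lexical scope: this open() names no layer
    # of its own, so it picks up the declared input default.
    sub open_inside {
        my ($path) = @_;
        open(my $fh, '<', $path) or die "open $path: $!";
        return $fh;
    }
}

# An empty temporary file is enough; only the layers are inspected.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp or die "close: $!";

my $fh_outside = open_outside($path);
my $fh_inside  = open_inside($path);

my @outside = PerlIO::get_layers($fh_outside);
my @inside  = PerlIO::get_layers($fh_inside);

print "outside the pragma: ", join(' ', @outside), "\n";
print "inside the pragma:  ", join(' ', @inside), "\n";
```

Both subs are called from the same place at run time, yet only the one compiled inside the block gets the encoding layer. That is what "lexical scope" means here, and it is why an open triggered from inside an XS module would not be expected to qualify.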

      I doubt the original poster's expressed desire for ignorance will lead to success when dealing with UTF-8 streams. Unfortunately, UTF-8 was defined in a way and supported by Unix (and Perl) in ways that make handling it correctly very often require significant diving into a lot of details.

      - tye        

        I don't plan to spend time investigating this particular mystery.

        If you're someone who has a deep understanding of how the Perl programming language works and the know-how to help fix it when it's broken—and I suspect you are—then perhaps you should spend time investigating this particular mystery. If you help make Perl more intuitive to use ("DWIM"), then you improve the language, which benefits the Perl community.

        Posting snarky, condescending responses to the earnest inquiries of casual Perl programmers on PerlMonks doesn't improve the language or help the Perl community, and so isn't the best use of a Perl expert's time. It's especially unhelpful if the obtuse point one is trying to make turns out to be wrong.

        I doubt the original poster's expressed desire for ignorance will lead to success when dealing with UTF-8 streams. Unfortunately, UTF-8 was defined in a way and supported by Unix (and Perl) in ways that make handling it correctly very often require significant diving into a lot of details.

        You're right that grappling with Unicode in Perl is too often unduly tricky and obscure. But in most cases, as in this case, simple, ordinary tasks should be more straightforward. After all, "Easy things should be easy and hard things should be possible." Reading and writing trivial CSV records encoded in UTF-8 is most assuredly an "easy thing," not a "hard thing," isn't it?

        (I'm the original poster, and I'm using Windows, not Unix. I made this clear in my original post. Also, UTF-8 is an ingenious encoding scheme that accomplishes its multiple objectives brilliantly. It wasn't defined in a way such that handling it correctly by programmers using modern programming languages and software libraries must inevitably be more difficult than handling text in any other character encoding by those same programmers. You can't blame Unicode here.)
