Text::CSV and Unicode

by vsespb (Chaplain)
on Nov 11, 2013 at 10:19 UTC (#1061971=perlquestion)
vsespb has asked for the wisdom of the Perl Monks concerning the following question:

use strict;
use warnings;
use Text::CSV;
use Data::Dumper;
use Devel::Peek;

open my $f, ">", "test.tmp";
print $f "\xD0\x81";
close $f;

open my $io, "<", "test.tmp" or die "$!";
my $csv = Text::CSV->new ({ binary => 1, eol => "\012" });
while (my $row = $csv->getline ($io)) {
    print Dumper $row;
    Dump $row;
}
$VAR1 = [
          "\x{401}"
        ];
SV = IV(0x245c998) at 0x245c9a8
  REFCNT = 1
  FLAGS = (PADMY,ROK)
  RV = 0x2550980
  SV = PVAV(0x23f49b8) at 0x2550980
    REFCNT = 2
    FLAGS = ()
    ARRAY = 0x24bdab0
    FILL = 0
    MAX = 3
    ARYLEN = 0x0
    FLAGS = (REAL)
    Elt No. 0
    SV = PV(0x23f22a0) at 0x2412470
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x260e200 "\320\201"\0 [UTF8 "\x{401}"]
      CUR = 2
      LEN = 16
This means that it interprets the input as UTF-8 data and decodes it into a Perl character string.
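The difference between the two states in the dump can be reproduced without Text::CSV. This is a minimal sketch, assuming only Perl's built-in utf8:: functions; it illustrates what the UTF8 flag on a scalar looks like, not how the module works internally.

```perl
use strict;
use warnings;

my $bytes = "\xD0\x81";    # the two raw octets written to test.tmp
my $chars = $bytes;        # work on a copy so the octet string stays untouched
utf8::decode($chars);      # decode in place; succeeds only if the octets are valid UTF-8

# The octet string has two elements and no UTF8 flag.
printf "bytes: length=%d utf8_flag=%d\n",
    length $bytes, utf8::is_utf8($bytes) ? 1 : 0;

# The decoded string has one character (U+0401) and carries the UTF8 flag.
printf "chars: length=%d utf8_flag=%d ord=0x%X\n",
    length $chars, utf8::is_utf8($chars) ? 1 : 0, ord $chars;
```

The second line of output corresponds to the `FLAGS = (POK,pPOK,UTF8)` entry and the `[UTF8 "\x{401}"]` annotation in the Devel::Peek dump.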

The question is: why?
In the documentation I see:
On parsing (both for getline () and parse ()), if the source is marked being UTF8, then all fields that are marked binary will also be marked UTF8.

But I did not "mark" anything as UTF-8. And what is "marking" here?

How can I control this behaviour, that is, choose when to parse as binary data and when as character strings?

Note that I don't want to use Text::CSV::Encoded, as it's broken now.

Re: Text::CSV and Unicode
by Tux (Abbot) on Nov 11, 2013 at 10:45 UTC

    This has come up before, and as of Text::CSV_XS version 1.00 the behavior is consistent. It does not, however, meet your current needs. I uploaded version 1.02 a minute ago; it has a new attribute, decode_utf8, that lets you disable the default behavior (which has proven to be what most people want and expect).

    decode_utf8

    This attribute defaults to TRUE.

    While parsing, fields that are valid UTF-8 are automatically set to be UTF-8, so that

        $csv->parse ("\xC4\xA8\n");

    results in

        PV("\304\250"\0) [UTF8 "\x{128}"]

    Sometimes it might not be a desired action. To prevent those upgrades, set this attribute to false, and the result will be

        PV("\304\250"\0)
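The attribute described above can be tried side by side with the default. This is a minimal sketch, assuming Text::CSV_XS 1.02 or later is installed; `first_field` is a hypothetical helper introduced only for illustration.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# first_field is a hypothetical helper: parse one line and return its first field.
sub first_field {
    my ($csv, $line) = @_;
    $csv->parse($line) or die "parse failed: " . $csv->error_diag;
    return ($csv->fields)[0];
}

my $octets = "\xC4\xA8";    # UTF-8 encoding of U+0128, as in the documentation excerpt

# decode_utf8 is left at its default (true): valid UTF-8 fields get decoded.
my $decoding = Text::CSV_XS->new({ binary => 1 });
# decode_utf8 switched off: fields stay as raw octets.
my $raw      = Text::CSV_XS->new({ binary => 1, decode_utf8 => 0 });

my $as_chars  = first_field($decoding, $octets);
my $as_octets = first_field($raw,      $octets);

printf "default:          length=%d utf8_flag=%d\n",
    length $as_chars,  utf8::is_utf8($as_chars)  ? 1 : 0;
printf "decode_utf8 => 0: length=%d utf8_flag=%d\n",
    length $as_octets, utf8::is_utf8($as_octets) ? 1 : 0;
```

Going by the documentation excerpt, the default parser should return a one-character string with the UTF8 flag set, while the parser created with `decode_utf8 => 0` should return the two original octets unflagged.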

    I realize that "most people" is not "all people", and I cannot make a default that makes everyone happy. That is also the reason why I waited with 1.02. I asked many users what the default should be, checked the historical entries in RT and my mail, and came to the conclusion that nowadays the majority works with UTF-8 CSV more than with binary CSV. The change in 1.00 was not to enable UTF-8 or to disable it; the change was to make it work more consistently.

    Enjoy, Have FUN! H.Merijn
      OK, thank you! I will try the new version.
      "conclusion that nowadays the majority works with UTF8 CSV"

      That's no problem; I just wanted this behaviour to be clearly documented (I did not want to rely on something undocumented).
        Oddly, my similar issues with Text::CSV were resolved when I simply installed Text::CSV_XS. There must be a shared library that gets updated or something...
Re: Text::CSV and Unicode
by Jim (Curate) on Nov 11, 2013 at 17:57 UTC

Node Type: perlquestion [id://1061971]
Approved by Corion
Front-paged by Arunbear