
CSV_XS and UTF8 strings

by beerman (Novice)
on Oct 18, 2011 at 15:39 UTC (#932179=perlquestion)
beerman has asked for the wisdom of the Perl Monks concerning the following question:

The code below just reads and writes a CSV file. If the file 'test.csv' contains only ASCII characters, the output file 'new.csv' will not have any quotes around strings with embedded spaces. This is the expected behavior, as the code uses quote_space => 0. But if a string has non-ASCII characters and embedded spaces, the string in the output file is wrapped in quotes. This looks like a bug in the Text::CSV_XS module, as I expect the same results with quote_space => 0 regardless of the type of characters in the string. I really don't want the quotes, so my question is: how do I get this code to work with both ASCII and non-ASCII (UTF-8) data?

I want the code to produce no quotes around text that has embedded spaces. This works for ASCII data but not when the text has UTF-8 characters (sample below): a string containing UTF-8 characters and embedded spaces gets double quotes around it. This is the wrong behavior when using quote_space => 0.

This is test.csv. It has Japanese characters (UTF-8):

hi,bye,test is great,test what is your name,is,これ 試験 this is a test,test

The file 'new.csv' created by the script looks like this:

hi,bye,test is great,test what is your name,is,"これ 試験 this is a test",test

The problem is the double quotes around the string with Japanese characters. With quote_space => 0, I expect that no string, regardless of the type of characters it contains, gets double quotes just because it has embedded spaces.

    use Text::CSV_XS;
    use encoding 'utf8';

    my @rows;
    my $csv = Text::CSV_XS->new ({ quote_space => 0, binary => 1 })
        or die "Cannot use CSV: ".Text::CSV_XS->error_diag ();

    open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!";
    while (my $row = $csv->getline ($fh)) {
        push @rows, $row;
    }
    $csv->eof or $csv->error_diag ();
    close $fh;

    $csv->eol ("\r\n");
    open $fh, ">:encoding(utf8)", "new.csv" or die "new.csv: $!";
    $csv->print ($fh, $_) for @rows;
    close $fh or die "new.csv: $!";

Replies are listed 'Best First'.
Re: CSV_XS and UTF8 strings
by Tux (Abbot) on Oct 18, 2011 at 17:29 UTC

    quote_space is just to decide if values that have space(s) are to be considered for quotation and has nothing to do with anything else.

    $ perl -MText::CSV_XS -wE'Text::CSV_XS->new({binary=>1,auto_diag=>1,quote_space=>0,eol=>"\n"})->print(*STDOUT,[undef,""," ",1,"a b "])'
    ,, ,1,a b 
    $ perl -MText::CSV_XS -wE'Text::CSV_XS->new({binary=>1,auto_diag=>1,quote_space=>1,eol=>"\n"})->print(*STDOUT,[undef,""," ",1,"a b "])'
    ,," ",1,"a b "

    There is no option (yet) to prevent quotation for (valid) UTF8 other than the aforementioned quote_char => undef, which disables quotation altogether and is wrong in most cases.

    $ perl -MText::CSV_XS -C3 -wE'Text::CSV_XS->new({binary=>1,auto_diag=>1,quote_space=>0,eol=>"\n"})->print(*STDOUT,[undef,""," ",1,"a b ","\x{20ac}"])'
    ,, ,1,a b ,"€"
    $ perl -MText::CSV_XS -C3 -wE'Text::CSV_XS->new({binary=>1,auto_diag=>1,quote_space=>1,eol=>"\n"})->print(*STDOUT,[undef,""," ",1,"a b ","\x{20ac}"])'
    ,," ",1,"a b ","€"

    Would you be so kind as to explain why quotation around Unicode strings is wrong in your perception? In theory, quotation never harms.

    Enjoy, Have FUN! H.Merijn
      I understand the rules for proper CSV and know that putting double quotes around strings with spaces is correct according to those rules. My concern is that the original CSV file does not have any double quotes around strings with spaces. It is an English resource file, and I'm creating a Japanese resource file from it. The program reading the CSV files may have problems when it encounters double quotes around the Japanese strings, since the original English strings did not have them. I know I could tell the developers that their program should handle properly formatted CSV, but working with the developers is a hassle, so if I can create the Japanese CSV with the same formatting, then I won't have to worry about whether their program copes with the quotes.

      I also do a lot of work with Unicode and get frustrated by inconsistencies across languages. Characters are characters, and the language should not matter. Unfortunately, there is an inconsistency in the use of "quote_space => 0". As my data examples show, a file with only English (ASCII) text processed by my script comes out in exactly the same format: a string with spaces that had no quotes still has none in the new file. But if the file has Unicode (UTF-8) characters with spaces, then the formatting changes and double quotes are added to the string, even though the purpose of "quote_space => 0" is to not add these quotes.

        Maybe you would be happier with just join ",", ... instead of a module meant to produce properly formatted CSV (since you don't seem to actually want properly formatted CSV). Your desired output doesn't sound even close to rocket surgery, so I don't see much point requiring the module.

        - tye        
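For illustration, the plain join approach suggested above might look like the sketch below. The helper name naive_csv_line is made up for this example, and the check that refuses fields containing a comma, double quote, or line break is an added assumption, since such fields cannot be written correctly without quoting.

```perl
use strict;
use warnings;
use utf8;    # this source contains UTF-8 literals

# naive_csv_line is a hypothetical helper for this sketch only.
# It joins fields with commas and never adds quotes, so it refuses
# any field that could not be read back correctly without them.
sub naive_csv_line {
    my (@fields) = map { defined $_ ? $_ : "" } @_;
    for my $field (@fields) {
        die "field needs quoting: $field\n" if $field =~ /[",\r\n]/;
    }
    return join (",", @fields) . "\r\n";
}

binmode STDOUT, ":encoding(UTF-8)";
print naive_csv_line ("hi", "test is great", "これ 試験 this is a test");
```

This only stays valid CSV as long as every field is free of separators, quotes, and line breaks, which is the trade-off the reply points at.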

        So what you want is a new option to disable the need for quotation on characters with code-points > 127?

        Note that the quote_space isn't even tested when writing the fields with the utf-8 characters. It is just tested when a space is encountered inside a field. While scanning a field, there is a flag that is set when quotation is required. When the flag has been set already by whatever other trigger, further tests are skipped. In your example that flag was already triggered by the first "binary" character, so the quote_space is effectively a no-op in your code.

        I'm however not sure that I want to implement such a new feature as it will potentially create invalid CSV. OTOH it will be an option that is only used on writing CSV, which is relatively easy to change.

        The current quote trigger is like:

        if (c < csv->first_safe_char ||
            (c >= 0x7f && c <= 0xa0) ||
            (csv->quote_char  && c == csv->quote_char)  ||
            (csv->sep_char    && c == csv->sep_char)    ||
            (csv->escape_char && c == csv->escape_char)) {
            /* Binary character */
            break;
            }

        A new flag could make that into something like

        if (c < csv->first_safe_char ||
            (csv->quote_binary && c >= 0x7f && c <= 0xa0) ||
            (csv->quote_char  && c == csv->quote_char)  ||
            (csv->sep_char    && c == csv->sep_char)    ||
            (csv->escape_char && c == csv->escape_char)) {
            /* Binary character */
            break;
            }

        Leaving it safe for all ASCII binary. I could do that.
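To see why quote_space never gets a say for the Japanese field, here is a rough pure-Perl model of the trigger above. It is only a sketch, not the module's code: needs_quotes is a made-up name, the quote, separator, and escape characters are hard-coded to their defaults, and it assumes the first safe character is 0x21 when quote_space is on and 0x20 when it is off.

```perl
use strict;
use warnings;
use Encode qw(encode);

# needs_quotes is a hypothetical model of the per-byte quote trigger.
# It returns 1 as soon as any byte of the field would force quotation.
sub needs_quotes {
    my ($bytes, %opt) = @_;
    my $quote_space  = defined $opt{quote_space}  ? $opt{quote_space}  : 1;
    my $quote_binary = defined $opt{quote_binary} ? $opt{quote_binary} : 1;
    # Assumption: a space counts as unsafe only when quote_space is on.
    my $first_safe   = $quote_space ? 0x21 : 0x20;

    for my $c (unpack "C*", $bytes) {
        return 1 if $c < $first_safe
                 || ($quote_binary && $c >= 0x7f && $c <= 0xa0)
                 || $c == ord ('"')     # default quote_char and escape_char
                 || $c == ord (',');    # default sep_char
    }
    return 0;
}

# The UTF-8 encoding of the Japanese text contains bytes in 0x7f..0xa0.
my $jp = encode ("UTF-8", "\x{3053}\x{308c} \x{8a66}\x{9a13} this is a test");

print needs_quotes ("test is great", quote_space => 0), "\n";             # 0
print needs_quotes ($jp, quote_space => 0), "\n";                         # 1
print needs_quotes ($jp, quote_space => 0, quote_binary => 0), "\n";      # 0
```

In this model the ASCII field passes with quote_space off, while the Japanese field trips the 0x7f..0xa0 test on its first multi-byte character unless the proposed flag is turned off.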

        update done

        Text-CSV_XS $ cat
        use strict;
        use warnings;

        binmode STDOUT, ":utf8";

        use Text::CSV_XS;
        my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, eol => "\n" });

        $csv->quote_binary (1); # default
        $csv->print (*STDOUT, [ undef, "", " ", 1, "a b ", "\x{20ac}" ]);
        $csv->quote_binary (0);
        $csv->print (*STDOUT, [ undef, "", " ", 1, "a b ", "\x{20ac}" ]);
        Text-CSV_XS $ perl -Iblib/{lib,arch}
        ,," ",1,"a b ","€"
        ,," ",1,"a b ",€
        Text-CSV_XS $

        Enjoy, Have FUN! H.Merijn
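Applied to the original question, combining the two attributes might look like the sketch below. It assumes an installed Text::CSV_XS release that already provides quote_binary as shown in the update above; csv_line is a made-up helper name, and combine/string are used instead of a file handle to keep the example short.

```perl
use strict;
use warnings;
use utf8;    # this source contains UTF-8 literals
use Text::CSV_XS;

# Assumption: this Text::CSV_XS release supports the quote_binary attribute.
my $csv = Text::CSV_XS->new ({
    binary       => 1,
    auto_diag    => 1,
    quote_space  => 0,      # a space alone does not force quotation
    quote_binary => 0,      # bytes in 0x7f..0xa0 do not force quotation
    eol          => "\r\n",
    }) or die "Cannot use CSV: " . Text::CSV_XS->error_diag ();

# csv_line is a hypothetical helper for this sketch only.
sub csv_line {
    my ($fields) = @_;
    $csv->combine (@$fields) or die "combine failed: " . $csv->error_diag ();
    return $csv->string;
}

binmode STDOUT, ":encoding(UTF-8)";
print csv_line ([ "hi", "test is great", "これ 試験 this is a test" ]);
print csv_line ([ "needs,quotes", "plain" ]);
```

Fields that contain the separator or a quote character should still be quoted, so the output stays parseable for those cases.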
Re: CSV_XS and UTF8 strings
by Tux (Abbot) on Oct 18, 2011 at 15:45 UTC

    I read that three times and still have no idea what you want. Please post data-in and expected data-out to visualize the wish/expectation.

    Enjoy, Have FUN! H.Merijn
Re: CSV_XS and UTF8 strings
by Khen1950fx (Canon) on Oct 18, 2011 at 16:56 UTC
    To disable quotes, I use quote_char => undef in the constructor. Also, binmode after open helps.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV_XS;

    my $file = 'sample.csv';
    my @rows;
    my $csv = Text::CSV_XS->new( {
        quote_space => 0,
        quote_char  => undef,
        binary      => 1,
        auto_diag   => 1,
    } ) or die "Cannot use CSV: " . Text::CSV_XS->error_diag();

    open my $fh, '<:encoding(UTF-8)', $file or die "$!";
    binmode $fh, ':encoding(UTF-8)';
    while ( my $row = $csv->getline($fh) ) {
        push @rows, $row;
    }
    $csv->eof or $csv->error_diag();
    close $fh;

    $csv->eof;
    open $fh, '>:encoding(UTF-8)', 'new.csv' or die "$!";
    binmode $fh, ":encoding(UTF-8)";
    $csv->print( $fh, $_ ) for @rows;
    close $fh or die "$!";

    Output:

    hi,bye,test is great,testwhat is your name,is,これ　試験　this is a test,test

      Do you have an example where doubling the encoding () call helps?

      Setting quote_char to undef will cause any field that needs quotation, such as a field containing a newline or sep_char, to generate invalid CSV.

      Enjoy, Have FUN! H.Merijn
