
Re: CSV_XS and UTF8 strings

by Tux (Abbot)
on Oct 18, 2011 at 17:29 UTC (#932200)

in reply to CSV_XS and UTF8 strings

quote_space is just to decide if values that have space(s) are to be considered for quotation and has nothing to do with anything else.

    $ perl -MText::CSV_XS -wE'Text::CSV_XS->new({binary=>1,auto_diag=>1,quote_space=>0,eol=>"\n"})->print(*STDOUT,[undef,""," ",1,"a b "])'
    ,, ,1,a b 
    $ perl -MText::CSV_XS -wE'Text::CSV_XS->new({binary=>1,auto_diag=>1,quote_space=>1,eol=>"\n"})->print(*STDOUT,[undef,""," ",1,"a b "])'
    ,," ",1,"a b "

There is no option (yet) to prevent quotation for (valid) UTF8 other than the aforementioned undef, which will disable quotation altogether, which is wrong in most cases.

    $ perl -MText::CSV_XS -C3 -wE'Text::CSV_XS->new({binary=>1,auto_diag=>1,quote_space=>0,eol=>"\n"})->print(*STDOUT,[undef,""," ",1,"a b ","\x{20ac}"])'
    ,, ,1,a b ,"€"
    $ perl -MText::CSV_XS -C3 -wE'Text::CSV_XS->new({binary=>1,auto_diag=>1,quote_space=>1,eol=>"\n"})->print(*STDOUT,[undef,""," ",1,"a b ","\x{20ac}"])'
    ,," ",1,"a b ","€"

Would you be so kind as to explain why you consider quotation around Unicode strings to be wrong? In theory, quotation never harms.

Enjoy, Have FUN! H.Merijn

Replies are listed 'Best First'.
Re^2: CSV_XS and UTF8 strings
by beerman (Novice) on Oct 18, 2011 at 18:42 UTC
    I understand the rules for proper CSV formats and thus know that putting double quotes around strings with spaces is correct according to those rules. My concern is that the original CSV file does not have any double quotes around strings with spaces. It is an English resource file, and I'm creating a Japanese resource file from it. The program reading the CSV files may have problems when it encounters double quotes around a Japanese string, since the original English string did not have them. I know I can tell the developer that the program should be able to handle properly formatted CSV, but working with the developers is a hassle, so if I could create the Japanese CSV with the same formatting, then I wouldn't have to worry about whether their program works with double quotes around the Japanese string.

    I also do a lot of work with Unicode and do get frustrated when there are inconsistencies across languages. Characters are characters, and it should not matter what language they belong to. Unfortunately, there is an inconsistency in the behavior of "quote_space => 0". As demonstrated in my data examples, a data file with just English (ASCII characters) processed by my script comes out in exactly the same format: if a string with spaces did not have quotes, the new file carries that over. BUT if the data file has Unicode (UTF8) characters with spaces, then the formatting changes and double quotes are added to that string, even though the purpose of "quote_space => 0" is to not add these quotes.

      Maybe you would be happier with just join ",", ... instead of a module meant to produce properly formatted CSV (since you don't seem to actually want properly formatted CSV). Your desired output doesn't sound even close to rocket surgery, so I don't see much point requiring the module.

      - tye        

      So what you want is a new option to disable the need for quotation on characters with code-points > 127?

      Note that the quote_space isn't even tested when writing the fields with the utf-8 characters. It is just tested when a space is encountered inside a field. While scanning a field, there is a flag that is set when quotation is required. When the flag has been set already by whatever other trigger, further tests are skipped. In your example that flag was already triggered by the first "binary" character, so the quote_space is effectively a no-op in your code.

      I'm however not sure that I want to implement such a new feature as it will potentially create invalid CSV. OTOH it will be an option that is only used on writing CSV, which is relatively easy to change.

      The current quote trigger is like:

          if (c < csv->first_safe_char ||
              (c >= 0x7f && c <= 0xa0) ||
              (csv->quote_char && c == csv->quote_char) ||
              (csv->sep_char && c == csv->sep_char) ||
              (csv->escape_char && c == csv->escape_char)) {
              /* Binary character */
              break;
              }

      A new flag could make that into something like

          if (c < csv->first_safe_char ||
              (csv->quote_binary && c >= 0x7f && c <= 0xa0) ||
              (csv->quote_char && c == csv->quote_char) ||
              (csv->sep_char && c == csv->sep_char) ||
              (csv->escape_char && c == csv->escape_char)) {
              /* Binary character */
              break;
              }

      Leaving it safe for all ASCII binary. I could do that.

      update done

          Text-CSV_XS $ cat
          use strict;
          use warnings;
          binmode STDOUT, ":utf8";
          use Text::CSV_XS;
          my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, eol => "\n" });
          $csv->quote_binary (1); # default
          $csv->print (*STDOUT, [ undef, "", " ", 1, "a b ", "\x{20ac}" ]);
          $csv->quote_binary (0);
          $csv->print (*STDOUT, [ undef, "", " ", 1, "a b ", "\x{20ac}" ]);
          Text-CSV_XS $ perl -Iblib/{lib,arch}
          ,," ",1,"a b ","€"
          ,," ",1,"a b ",€
          Text-CSV_XS $

      Enjoy, Have FUN! H.Merijn
        No, I am not asking at all for an option to disable the need for quotation on characters with code-points > 127. I expect consistent functionality regardless of the characters used. As illustrated if I have an input file such as
        this is field 1,this is field 2, this is field 3
        My script will write a new file that is exactly the same as the input file. Now if I change the ASCII letter 'e' to e with acute (U+00E9),
        this is fiéld 1,this is fiéld 2, this is fiéld 3
        guess what, it still works the same as with just ASCII. That is, the output file created by my script is exactly the same as the input file, with no double quotes around any field. But if I add one Japanese character to any one of the items, that item will have double quotes around it in the new file. So the inconsistency is even worse than I expected: "quote_space => 0" does work for some characters above 0x7F, but not for all of them. My data file is UTF8, so the e acute is two bytes whereas the Japanese character is three bytes, but again, I'd like to think that all UTF8 data is treated the same.

        In conclusion, I want properly formatted CSV and I expect double quotes around strings when needed, but my testing shows a lot of inconsistency in the behavior of quote_space => 0 depending on the type of characters in the string.

        Did some more testing, and it appears that the characters for which "quote_space => 0" works properly are the printable characters in the range 0x00 - 0xFF (Basic Latin and Latin-1 Supplement). Fields with a character at or above 0x0100 (the start of Latin Extended-A) always get double quotes around them, whether needed or not. So quote_space => 0 stops working as expected with characters starting at 0x0100.
