Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Re^3: CSV_XS and UTF8 strings

by Tux (Monsignor)
on Oct 19, 2011 at 06:44 UTC ( #932304=note: print w/ replies, xml ) Need Help??


in reply to Re^2: CSV_XS and UTF8 strings
in thread CSV_XS and UTF8 strings

So what you want is a new option to disable the need for quotation on characters with code-points > 127?

Note that the quote_space isn't even tested when writing the fields with the utf-8 characters. It is just tested when a space is encountered inside a field. While scanning a field, there is a flag that is set when quotation is required. When the flag has been set already by whatever other trigger, further tests are skipped. In your example that flag was already triggered by the first "binary" character, so the quote_space is effectively a no-op in your code.

I'm however not sure that I want to implement such a new feature as it will potentially create invalid CSV. OTOH it will be an option that is only used on writing CSV, which is relatively easy to change.

The current quote trigger is like:

if (c < csv->first_safe_char || (c >= 0x7f && c <= 0xa0) || (csv->quote_char && c == csv->quote_char) || (csv->sep_char && c == csv->sep_char) || (csv->escape_char && c == csv->escape_char)) { /* Binary character */ break; }

A new flag could make that into something like

if (c < csv->first_safe_char || (csv->quote_binary && c >= 0x7f && + c <= 0xa0) || (csv->quote_char && c == csv->quote_char) || (csv->sep_char && c == csv->sep_char) || (csv->escape_char && c == csv->escape_char)) { /* Binary character */ break; }

Leaving it safe for all ASCII binary. I could do that.

update done

Text-CSV_XS $ cat test.pl use strict; use warnings; binmode STDOUT, ":utf8"; use Text::CSV_XS; my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, eol => "\n +" }); $csv->quote_binary (1); # default $csv->print (*STDOUT, [ undef, "", " ", 1, "a b ", "\x{20ac}" ]); $csv->quote_binary (0); $csv->print (*STDOUT, [ undef, "", " ", 1, "a b ", "\x{20ac}" ]); Text-CSV_XS $ perl -Iblib/{lib,arch} test.pl ,," ",1,"a b ","€" ,," ",1,"a b ",€ Text-CSV_XS $

Enjoy, Have FUN! H.Merijn


Comment on Re^3: CSV_XS and UTF8 strings
Select or Download Code
Re^4: CSV_XS and UTF8 strings
by beerman (Novice) on Oct 19, 2011 at 16:46 UTC
    No, I am not asking at all for an option to disable the need for quotation on characters with code-points > 127. I expect consistent functionality regardless of the characters used. As illustrated if I have an input file such as
    this is field 1,this is field 2, this is field 3
    My script will write a new file that is exactly the same as the input file. Now if I change the ASCII letter 'e' to e with acute (U+00E9),
    this is fiéld 1,this is fiéld 2, this is fiéld 3
    guess what, it still works the same as with just ASCII. That is the output file is exactly the same as the input file. The output file created with my script has no double quotes around any field. It looks the same as the input file. But, if I add one Japanese character to any one of the items, that item with the Japanese character will have double quotes around it in the new file. So the inconsistency is even worse then I expected as the command "quote_space => 0 does work for some characters above 0x7F but not for all characters. My data file is UTF8 so the e acute is two bytes in UTF8 where as the Japanese character is 3 bytes but again, I'd like to think that all UTF8 data is treated the same. In conclusion, I want properly formatted CSV, I expect double quotes around strings when needed but my testing shows that there is a lot of inconsistency with the use of quote_space => 0 depending on the type of characters in the string.

    Did some more testing and it appears that the characters that "quote_space => 0" works properly are printable characters in the range 0x00 - 0xFF (Basic Latin and Latin 1 Supplement). Fields with a character above 0x0100 (starts with Latin Extended A) will always get double quotes around them regardless if needed or not. So the function quote_space => 0 stops working as expected with characters starting at 0x0100.

      You obviously do not read my replies, or you do not understand them (at all).

      Install Text::CSV_XS from this archive, and call your constructor as:

      my $csv = text::CSV_XS->new ({ binary => 1, auto_diag => 1, quote_space => 0, quote_binary => 0, });

      and I am sure you gat a long way towards your (wrong) expectation of what CSV should be.

      What you describe as wrong is expected and correct behavior. The fact that it doesn't look like the original is quite something else. Text::CSV_XS and Text::CSV offer a plethora of options and attributes to make it (more) behave as end-users expect or want it to behave, but the default is correct, even if it does not produce exactly what the source happened to be.

      If it still doesn't fit your needs, and my new attribute is still unsatisfactory for your idea of correctness, I suppose you will have to look for handcrafted solutions and not use Text::CSV_XS.


      Enjoy, Have FUN! H.Merijn
        Tux, thanks for the code. It looks like it will work and I'll try it out tomorrow. I've developed with Unicode in many programming languages and what confuses me is the term binary. I'm not sure if this is unique to Text::CSV_XS or if it related to all perl modules. This is a strange term when speaking of Unicode characters. Thus, I looked up what is really meant by binary and according to the Text::CSV_XS documentation binary is when any byte in the range \x00-\x08,\x10-\x1F,\x7F-\xFF is found but the range you specified in the trigger is (c >= 0x7f && c <= 0xa0). The documentation goes on to say that "If a string is marked UTF8, binary will be turned on automatically when binary characters other than CR or NL are encountered. Note that a simple string like "\x{00a0}" might still be binary, but not marked UTF8, so setting { binary = 1 }> is still a wise option.". I was using the binary = 1 setting along with quote_space => 0, so from a Unicode development standpoint there is a strange behavior when the results differ for a simple CSV file with just one row, one field such as
        This is X test
        If X is a printable character in the range from 0x0000 to 0x00FF, the quote_space => 0 works as expected and the field will not have double quotes in the output but if the X is a printable character > 0x00FF the field has double quotes around it. All Unicode developers expect the behavior of quote_space => 0 to be the same for all printable characters. Unicode people think that a letter is a letter regardless of what writing script (Hangul, Han, Cyrillic, Hebrew, etc.) it comes from. It is even stranger that the setting quote_space => 0 works with characters from Basic & Latin1 but then it starts to act differently once you get into the Unicode block for Extended Latin A.

        Again, thanks for going the extra mile and providing the new constructor.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://932304]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (5)
As of 2014-08-23 08:56 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (173 votes), past polls