Re^2: CSV_XS and UTF8 strings

by beerman (Novice)
on Oct 18, 2011 at 18:42 UTC


in reply to Re: CSV_XS and UTF8 strings
in thread CSV_XS and UTF8 strings

I understand the rules for proper CSV and know that putting double quotes around strings with spaces is correct according to those rules. My concern is that the original CSV file does not have any double quotes around strings with spaces. It is an English resource file, and I'm creating a Japanese resource file from it. The program reading the CSV files may have problems when it encounters double quotes around a Japanese string, since the original English string did not have them. I know I could tell the developers that their program should handle properly formatted CSV, but working with them is a hassle, so if I can create the Japanese CSV with the same formatting then I won't have to worry about whether their program copes with the double quotes.

I also do a lot of work with Unicode and get frustrated by inconsistencies across languages. Characters are characters, and the language should not matter. Unfortunately, there is an inconsistency in the behavior of "quote_space => 0". As my data examples show, a data file with only English (ASCII) characters comes out of my script in exactly the same format: if a string with spaces had no quotes, the new file keeps it that way. BUT if the data file has Unicode (UTF-8) characters with spaces, then the formatting changes and double quotes are added to the string, even though the purpose of "quote_space => 0" is to not add these quotes.

Replies are listed 'Best First'.
Re^3: CSV_XS and UTF8 strings (join)
by tye (Sage) on Oct 18, 2011 at 18:57 UTC

    Maybe you would be happier with just join ",", ... instead of a module meant to produce properly formatted CSV (since you don't seem to actually want properly formatted CSV). Your desired output doesn't sound even close to rocket surgery, so I don't see much point requiring the module.

    - tye        

Re^3: CSV_XS and UTF8 strings
by Tux (Canon) on Oct 19, 2011 at 06:44 UTC

    So what you want is a new option to disable the need for quotation on characters with code-points > 127?

    Note that the quote_space isn't even tested when writing the fields with the utf-8 characters. It is just tested when a space is encountered inside a field. While scanning a field, there is a flag that is set when quotation is required. When the flag has been set already by whatever other trigger, further tests are skipped. In your example that flag was already triggered by the first "binary" character, so the quote_space is effectively a no-op in your code.

    I'm however not sure that I want to implement such a new feature as it will potentially create invalid CSV. OTOH it will be an option that is only used on writing CSV, which is relatively easy to change.

    The current quote trigger is like:

    if (c < csv->first_safe_char ||
        (c >= 0x7f && c <= 0xa0) ||
        (csv->quote_char  && c == csv->quote_char) ||
        (csv->sep_char    && c == csv->sep_char) ||
        (csv->escape_char && c == csv->escape_char)) {
        /* Binary character */
        break;
        }

    A new flag could make that into something like

    if (c < csv->first_safe_char ||
        (csv->quote_binary && c >= 0x7f && c <= 0xa0) ||
        (csv->quote_char   && c == csv->quote_char) ||
        (csv->sep_char     && c == csv->sep_char) ||
        (csv->escape_char  && c == csv->escape_char)) {
        /* Binary character */
        break;
        }

    Leaving it safe for all ASCII binary. I could do that.
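
    As a rough illustration of how the condition above behaves, here is a minimal standalone C sketch. It is not the module's source: the `csv_opts` struct and the `needs_quote` function are hypothetical names, the sketch assumes the check runs byte by byte over UTF-8 encoded field data, and it assumes `first_safe_char` is 0x21 when spaces should force quoting and 0x20 when they should not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the module's settings. The field names mirror
 * the snippet above, but this is not the real structure. */
typedef struct {
    unsigned char first_safe_char; /* assumed: 0x21 if spaces force quoting, else 0x20 */
    unsigned char quote_char;
    unsigned char sep_char;
    unsigned char escape_char;
    int           quote_binary;    /* the proposed new flag */
} csv_opts;

/* Returns 1 as soon as one byte of the field triggers quoting, 0 if none does.
 * Once a byte has triggered, the remaining bytes are not examined. */
int needs_quote(const csv_opts *csv, const unsigned char *field, size_t len)
{
    assert(csv != NULL && field != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = field[i];
        if (c < csv->first_safe_char ||
            (csv->quote_binary && c >= 0x7f && c <= 0xa0) ||
            (csv->quote_char   && c == csv->quote_char) ||
            (csv->sep_char     && c == csv->sep_char) ||
            (csv->escape_char  && c == csv->escape_char)) {
            return 1;
        }
    }
    return 0;
}
```

    Under these assumptions, the UTF-8 bytes of the euro sign (0xE2 0x82 0xAC) trigger quoting when `quote_binary` is 1, because 0x82 falls inside the 0x7f..0xa0 range, and do not trigger it when `quote_binary` is 0. A field containing a space triggers quoting only when `first_safe_char` is 0x21.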

    update done

    Text-CSV_XS $ cat test.pl
    use strict;
    use warnings;
    binmode STDOUT, ":utf8";
    use Text::CSV_XS;
    my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, eol => "\n" });
    $csv->quote_binary (1); # default
    $csv->print (*STDOUT, [ undef, "", " ", 1, "a b ", "\x{20ac}" ]);
    $csv->quote_binary (0);
    $csv->print (*STDOUT, [ undef, "", " ", 1, "a b ", "\x{20ac}" ]);
    Text-CSV_XS $ perl -Iblib/{lib,arch} test.pl
    ,," ",1,"a b ","€"
    ,," ",1,"a b ",€
    Text-CSV_XS $

    Enjoy, Have FUN! H.Merijn
      No, I am not asking at all for an option to disable the need for quotation on characters with code-points > 127. I expect consistent functionality regardless of the characters used. To illustrate, if I have an input file such as
      this is field 1,this is field 2, this is field 3
      My script will write a new file that is exactly the same as the input file. Now if I change the ASCII letter 'e' to e with acute (U+00E9),
      this is fiéld 1,this is fiéld 2, this is fiéld 3
      guess what, it still works the same as with just ASCII. That is, the output file is exactly the same as the input file: the file created by my script has no double quotes around any field. But if I add one Japanese character to any one of the items, that item will have double quotes around it in the new file. So the inconsistency is even worse than I expected, as "quote_space => 0" does work for some characters above 0x7F but not for all of them. My data file is UTF-8, so the e acute is two bytes whereas the Japanese character is three bytes, but again, I'd like to think that all UTF-8 data is treated the same. In conclusion, I want properly formatted CSV and I expect double quotes around strings when needed, but my testing shows a lot of inconsistency in "quote_space => 0" depending on the type of characters in the string.

      Did some more testing, and it appears that the characters for which "quote_space => 0" works properly are the printable characters in the range 0x00 - 0xFF (Basic Latin and Latin-1 Supplement). Fields with a character at or above 0x0100 (the start of Latin Extended-A) always get double quotes around them, whether needed or not. So "quote_space => 0" stops working as expected with characters starting at 0x0100.

        You obviously do not read my replies, or you do not understand them (at all).

        Install Text::CSV_XS from this archive, and call your constructor as:

        my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, quote_space => 0, quote_binary => 0, });

        and I am sure you will get a long way towards your (wrong) expectation of what CSV should be.

        What you describe as wrong is expected and correct behavior. The fact that it doesn't look like the original is quite something else. Text::CSV_XS and Text::CSV offer a plethora of options and attributes to make it (more) behave as end-users expect or want it to behave, but the default is correct, even if it does not produce exactly what the source happened to be.

        If it still doesn't fit your needs, and my new attribute is still unsatisfactory for your idea of correctness, I suppose you will have to look for handcrafted solutions and not use Text::CSV_XS.


        Enjoy, Have FUN! H.Merijn
