
Re^4: CSV_XS and UTF8 strings

by beerman (Novice)
on Oct 19, 2011 at 16:46 UTC

in reply to Re^3: CSV_XS and UTF8 strings
in thread CSV_XS and UTF8 strings

No, I am not asking at all for an option to disable the need for quotation on characters with code-points > 127. I expect consistent functionality regardless of the characters used. As illustrated if I have an input file such as

this is field 1,this is field 2, this is field 3
My script will write a new file that is exactly the same as the input file. Now if I change the ASCII letter 'e' to e with acute (U+00E9),
this is fiéld 1,this is fiéld 2, this is fiéld 3
Guess what, it still works the same as with just ASCII: the output file is exactly the same as the input file, with no double quotes around any field. But if I add one Japanese character to any one of the items, that item will have double quotes around it in the new file. So the inconsistency is even worse than I expected, as the setting quote_space => 0 does work for some characters above 0x7F but not for all of them. My data file is UTF8, so the e acute is two bytes whereas the Japanese character is three bytes, but again, I'd like to think that all UTF8 data is treated the same. In conclusion, I want properly formatted CSV and I expect double quotes around strings when needed, but my testing shows that there is a lot of inconsistency in the behavior of quote_space => 0 depending on the type of characters in the string.

Did some more testing and it appears that the characters for which quote_space => 0 works properly are printable characters in the range 0x00 - 0xFF (Basic Latin and Latin-1 Supplement). Fields with a character at or above 0x0100 (the start of Latin Extended-A) always get double quotes around them, whether needed or not. So quote_space => 0 stops working as expected with characters starting at 0x0100.

Re^5: CSV_XS and UTF8 strings
by Tux (Monsignor) on Oct 19, 2011 at 17:35 UTC

    You obviously do not read my replies, or you do not understand them (at all).

    Install Text::CSV_XS from this archive, and call your constructor as:

    my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, quote_space => 0, quote_binary => 0, });

    and I am sure you get a long way towards your (wrong) expectation of what CSV should be.

    What you describe as wrong is expected and correct behavior. The fact that it doesn't look like the original is quite something else. Text::CSV_XS and Text::CSV offer a plethora of options and attributes to make it (more) behave as end-users expect or want it to behave, but the default is correct, even if it does not produce exactly what the source happened to be.

    If it still doesn't fit your needs, and my new attribute is still unsatisfactory for your idea of correctness, I suppose you will have to look for handcrafted solutions and not use Text::CSV_XS.
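A minimal sketch of how the constructor above might be used for a parse-and-rewrite round trip. It assumes an installed Text::CSV_XS release that already supports the quote_binary attribute; the sample line is adapted from the question (with two Japanese characters added for illustration), and the variable names are only illustrative.

```perl
use strict;
use warnings;
use utf8;                 # this source contains literal non-ASCII characters
use Text::CSV_XS;

# Constructor exactly as suggested in the reply above.
my $csv = Text::CSV_XS->new ({
    binary       => 1,
    auto_diag    => 1,
    quote_space  => 0,    # do not quote a field just because it has a space
    quote_binary => 0,    # do not quote a field just because it has "binary" bytes
});

# Sample input (illustrative): spaces, an e acute, and Japanese characters.
my $input = "this is fiéld 1,this is fiéld 2, this is 日本 3";

# Parse the line into fields, then write the fields back out.
$csv->parse ($input) or die "parse failed";
my @fields = $csv->fields;
$csv->combine (@fields) or die "combine failed";
my $output = $csv->string;

binmode STDOUT, ":encoding(UTF-8)";
print "$output\n";
print $output eq $input ? "round trip is identical\n" : "round trip differs\n";
```

With both attributes switched off, none of these fields should need quotation, so the rewritten line is expected to match the input.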

    Enjoy, Have FUN! H.Merijn
      Tux, thanks for the code. It looks like it will work and I'll try it out tomorrow. I've developed with Unicode in many programming languages and what confuses me is the term binary. I'm not sure if this is unique to Text::CSV_XS or if it is related to all Perl modules. This is a strange term when speaking of Unicode characters. Thus, I looked up what is really meant by binary: according to the Text::CSV_XS documentation, binary is when any byte in the range \x00-\x08,\x10-\x1F,\x7F-\xFF is found, but the range you specified in the trigger is (c >= 0x7f && c <= 0xa0). The documentation goes on to say that "If a string is marked UTF8, binary will be turned on automatically when binary characters other than CR or NL are encountered. Note that a simple string like "\x{00a0}" might still be binary, but not marked UTF8, so setting { binary => 1 } is still a wise option.". I was using the binary => 1 setting along with quote_space => 0, so from a Unicode development standpoint it is strange behavior when the results differ for a simple CSV file with just one row and one field, such as
      This is X test
      If X is a printable character in the range from 0x0000 to 0x00FF, the quote_space => 0 works as expected and the field will not have double quotes in the output but if the X is a printable character > 0x00FF the field has double quotes around it. All Unicode developers expect the behavior of quote_space => 0 to be the same for all printable characters. Unicode people think that a letter is a letter regardless of what writing script (Hangul, Han, Cyrillic, Hebrew, etc.) it comes from. It is even stranger that the setting quote_space => 0 works with characters from Basic & Latin1 but then it starts to act differently once you get into the Unicode block for Extended Latin A.

      Again, thanks for going the extra mile and providing the new constructor.
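The trigger range quoted above, (c >= 0x7f && c <= 0xa0), can be tried against the UTF-8 bytes of a few sample characters. This is only a sketch using the core Encode module: it assumes the check is applied to each byte of the UTF-8 encoded field, and has_trigger_byte is a hypothetical helper, not part of Text::CSV_XS.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of UTF-8 bytes of $char that fall in the
# range 0x7f..0xa0 (the trigger range quoted in this thread).
sub has_trigger_byte {
    my ($char) = @_;
    my @bytes = unpack "C*", encode ("UTF-8", $char);
    return scalar grep { $_ >= 0x7f && $_ <= 0xa0 } @bytes;
}

# 'e', e acute, first code point of Latin Extended-A, a Japanese character.
for my $cp (0x65, 0xE9, 0x100, 0x65E5) {
    my $char  = chr $cp;
    my @bytes = unpack "C*", encode ("UTF-8", $char);
    printf "U+%04X  bytes: %-9s  trigger byte: %s\n",
        $cp,
        join (" ", map { sprintf "%02X", $_ } @bytes),
        has_trigger_byte ($char) ? "yes" : "no";
}
```

Under that assumption, whether a field gets quoted depends on the individual encoded bytes, not on the Unicode block or the number of bytes per character.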

        You should pay attention to the two sides of the coin:

        • Parsing: Text::CSV_XS is created to do safe, reliable, and fast parsing of CSV data. The constructor supports many attributes to control the parsing of CSV data that falls outside the small default definition of what is allowed. The most commonly used attribute will be sep_char, to allow for all the different non-standard separation characters used by M$-Excel, which uses the "list separation character" from the locale setting instead of the default comma when exporting to CSV. "The string is marked UTF8" only applies to this side of the coin: when reading CSV.
        • Writing: many of the attributes only apply to parsing, some apply only to writing. The quote_space is one of them and has no influence whatsoever on parsing data.

        Text::CSV_XS parses and writes bytes, not characters or letters. The "upgrade" to Unicode/UTF-8 only applies at the moment a field has been correctly parsed and "binary" data was detected inside that field. When dealing with Unicode (in whatever encoding), you can be absolutely sure that the text, both in parsing and in writing, will contain "binary" bytes, so you should always set that attribute. The fact that it is not the default stems from the distant past. Setting it to a sane default of 1 could possibly break backward compatibility.

        In writing, both whitespace and "binary" bytes will trigger quotation. Please don't confuse quote_space (controlling quotation on whitespace) with quote_binary (controlling quotation on binary bytes, the new attribute). What you perceive as "strange" is just a misunderstanding of the quote_space attribute.
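The difference between the two writer attributes can be sketched as follows. This assumes a Text::CSV_XS release with quote_binary, and that both attributes default to 1; csv_line is a hypothetical helper and the two sample fields are only illustrative.

```perl
use strict;
use warnings;
use utf8;                 # this source contains literal non-ASCII characters
use Text::CSV_XS;

# Hypothetical helper: write one fixed row using the given writer attributes.
sub csv_line {
    my (%attr) = @_;
    my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, %attr });
    $csv->combine ("a b", "日本") or die "combine failed";
    return $csv->string;
}

# Only whitespace quotation off: "binary" bytes still trigger quotation.
my $space_off  = csv_line (quote_space  => 0);
# Only binary quotation off: whitespace still triggers quotation.
my $binary_off = csv_line (quote_binary => 0);
# Both off: neither whitespace nor "binary" bytes trigger quotation.
my $both_off   = csv_line (quote_space  => 0, quote_binary => 0);

binmode STDOUT, ":encoding(UTF-8)";
print "$_\n" for $space_off, $binary_off, $both_off;
```

Each attribute switches off one trigger independently, which would explain why quote_space => 0 on its own leaves the field with Japanese characters quoted.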

        Enjoy, Have FUN! H.Merijn
