Re^6: CSV_XS and UTF8 strings

Tux, thanks for the code. It looks like it will work, and I'll try it out tomorrow. I've developed with Unicode in many programming languages, and what confuses me is the term "binary". I'm not sure whether this is unique to Text::CSV_XS or whether it relates to all Perl modules. It is a strange term when speaking of Unicode characters. So I looked up what is really meant by binary: according to the Text::CSV_XS documentation, binary is when any byte in the range \x00-\x08,\x10-\x1F,\x7F-\xFF is found, but the range you specified in the trigger is (c >= 0x7f && c <= 0xa0). The documentation goes on to say that "If a string is marked UTF8, binary will be turned on automatically when binary characters other than CR or NL are encountered. Note that a simple string like "\x{00a0}" might still be binary, but not marked UTF8, so setting { binary => 1 } is still a wise option." I was using the binary => 1 setting along with quote_space => 0, so from a Unicode development standpoint it is strange that the results differ for a simple CSV file with just one row and one field, such as

This is X test

If X is a printable character in the range from 0x0000 to 0x00FF, quote_space => 0 works as expected and the field will not have double quotes in the output, but if X is a printable character > 0x00FF, the field has double quotes around it. All Unicode developers expect the behavior of quote_space => 0 to be the same for all printable characters. Unicode people think that a letter is a letter regardless of which writing script (Hangul, Han, Cyrillic, Hebrew, etc.) it comes from. It is even stranger that quote_space => 0 works with characters from the Basic Latin and Latin-1 blocks but starts to act differently once you get into the Latin Extended-A block.

Again, thanks for going the extra mile and providing the new constructor.


Re^7: CSV_XS and UTF8 strings
by Tux (Canon) on Oct 20, 2011 at 06:58 UTC

You should pay attention to the two sides of the coin:

Parsing: Text::CSV_XS was created to do safe, reliable, and fast parsing of CSV data. The constructor supports many attributes to control the parsing of CSV data that falls outside the small default definition of what is allowed. The most commonly used attribute will be sep_char, to allow for all the different non-standard separation characters used by M$-Excel, which uses the "list separation character" from the locale setting instead of the default comma when exporting to CSV. "The string is marked UTF8" only applies to this side of the coin: when reading CSV.
Writing: many of the attributes apply only to parsing, and some apply only to writing. quote_space is one of the latter and has no influence whatsoever on parsing data.
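
To illustrate the parsing side, here is a minimal sketch of a parser that uses sep_char for semicolon-separated data. The helper name parse_semicolon_line and the sample line are made up for this example; new, parse, fields and error_diag are regular Text::CSV_XS methods.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical helper for illustration: parse one line of
# semicolon-separated CSV and return its fields as an array ref.
sub parse_semicolon_line {
    my ($line) = @_;

    # sep_char only changes which separator is recognised;
    # binary => 1 is set as recommended in this thread.
    my $csv = Text::CSV_XS->new ({ binary => 1, sep_char => ";" })
        or die "Cannot create parser: " . Text::CSV_XS->error_diag;

    $csv->parse ($line)
        or die "Parse failed: " . $csv->error_diag;

    return [ $csv->fields ];
}

# A quoted field may contain the separator itself.
my $fields = parse_semicolon_line ('1;"two; with separator";3');
print join ("|", @$fields), "\n";
```

Attributes such as quote_space would be accepted by the same constructor but would make no difference to what parse returns.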

Text::CSV_XS parses and writes bytes, not characters or letters. The "upgrade" to Unicode/UTF-8 only happens once a field has been parsed correctly and "binary" bytes have been detected inside that field. When dealing with Unicode (in whatever encoding), you can be absolutely sure that the text, both in parsing and in writing, will contain "binary" bytes, so you should always set that attribute. The fact that it is not the default stems from the distant past. Setting it to a sane default of 1 could possibly break backward compatibility.
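
To make "bytes, not characters" concrete, the sketch below looks at the UTF-8 bytes of a few characters and counts how many fall in the range (c >= 0x7f && c <= 0xa0) quoted earlier in this thread. It is a toy model, not the module's code: count_trigger_bytes is a hypothetical helper, and it assumes the check is applied to each byte of the UTF-8 encoding.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper, NOT part of Text::CSV_XS: count the bytes in the
# UTF-8 encoding of a string that fall in the range 0x7f .. 0xa0.
sub count_trigger_bytes {
    my ($str) = @_;
    my @bytes = unpack "C*", encode ("UTF-8", $str);
    return scalar grep { $_ >= 0x7f && $_ <= 0xa0 } @bytes;
}

for my $cp (0x41, 0xE9, 0x100) {
    my $chr   = chr $cp;
    my @bytes = unpack "C*", encode ("UTF-8", $chr);
    printf "U+%04X  bytes: %-6s  bytes in range: %d\n",
        $cp,
        join (" ", map { sprintf "%02X", $_ } @bytes),
        count_trigger_bytes ($chr);
}
```

One letter can become several bytes, and whether any of those bytes lands in a given range depends on the encoding, not on whether the letter is printable.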

In writing, both whitespace and "binary" bytes will trigger quotation. Please don't confuse quote_space (controlling quotation on whitespace) with quote_binary (controlling quotation on binary bytes, the new attribute). What you perceive as "strange" is just a misunderstanding of the quote_space attribute.
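
The two triggers can be tried side by side with a small sketch like the one below. It assumes a Text::CSV_XS release that already knows quote_binary (an older release would reject the unknown attribute in the constructor). combine_with is a hypothetical helper and the sample fields are made up.

```perl
use strict;
use warnings;
use Text::CSV_XS;

binmode STDOUT, ":encoding(UTF-8)";

# Hypothetical helper: combine the given fields into one CSV line
# using binary => 1 plus whatever extra attributes are passed in.
sub combine_with {
    my ($attr, @fields) = @_;
    my $csv = Text::CSV_XS->new ({ binary => 1, %$attr })
        or die "Cannot create writer: " . Text::CSV_XS->error_diag;
    $csv->combine (@fields)
        or die "Combine failed on: " . $csv->error_input;
    return $csv->string;
}

my $ascii = "This is a test";
my $wide  = "This is \x{0100} test";    # U+0100, outside Latin-1

# Whitespace trigger only: default versus quote_space => 0.
print combine_with ({},                   $ascii), "\n";
print combine_with ({ quote_space => 0 }, $ascii), "\n";

# Same field shape with a non-ASCII letter: compare the output with
# only quote_space disabled and with quote_binary disabled as well.
print combine_with ({ quote_space => 0 },                    $wide), "\n";
print combine_with ({ quote_space => 0, quote_binary => 0 }, $wide), "\n";
```

Comparing the last two lines of output shows which of the two attributes is responsible for the quotes around the non-ASCII field.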

Enjoy, Have FUN! H.Merijn
