Peculiar Reference To U+00FE In Text::CSV_XS Documentation

Jim has asked for the wisdom of the Perl Monks concerning the following question:

In the documentation of Text::CSV_XS, there's a peculiar reference to what seems like a very special case:

The separation-, escape- [sic], and escape- characters can be any ASCII character in the range from 0x20 (space) to 0x7E (tilde). Characters outside this range may or may not work as expected. … If you use perl-5.8.2 or higher, these three attributes are utf8-decoded, to increase the likelihood of success. This way U+00FE will be allowed as a quote character. [My emphasis.]

Why is this particular Unicode character, LATIN SMALL LETTER THORN, singled out for special mention in the documentation? And why does it state that "[c]haracters outside [the range from 0x20 through 0x7E] may or may not work as expected"? When might they work?

The implication of this explicit mentioning of U+00FE in the documentation is that Text::CSV_XS can be used to parse CSV records in Unicode Concordance DAT files. If this is the case, then I want to learn how to do this. (See my earlier post titled Best Way To Parse Concordance DAT File Using Modern Perl?)

Jim

Re: Peculiar Reference To U+00FE In Text::CSV_XS Documentation
by Anonymous Monk on Dec 10, 2012 at 03:02 UTC

Why is this particular Unicode character, LATIN SMALL LETTER THORN, singled out for special mention in the documentation?

Because of its ordinal value (the number that it is)

When might they work?

:) When they're not on vacation?

:) When the source allows it?

Seriously though, the docs you're quoting say it: multibyte characters are not allowed, and use perl-5.8.2 or higher.

If this is the case, then I want to learn how to do this

What are you waiting for?

examples/csv-check Script to check a CSV file/stream

examples/csvdiff Script to show diff between sorted CSV files

examples/parser-xs.pl Parse CSV stream, be forgiving on bad lines

examples/speed.pl Small benchmark script

Re^2: Peculiar Reference To U+00FE In Text::CSV_XS Documentation

by Jim (Curate) on Dec 10, 2012 at 04:33 UTC

Thanks. I tested parsing a UTF-8 Concordance DAT file using csv-check. It doesn't work.

I don't understand your explanation of why there is a reference to the Unicode character with code point U+00FE in the Text::CSV_XS documentation. Why that character? I suspect only Tux, the maintainer of the module, can explain the mysterious reference to it.

The documentation is ambiguous with regard to whether or not Text::CSV_XS can parse CSV records that use multi-byte metacharacters. It says it "may or may not work as expected," and it also explicitly states that U+00FE, which is a multi-byte character (the two bytes \xC3\xBE in UTF-8), "will be allowed as a quote character." It's this very ambiguity that is the basis of my inquiry here.

Jim

Re^3: Peculiar Reference To U+00FE In Text::CSV_XS Documentation

by Anonymous Monk on Dec 10, 2012 at 05:42 UTC

I don't understand ... the documentation is ambiguous

But did you understand what I said? What number is it?

In the source

#define    byte    unsigned char
typedef struct {
    byte    quote_char;
    byte    escape_char;
    byte    sep_char;

255 is the biggest a byte gets, right?
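To make the arithmetic concrete, here is a minimal sketch in C (the language of the excerpt above). It is not code from Text::CSV_XS; both helper names are invented for this illustration. It shows that U+00FE, taken as the number 254, fits in a single byte-sized slot, while the same character encoded as UTF-8 occupies two bytes, 0xC3 0xBE.

```c
#include <assert.h>
#include <stddef.h>

/* Same meaning as the `byte` alias in the excerpt above. */
typedef unsigned char byte;

/* Hypothetical helper (not part of Text::CSV_XS): returns 1 if the
 * code point can be stored in one byte-sized slot, 0 otherwise. */
int fits_in_byte(unsigned long code_point) {
    return code_point <= 0xFF;
}

/* Hypothetical helper: encode a code point below U+0800 as UTF-8.
 * Writes 1 or 2 bytes to `out` and returns how many were written.
 * Returns 0 for code points this sketch does not handle. */
size_t utf8_encode_below_0x800(unsigned long code_point, byte out[2]) {
    assert(out != NULL);
    if (code_point < 0x80) {
        /* ASCII range: one byte, value unchanged. */
        out[0] = (byte)code_point;
        return 1;
    }
    if (code_point < 0x800) {
        /* Two-byte form: 110xxxxx 10xxxxxx. */
        out[0] = (byte)(0xC0 | (code_point >> 6));
        out[1] = (byte)(0x80 | (code_point & 0x3F));
        return 2;
    }
    return 0;
}
```

Calling `fits_in_byte(0xFE)` returns 1, and `utf8_encode_below_0x800(0xFE, buf)` returns 2 with `buf` holding 0xC3 and 0xBE. So whether the character fits a one-byte field depends on whether you mean the code point or its UTF-8 encoded form.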

Re^4: Peculiar Reference To U+00FE In Text::CSV_XS Documentation

by Jim (Curate) on Dec 10, 2012 at 06:20 UTC

Re^5: Peculiar Reference To U+00FE In Text::CSV_XS Documentation

by Tux (Canon) on Dec 10, 2012 at 07:56 UTC

Some notes below your chosen depth have not been shown here
