http://www.perlmonks.org?node_id=1008017

Jim has asked for the wisdom of the Perl Monks concerning the following question:

In the documentation of Text::CSV_XS, there's a peculiar reference to what seems like a very special case:

The separation-, escape- [sic], and escape- characters can be any ASCII character in the range from 0x20 (space) to 0x7E (tilde). Characters outside this range may or may not work as expected. … If you use perl-5.8.2 or higher, these three attributes are utf8-decoded, to increase the likelihood of success. This way U+00FE will be allowed as a quote character. [My emphasis.]

Why is this particular Unicode character, LATIN SMALL LETTER THORN, singled out for special mention in the documentation? And why does it state that "[c]haracters outside [the range from 0x20 through 0x7E] may or may not work as expected"? When might they work?

The implication of this explicit mention of U+00FE in the documentation is that Text::CSV_XS can be used to parse CSV records in Unicode Concordance DAT files. If this is the case, then I want to learn how to do this. (See my earlier post titled Best Way To Parse Concordance DAT File Using Modern Perl?)

Jim


Re: Peculiar Reference To U+00FE In Text::CSV_XS Documentation
by Anonymous Monk on Dec 10, 2012 at 03:02 UTC

    Why is this particular Unicode character, LATIN SMALL LETTER THORN, singled out for special mention in the documentation?

    Because of its ordinal value (the number that it is)

    When might they work?

    :) When they're not on vacation?

    :) When the source allows it?

    Seriously though, the docs you're quoting say it: multibyte characters are not allowed, and use perl-5.8.2 or higher.

     

    If this is the case, then I want to learn how to do this

    What are you waiting for?
    examples/csv-check Script to check a CSV file/stream
    examples/csvdiff Script to show diff between sorted CSV files
    examples/parser-xs.pl Parse CSV stream, be forgiving on bad lines
    examples/speed.pl Small benchmark script

      Thanks. I tested parsing a UTF-8 Concordance DAT file using csv-check. It doesn't work.

      I don't understand your explanation of why there is a reference to the Unicode character with code point U+00FE in the Text::CSV_XS documentation. Why that character? I suspect only Tux, the maintainer of the module, can explain the mysterious reference to it.

      The documentation is ambiguous with regard to whether or not Text::CSV_XS can parse CSV records that use multi-byte metacharacters. It says it "may or may not work as expected," and it also explicitly states that U+00FE, which is a multi-byte character (the two bytes \xC3\xBE in UTF-8), "will be allowed as a quote character." It's this very ambiguity that is the basis of my inquiry here.

      Jim

        I don't understand ... the documentation is ambiguous

        But did you understand what I said? What number is it?

        In the source

        #define byte unsigned char
        typedef struct {
            byte quote_char;
            byte escape_char;
            byte sep_char;

        255 is the biggest a byte gets, right?

        :D