
Re^2: Peculiar Reference To U+00FE In Text::CSV_XS Documentation

by Jim (Curate)
on Dec 10, 2012 at 04:33 UTC


in reply to Re: Peculiar Reference To U+00FE In Text::CSV_XS Documentation
in thread Peculiar Reference To U+00FE In Text::CSV_XS Documentation

Thanks. I tested parsing a UTF-8 Concordance DAT file using csv-check. It doesn't work.

I don't understand your explanation of why there is a reference to the Unicode character with code point U+00FE in the Text::CSV_XS documentation. Why that character? I suspect only Tux, the maintainer of the module, can explain the mysterious reference to it.

The documentation is ambiguous with regard to whether or not Text::CSV_XS can parse CSV records that use multi-byte metacharacters. It says it "may or may not work as expected," and it also explicitly states that U+00FE, which is a multi-byte character (the two bytes \xC3\xBE in UTF-8), "will be allowed as a quote character." It's this very ambiguity that is the basis of my inquiry here.

Jim


Re^3: Peculiar Reference To U+00FE In Text::CSV_XS Documentation
by Anonymous Monk on Dec 10, 2012 at 05:42 UTC

    I don't understand ... the documentation is ambiguous

    But did you understand what I said? What number is it?

    In the source

    #define byte unsigned char
    typedef struct {
        byte quote_char;
        byte escape_char;
        byte sep_char;

    255 is the biggest a byte gets, right?

    :D

      \xFE is 254, not 255. It's the second biggest a byte gets. \xFF is the biggest byte.

      In any case, the character with code point U+00FE isn't a single-byte character in any Unicode character encoding scheme.
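The claim above can be checked with the core Encode module. This is only a small sketch: it encodes U+00FE in three Unicode encoding schemes and records how many bytes each one produces. The variable names are illustrative and not taken from Text::CSV_XS.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00FE LATIN SMALL LETTER THORN as a one-character Perl string.
my $thorn = "\x{FE}";

# Encode it in three Unicode encoding schemes and record the byte counts.
my %bytes;
for my $scheme (qw(UTF-8 UTF-16BE UTF-32BE)) {
    my $octets = encode($scheme, $thorn);
    $bytes{$scheme} = length $octets;
    printf "%-8s %d byte(s): %s\n", $scheme, length $octets,
        join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;
}
```

None of the three schemes yields a single byte, which is why a one-byte quote_char slot cannot hold the encoded form of this character.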

      I suspect the peculiar reference to U+00FE in the documentation has something to do with the Concordance DAT file. I hope it does, because it would then imply that Text::CSV_XS can be used to parse Concordance DAT records, which is precisely what I need to do.

      Jim

        I do not recall exactly why the docs were written the way they were, but as I am unaware of the DAT format, I cannot verify that U+00FE was referred to because of this format.

        Knowing my own way of thinking, I most likely avoided 0xFF because, read as a signed char, it is -1, which could be used as a guard marker or something similar (currently it isn't). I might have used 0xFE because it is the next-highest byte.
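To illustrate the signed/unsigned point, here is a minimal sketch in plain Perl that reads the same two octets both ways with unpack. It says nothing about how Text::CSV_XS stores its attributes internally; the hash name is purely illustrative.

```perl
use strict;
use warnings;

# Interpret the same octets as unsigned ("C") and signed ("c") 8-bit values.
my %as;
for my $octet ("\xFE", "\xFF") {
    my $hex = sprintf '0x%02X', ord $octet;
    $as{$hex} = {
        unsigned => unpack('C', $octet),
        signed   => unpack('c', $octet),
    };
    printf "%s unsigned=%d signed=%d\n",
        $hex, @{ $as{$hex} }{qw(unsigned signed)};
}
```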

        I have just read your post, and the only thing I see that Text::CSV_XS cannot handle is the optional line endings. The optional <CR> before the <NL> is dealt with automatically (just do not specify eol), but you cannot have an extra U+00AE that also ends records. If, on the other hand, 0xAE is just a placeholder for embedded newlines, that is easy to do (see below).

        Another point of care is that Text::CSV_XS does not deal with BOMs, so you'll need File::BOM or other means to deal with that.
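As one possible "other means", the sketch below skips a UTF-8 BOM by hand using only core Perl, then switches the handle to a decoding layer. The helper name skip_utf8_bom and the sample record are hypothetical, and an in-memory handle stands in for a real DAT file. It only covers the UTF-8 BOM, not the UTF-16 or UTF-32 ones that File::BOM also recognizes.

```perl
use strict;
use warnings;

# Hypothetical helper: consume a leading UTF-8 BOM if present,
# otherwise rewind so no data is lost. Returns 1 if a BOM was skipped.
sub skip_utf8_bom {
    my ($fh) = @_;
    my $n = read $fh, my $head, 3;
    return 1 if defined $n && $n == 3 && $head eq "\xEF\xBB\xBF";
    seek $fh, 0, 0 or die "Cannot rewind: $!";
    return 0;
}

# Hypothetical sample: BOM, then one record quoted with U+00FE
# and separated with U+0014, all as raw UTF-8 octets.
my $octets = "\xEF\xBB\xBF"
           . "\xC3\xBEDOCID\xC3\xBE\x14\xC3\xBETITLE\xC3\xBE\n";

open my $fh, '<:raw', \$octets
    or die "Cannot open in-memory handle: $!";
my $had_bom = skip_utf8_bom($fh);

# From here on the handle yields decoded characters.
binmode $fh, ':encoding(UTF-8)';
my $line = <$fh>;
close $fh;
```

With a real file, the handle would be passed on to the parser after the binmode call instead of being read directly.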

        my $csv = Text::CSV_XS->new ({
            sep_char    => "\x{14}",
            quote_char  => "\x{fe}",
            escape_char => undef,
            binary      => 1,
            auto_diag   => 1,
            });
        while (my $row = $csv->getline ($fh)) {
            tr/\x{ae}/\n/ for @$row;
            # continue as usual
            }

        If that doesn't work, I'd like to see some data.

        Note that U+00FE encoded as UTF-8 is 0xC3 0xBE, which is two bytes, and a two-byte sequence cannot be used as a sep_char in Text::CSV_XS, which parses the data as bytes, so the stream has to be properly decoded before parsing.
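A small sketch of why decoding matters, using only the core Encode module. The record is a made-up sample; the point is that the octet 0xFE never appears in the raw UTF-8 data, while the character U+00FE appears four times once the data is decoded.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical Concordance-style record as raw UTF-8 octets:
# two fields quoted with U+00FE and separated with U+0014.
my $octets = "\xC3\xBEDOCID\xC3\xBE\x14\xC3\xBECaf\xC3\xA9\xC3\xBE";

# Before decoding, the single octet 0xFE does not occur at all.
my $raw_hits = () = $octets =~ /\xFE/g;

# After decoding, each quote is the single character U+00FE.
my $chars        = decode('UTF-8', $octets);
my $decoded_hits = () = $chars =~ /\x{FE}/g;

printf "raw: %d, decoded: %d\n", $raw_hits, $decoded_hits;
```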


        Enjoy, Have FUN! H.Merijn
