PerlMonks  

Re^4: Peculiar Reference To U+00FE In Text::CSV_XS Documentation

by Jim (Curate)
on Dec 10, 2012 at 06:20 UTC [id://1008053]


in reply to Re^3: Peculiar Reference To U+00FE In Text::CSV_XS Documentation
in thread Peculiar Reference To U+00FE In Text::CSV_XS Documentation

\xFE is 254, not 255. It's the second-highest value a byte can hold; \xFF (255) is the highest.

In any case, the character with code point U+00FE isn't a single-byte character in any Unicode character encoding scheme.

I suspect the peculiar reference to U+00FE in the documentation has something to do with the Concordance DAT file. I hope it does, because it would then imply that Text::CSV_XS can be used to parse Concordance DAT records, which is precisely what I need to do.

Jim


Replies are listed 'Best First'.
Re^5: Peculiar Reference To U+00FE In Text::CSV_XS Documentation
by Tux (Canon) on Dec 10, 2012 at 07:56 UTC

    I do not recall exactly why the docs were written the way they were, but as I am not familiar with the DAT format, I cannot verify whether U+00FE was referred to because of that format.

    Knowing my own way of thinking, it most likely is not 0xFF, as that would be -1, which could be used as a guard marker or something like that (currently it is not). I might have used 0xFE because it is the next-highest byte.

    I have just read your post, and the only requirement I see that Text::CSV_XS cannot meet is optional line endings. The optional <CR> before the <NL> is dealt with automatically (just do not specify eol), but you cannot have an extra U+00AE that also ends records. If, on the other hand, 0xAE is just a placeholder for embedded newlines, that is easy to handle (see below).

    Another point of care is that Text::CSV_XS does not deal with BOMs, so you'll need File::BOM or other means to deal with that.

    my $csv = Text::CSV_XS->new ({
        sep_char    => "\x{14}",
        quote_char  => "\x{fe}",
        escape_char => undef,
        binary      => 1,
        auto_diag   => 1,
        });
    while (my $row = $csv->getline ($fh)) {
        tr/\x{ae}/\n/ for @$row;
        # continue as usual
        }

    If that does not work, I'd like to see some sample data.

    Note that U+00FE encodes in UTF-8 as the two bytes 0xC3 0xBE, and a two-byte sequence cannot be used as sep_char in Text::CSV_XS, which parses the data as bytes; the stream therefore has to be properly decoded before parsing.
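    A minimal sketch of what "properly decoded before parsing" means in practice: put an :encoding(UTF-8) layer on the handle so U+00FE arrives as a single character, then reuse Tux's settings. The sample record and field values here are illustrative, not from the thread; an in-memory handle stands in for a real file.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Text::CSV_XS;

# One sample record, UTF-8 encoded as it would sit on disk:
# {th}field one{th} \x14 {th}line 1 \xAE line 2{th} CRLF
my $bytes = encode "UTF-8",
    "\x{fe}field one\x{fe}\x{14}\x{fe}line 1\x{ae}line 2\x{fe}\r\n";

# The :encoding layer decodes 0xC3 0xBE back into the single
# character U+00FE before Text::CSV_XS ever sees it.
open my $fh, "<:encoding(UTF-8)", \$bytes or die $!;

my $csv = Text::CSV_XS->new ({
    sep_char    => "\x{14}",
    quote_char  => "\x{fe}",
    escape_char => undef,
    binary      => 1,
    auto_diag   => 1,
    });

my @rows;
while (my $row = $csv->getline ($fh)) {
    tr/\x{ae}/\n/ for @$row;    # restore embedded newlines
    push @rows, $row;
}
print scalar @rows, " record(s) parsed\n";
```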


    Enjoy, Have FUN! H.Merijn

      Thank you, Tux, for your reply.

      If, on the other hand, 0xAE is just a placeholder for embedded newlines, that is easy to handle (see below).

      Yes, U+00AE (®, REGISTERED SIGN, 0xC2 0xAE in UTF-8) is used as a placeholder for literal newlines in quoted strings. The CSV records in Concordance DAT files are ordinary ones with standard EOL characters:  <CR><LF> pairs.

      Another point of care is that Text::CSV_XS does not deal with BOMs, so you'll need File::BOM or other means to deal with that.

      This would be a nice feature to add to Text::CSV_XS:  proper handling of Unicode byte order marks in UTF-8, UTF-16 and UTF-32 CSV files.

      Note that U+00FE encodes in UTF-8 as the two bytes 0xC3 0xBE, and a two-byte sequence cannot be used as sep_char in Text::CSV_XS, which parses the data as bytes; the stream therefore has to be properly decoded before parsing.

      This settles it. It's not the answer I'd hoped for, but I'm glad to know now with certainty that Text::CSV_XS cannot parse a UTF-8 Concordance DAT file. I'll stop trying hopelessly to make it work. ;-)

      How difficult would it be to enhance Text::CSV_XS to handle metacharacters in Unicode CSV files that are outside the Basic Latin block (i.e., not ASCII characters)? The Concordance DAT file is a de facto standard format for data interchange in the litigation support and e-discovery industry. As I've explained, the only thing special about it is the unusual and unfortunate characters it uses for metacharacters: U+0014, which is a control code; U+00FE, which is a word-constituent character; and U+00AE, which is a common character in ordinary text.
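      In the meantime, given how constrained the format is (every field wrapped in U+00FE, fields joined by U+00FE U+0014 U+00FE, records ending in CRLF), a hand-rolled fallback is possible. The sketch below is illustrative, not Text::CSV_XS; the helper name is made up, and it assumes lines are already decoded and that fields never contain a literal U+00FE.

```perl
use strict;
use warnings;

# Hypothetical fallback: parse one decoded Concordance DAT record.
sub parse_dat_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;      # strip the CRLF record terminator
    $line =~ s/\A\x{fe}//;     # opening quote of the first field
    $line =~ s/\x{fe}\z//;     # closing quote of the last field
    # Fields are joined by U+00FE U+0014 U+00FE; -1 keeps empty
    # trailing fields.
    my @fields = split /\x{fe}\x{14}\x{fe}/, $line, -1;
    tr/\x{ae}/\n/ for @fields; # restore embedded newlines
    return \@fields;
}
```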

      Jim

        1. Newlines

          If U+00AE is just a placeholder for newlines *inside* fields, my proposed solution works fine.

        2. BOM

          I have toyed with the idea of BOM handling quite a few times already, but came to the same conclusion every time: the advantage is not worth the performance penalty, which is huge.

          Text::CSV_XS is written for sheer speed, and having to check for a BOM at every record start (yes, that is what it eventually turns out to be if one wants to support streams) is not worth it. It is relatively easy to

          • Do BOM handling before Text::CSV_XS starts parsing
          • Write a wrapper or a super-class that does BOM handling
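          A hedged sketch of the first option: strip a UTF-8 BOM from the handle before Text::CSV_XS ever sees the stream. The helper name is illustrative; File::BOM's open_bom does this more generally, covering UTF-16 and UTF-32 as well.

```perl
use strict;
use warnings;

# Open a UTF-8 text source and skip a leading BOM, if any.
# $source is whatever three-arg open accepts (file name or
# scalar ref).
sub open_text_sans_bom {
    my ($source) = @_;
    open my $fh, "<:encoding(UTF-8)", $source
        or die "cannot open: $!";
    my $first = getc $fh;           # peek at the first character
    # No BOM (or empty stream): rewind so nothing is lost.
    seek $fh, 0, 0
        if !defined $first || $first ne "\x{feff}";
    return $fh;                     # positioned just past any BOM
}
```

          Do this once at open time, then hand the returned handle to $csv->getline as usual; no per-record check needed.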

        3. Non-ASCII characters for sep/quote/escape

          Any of these would imply a speed penalty, even if I were to allow and implement it. That is because the parser is a state machine, so the internal structure would have to change to both allow multi-byte characters and handle them (first check for the start of each of them, then read ahead to see whether the next byte is part of the "character", and so on). I already allow this for eol, up to 8 characters, which was a pain in the ass to do safely. I'm not saying it is impossible, but I'm not sure it is worth the development time.

          You can still use Text::CSV_XS if you are sure that there are no U+0014 characters inside fields, but I bet you cannot be sure (binary fields tend to hold exactly what causes trouble).


        Enjoy, Have FUN! H.Merijn
