http://www.perlmonks.org?node_id=1008144


in reply to Re^6: Peculiar Reference To U+00FE In Text::CSV_XS Documentation
in thread Peculiar Reference To U+00FE In Text::CSV_XS Documentation

  1. Newlines

    If U+00AE is just a placeholder for newlines *inside* fields, my proposed solution works fine.

  2. BOM

    I have been playing with thoughts about BOM handling quite a few times already, but came to the same conclusion time after time: the advantage is not worth the performance penalty, which is huge.

    Text::CSV_XS is written for sheer speed, and having to check for a BOM at every record start (yes, eventually that is what it turns out to be if one wants to support streams) is not worth it. It is relatively easy to either:

    • Do BOM handling before Text::CSV_XS starts parsing
    • Write a wrapper or a super-class that does BOM handling
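The first option can be sketched with core Perl only. This is a minimal illustration, not part of Text::CSV_XS: `skip_bom` and `@BOMS` are hypothetical names, and the handle is assumed to be seekable and opened in raw mode.

```perl
use strict;
use warnings;

# Known BOM byte sequences, longest first so that UTF-32LE is not
# mistaken for UTF-16LE (both start with FF FE).
my @BOMS = (
    [ "\x00\x00\xFE\xFF", "UTF-32BE" ],
    [ "\xFF\xFE\x00\x00", "UTF-32LE" ],
    [ "\xEF\xBB\xBF",     "UTF-8"    ],
    [ "\xFE\xFF",         "UTF-16BE" ],
    [ "\xFF\xFE",         "UTF-16LE" ],
);

# Hypothetical helper: consume a leading BOM, if any, and return the
# encoding name it implies. Returns undef and leaves the handle at its
# original position when no BOM is found.
sub skip_bom {
    my ($fh) = @_;
    my $start = tell($fh);
    my $got   = read($fh, my $head, 4);
    defined $got or die "read failed: $!";
    for my $bom (@BOMS) {
        my ($bytes, $enc) = @$bom;
        if (substr($head, 0, length($bytes)) eq $bytes) {
            # Reposition just past the BOM so the parser never sees it.
            seek($fh, $start + length($bytes), 0) or die "seek failed: $!";
            return $enc;
        }
    }
    seek($fh, $start, 0) or die "seek failed: $!";
    return;
}
```

After the call, the caller could apply the matching layer with `binmode $fh, ":encoding($enc)"` and then hand the handle to the parser as usual. The check runs once per file, so it costs nothing per record, but it does not cover non-seekable streams.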

  3. Non-ASCII characters for sep/quote/escape

    Any of these would incur a speed penalty, even if I were to allow and implement it. The parser is a state machine, so its internal structure would have to change both to allow multi-byte characters and to handle them (first check at the start of each of them, then read ahead to see whether the next byte is part of the "character", and so on). I already allow this for eol, up to 8 characters, which was a pain in the ass to do safely. I'm not saying it is impossible, but I'm not sure it is worth the development time.
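To make the "check the first byte, then read ahead" cost concrete, here is a rough pure-Perl sketch. It is not how Text::CSV_XS is implemented (the real parser is C and also tracks quoting and escaping); `split_on_multibyte_sep` is a hypothetical name, and the input is assumed to be a byte string.

```perl
use strict;
use warnings;

# Hypothetical illustration only: split one record (a byte string) on a
# separator that may be several bytes long, scanning byte by byte.
# No quote or escape handling is attempted.
sub split_on_multibyte_sep {
    my ($record, $sep) = @_;
    length($sep) or die "separator must not be empty";
    my $first   = substr($sep, 0, 1);
    my $sep_len = length($sep);
    my @fields  = ('');
    my $pos     = 0;
    my $len     = length($record);
    while ($pos < $len) {
        my $byte = substr($record, $pos, 1);
        # Step 1: cheap check against the first byte of the separator.
        # Step 2: only on a match, read ahead to confirm the rest.
        if ($byte eq $first && substr($record, $pos, $sep_len) eq $sep) {
            push @fields, '';
            $pos += $sep_len;
            next;
        }
        $fields[-1] .= $byte;
        $pos++;
    }
    return @fields;
}
```

With a single-byte separator the read-ahead step disappears and one comparison per byte suffices. With a multi-byte one, every byte that merely looks like the start of the separator triggers the extra comparison, and a streaming parser would also have to cope with the separator being split across buffer boundaries.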

    You can still use Text::CSV_XS if you are sure that there are no U+0014 characters inside fields, but I bet you cannot be (binary fields tend to hold exactly what causes trouble).


Enjoy, Have FUN! H.Merijn