in reply to
Re^6: Peculiar Reference To U+00FE In Text::CSV_XS Documentation
in thread Peculiar Reference To U+00FE In Text::CSV_XS Documentation
If U+00AE is just a placeholder for newlines *inside* fields, my proposed solution works fine.
I have been playing with thoughts about BOM handling quite a few times already, but came to the same conclusion time after time: the advantage is not worth the performance penalty, which is huge.
Text::CSV_XS is written for sheer speed, and having to check BOM on every record-start (yes, eventually that is what it turns out to be if one wants to support streams) is not worth it. It is relatively easy to
- Do BOM handling before Text::CSV_XS starts parsing
- Write a wrapper or a super-class that does BOM handling
- Non-ASCII characters for sep/quote/escape
Any of these will imply a speed penalty, even if I would allow it and implement it. That is because the parser is a state machine, which means that the internal structure should change to both allowing multi-byte characters and handling them (1st check on start of each of them, then read-ahead if the next is part of the "character" and so on. I already allow this on eol up to 8 characters, which was a pain in the ass to do safely. I'm not saying it is impossible, but I'm not sure if it is worth development time.
You can still use Text::CSV_XS if you are sure that there are no U_0014 characters inside fields, but I bet you cannot be (binary fields tend to hold exactly what causes trouble).
Enjoy, Have FUN! H.Merijn