Thanks for your reply.
I gather that the "CRLF" pairs that serve to terminate records are not enclosed in any kind of quotes, whereas data fields that include "CRLF" as content must be quoted (using the U+00FE string delimiter).
Yes. Concordance DAT records are ordinary, well-formed CSV records. The <CR><LF> pairs that serve to terminate the records are outside any quoted string. Literal occurrences of <CR>, <LF> and <CR><LF> pairs are inside quoted strings.
The only thing special about the CSV records in Concordance DAT files is the peculiar metacharacters.
Apart from that, I'm not sure I understand what you're saying about the BOM (U+FEFF)... What in particular needs to be done to "handle it properly"? (In UTF-8 data, it's sufficient to just ignore/delete it without further ado, or perhaps include it at the beginning of one's output, if one expects that a downstream process will be looking for it.)
The BOM must be handled as specified in the Unicode Standard. When the UTF-8 data stream is read, the BOM must be treated as a special character (an encoding signature) and not as part of the text. In the specific case of a CSV file, it must not be wrongly treated as an unquoted string that forms the leftmost field of the first record.
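To make that concrete, here is a minimal sketch in core Perl of one way to keep the BOM out of the first field: decode the bytes, then drop a single leading U+FEFF before the text reaches any CSV parser. The helper name decode_without_bom and the sample record are my own inventions for illustration, not anything taken from Concordance or from a CSV module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of any module): decode raw UTF-8 bytes
# and remove one leading BOM (U+FEFF) if it is present.
sub decode_without_bom {
    my ($bytes) = @_;
    # Die on malformed UTF-8 instead of silently substituting characters.
    my $text = decode('UTF-8', $bytes, Encode::FB_CROAK);
    $text =~ s/\A\x{FEFF}//;    # strip the signature only at the very start
    return $text;
}

# Made-up sample record: the UTF-8 BOM (EF BB BF), then one field
# delimited by U+00FE (C3 BE in UTF-8), then the CRLF record terminator.
my $raw = "\xEF\xBB\xBF" . "\xC3\xBE" . "DOCID" . "\xC3\xBE" . "\r\n";

my $record = decode_without_bom($raw);

# The first character is now the U+00FE string delimiter, not the BOM.
printf "first character: U+%04X\n", ord(substr($record, 0, 1));
```

With the BOM removed at this stage, the first record starts with the string delimiter, so a parser has no reason to see an unquoted leading field.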
UPDATE: This Perl script…
…fails with this error message:
# CSV_XS ERROR: 2034 - EIF - Loose unescaped quote @ pos 4