Finding CR-LFs within quoted CSV fields
by Wally Hartshorn (Friar)
on Jun 11, 2004 at 16:07 UTC
Wally Hartshorn has asked for the wisdom of the Perl Monks concerning the following question:
My apologies if this is an FAQ, but I've been unable to find an answer.
I have a comma-separated values (CSV) file exported from a FoxPro database. The text fields within the CSV file are enclosed within double-quotes.
Unfortunately, some of the fields contain embedded CR-LF characters. The DBD::CSV module interprets those CR-LF characters as end-of-record markers. I haven't found any way to tell DBD::CSV that a CR-LF pair within a quoted field is not an end-of-record marker, so I'm using a regex to convert the CR-LF pairs into HTML <br> tags. (The text will be displayed in a browser, so that's what I want anyway.)
The code I've developed to do this is ugly and not bulletproof. It assumes that a quote, followed by a CR-LF, followed by a quote is always the closing quote of the last field of one record, then an end-of-record marker, then the opening quote of the first field of the next record. This is not always true.
The correct way to do this would be to loop through the file, counting opening quotes and closing quotes, replacing any CR-LFs within an opening quote/closing quote pair with HTML <br> tags.
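The loop described above could be sketched roughly as follows. This is only a minimal sketch under some assumptions that are not confirmed by the export itself: the whole file fits in memory as one string, quotes inside a field are escaped by doubling them (""), and the name crlf_to_br is made up for illustration. Because a doubled quote toggles the state twice, tracking a single in-quotes flag is enough, and no separate counts of opening and closing quotes are needed.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): replace CR-LF pairs
# that occur inside double-quoted fields with <br>, leaving record-ending
# CR-LF pairs alone.
# Assumption: embedded quotes are escaped by doubling (""), so toggling a
# flag on every quote character still yields the correct state.
sub crlf_to_br {
    my ($text) = @_;
    my $in_quotes = 0;
    my $out       = '';

    # Walk the string token by token: a quote, a CR-LF pair,
    # a run of ordinary characters, or a lone CR.
    while ($text =~ /\G(?:(")|(\r\n)|([^"\r]+|\r))/gc) {
        if (defined $1) {
            $in_quotes = !$in_quotes;    # entering or leaving a quoted field
            $out .= '"';
        }
        elsif (defined $2) {
            # CR-LF inside quotes is data; outside quotes it ends a record
            $out .= $in_quotes ? '<br>' : "\r\n";
        }
        else {
            $out .= $3;                  # ordinary text, copied unchanged
        }
    }
    return $out;
}
```

The cleaned string could then be written to a temporary file and handed to DBD::CSV as before.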
Doing this as a while() loop seems awkward. Is there some elegant regex that would handle this?
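One regex-based possibility, offered as a sketch rather than a definitive answer, is to match each complete quoted field and rewrite the CR-LF pairs inside the match. It rests on the same unverified assumptions as any quote-matching approach: embedded quotes are doubled (""), unquoted fields never contain a quote character, and the file has been read into a single string. The function name crlf_to_br_regex is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): find every complete
# double-quoted field and replace the CR-LF pairs inside it with <br>.
# Assumptions: embedded quotes are doubled (""), and unquoted fields
# contain no quote characters.
sub crlf_to_br_regex {
    my ($text) = @_;

    # "[^"]*(?:""[^"]*)*" matches an opening quote, any run of non-quote
    # characters and doubled quotes, then the closing quote.
    $text =~ s{("[^"]*(?:""[^"]*)*")}{
        my $field = $1;              # copy before the inner substitution
        $field =~ s/\r\n/<br>/g;     # CR-LF inside the field becomes <br>
        $field;
    }ge;

    return $text;
}
```

Since the pattern only ever matches whole quoted fields, a CR-LF that sits between a closing quote and the next opening quote is left alone and still ends the record.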