http://www.perlmonks.org?node_id=363456

Wally Hartshorn has asked for the wisdom of the Perl Monks concerning the following question:

My apologies if this is an FAQ, but I've been unable to find an answer.

I have a comma-separated values (CSV) file exported from a FoxPro database. The text fields within the CSV file are enclosed within double-quotes.

Unfortunately, some of the fields contain embedded CR-LF characters. The DBD::CSV module interprets those CR-LF characters as end-of-record markers. I haven't found any way to tell DBD::CSV that a CR-LF pair within a quoted field is not an end-of-record marker, so I'm using a regex to convert the CR-LF pairs into HTML <br> tags. (The text will be displayed in a browser, so that's what I want anyway.)

The code I've developed to do this is ugly and not bulletproof. It assumes that a quote, followed by a CR-LF, followed by a quote should be treated as the closing quote of the last field of one record, followed by an end-of-record, followed by the opening quote of the first field of the next record. This is not always true.

    {
        local $/;           # slurp mode, restored when the block exits
        $slurp = <$fh>;     # slurp up the file

        # append a bogus final quote to the end of the file
        $slurp .= '"';

        # replace the end-of-record CR-LFs
        $slurp =~ s/"\r\n"/"__EOR__"/g;

        # replace the other CR-LFs with <br> tags
        $slurp =~ s/\r\n/<br>/g;

        # restore the end-of-record CR-LFs
        $slurp =~ s/"__EOR__"/"\r\n"/g;

        # remove the bogus final quote
        $slurp =~ s/"$//;
    }

The correct way to do this would be to loop through the file, counting opening quotes and closing quotes, replacing any CR-LFs within an opening quote/closing quote pair with HTML <br> tags.
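The quote-counting loop described above might look something like the following. This is only a minimal sketch: `crlf_to_br` is a hypothetical helper name, and it assumes that a literal quote inside a field is escaped by doubling it (`""`), which may or may not match what FoxPro actually exports.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of DBD::CSV): replace CR-LF pairs that fall
# inside double-quoted fields with <br>, leaving record-ending CR-LFs alone.
# Assumption: embedded quotes are escaped by doubling them ("").
sub crlf_to_br {
    my ($text) = @_;
    my $out       = '';
    my $in_quotes = 0;

    # Walk the text token by token: a quote, a CR-LF pair,
    # a run of ordinary characters, or a lone CR.
    while ($text =~ /\G("|\r\n|[^"\r]+|\r)/gc) {
        my $tok = $1;
        if ($tok eq '"') {
            # Each quote flips the state; a doubled quote flips it twice,
            # so the state is unchanged after the pair.
            $in_quotes = !$in_quotes;
            $out .= $tok;
        }
        elsif ($tok eq "\r\n") {
            $out .= $in_quotes ? '<br>' : $tok;
        }
        else {
            $out .= $tok;
        }
    }
    return $out;
}
```

Because it tracks whether it is inside a quoted field rather than looking for a quote next to the CR-LF, this approach does not depend on the last field of a record being quoted.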

Doing this as a while() loop seems awkward. Is there some elegant regex that would handle this?
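One possible regex-only form is to match each complete quoted field and rewrite the CR-LF pairs inside that match. This is a sketch under the same assumption that embedded quotes are doubled (`""`); the sample data in `$csv` is made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample: the second field of the first record contains a CR-LF.
my $csv = qq{"a","line1\r\nline2"\r\n"b","c"\r\n};

# Match one whole quoted field at a time ("" counts as an escaped quote),
# then replace CR-LF pairs only within that field.
$csv =~ s{("[^"]*(?:""[^"]*)*")}{
    my $field = $1;
    $field =~ s/\r\n/<br>/g;
    $field;
}ge;
```

The `/e` modifier evaluates the replacement as code, so the inner substitution only ever sees text that lies between an opening and a closing quote.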

Wally Hartshorn