http://www.perlmonks.org?node_id=598431

bart has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to use a plain perl regex s/// to fix up the formatting of fields in a CSV file, so that the real parser will no longer choke on it. The fields, separated by semicolons, are formatted like this:

What I'm trying to do is to leave the quoted fields alone, replace the comma in numeric fields with ".", and drop the unquoted question mark.

The basis of what I've been using looks like this (I've added extensive regex comments describing what it does):

    s(
        ("[^"]*")        # a quoted field, or standalone part of a field
      |
        (?<![^;])        # start of line or preceded by semicolon = start of field
        (  [\-\d,]+      # characters most likely forming a number
         | ([?]) )       # or a "?"
        (?![^;])         # end of line or followed by semicolon = end of field
    )                    # end of regex, start of substitution
    {
        $1 or            # replace quoted string by itself = skip
        $3 ? ''          # a bare unquoted '?', delete
           : do {
               (my $number = $2)   # must be a number
                   =~ tr/,/./;     # replace ',' with '.'
               $number }           # return value
    }xge;
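
To make the intended behaviour concrete, here is a minimal sketch that wraps the same substitution in a helper and runs it on one record. The name fix_line and the sample record are made up for illustration; the record is not taken from my real data.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the substitution shown above to one line
# (without its trailing newline) and returns the fixed copy.
sub fix_line {
    my ($line) = @_;
    $line =~ s(
        ("[^"]*")            # quoted field: kept as is
      |
        (?<![^;])            # start of field
        ( [\-\d,]+ | ([?]) ) # number-like text, or a bare "?"
        (?![^;])             # end of field
    ){
        $1 or $3 ? '' : do { (my $number = $2) =~ tr/,/./; $number }
    }xge;
    return $line;
}

# Invented sample record, for illustration only.
print fix_line('"Smith, J.";12,5;?;-3,75;"?"'), "\n";
# prints: "Smith, J.";12.5;;-3.75;"?"
```

The quoted fields, including the quoted "?", pass through untouched; only the unquoted number-like fields and the bare ? are changed.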

Now the part that I'm having some trouble with: I'm trying to add support for multiline records, thus containing newlines within quoted strings, but without reading in the whole data file at once. Now I can detect if a quoted string is still open by making the closing quote optional, and checking for its presence. The problem is: how do you continue parsing the same open string, until you find the first semicolon, on the next line?
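
The detection half is the easy part. A minimal sketch of it, assuming the line starts outside any quoted field (the name ends_in_open_quote is made up for this example):

```perl
use strict;
use warnings;

# Hypothetical helper: true if $line ends inside a quoted field.
# Assumption: the line starts outside any quoted field.
sub ends_in_open_quote {
    my ($line) = @_;
    my $open = 0;
    # The closing quote is optional; an empty $1 means the string
    # ran to the end of the line without being closed.
    while ($line =~ /"[^"]*("?)/g) {
        $open = ($1 eq '') ? 1 : 0;
    }
    return $open;
}

print ends_in_open_quote('1,5;"first line of text') ? "open\n" : "closed\n";
print ends_in_open_quote('1,5;"complete";?')        ? "open\n" : "closed\n";
```

This only tells me the state at the end of a line; it does not solve the problem of resuming inside the open string on the next line.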

My idea was that, if the previous line was closed, the pattern should work as above, but if we were in a quoted field at the end, it should behave like:

    m(
        ( (?:^|") [^"]* ("?) )
      |
        (?<![^;])
        ( [\-\d,]+ | ([?]) )
        (?![^;])
    )x

instead. Now how do you do that? I've tried experimenting with (?{CODE}), a feature still marked as "highly experimental" after more than 5 years, but I don't quite get it and I couldn't get it to work properly. Because of its "experimental nature" (it may be here to stay, but that doesn't mean it has been properly debugged), I'd like to avoid it anyway.

I've also thought about using /"/g to skip any leading remainders of a quoted string, but s///g simply ignores \G.

So... What would you do?