$ csv-check test.csv
# CSV_XS ERROR: 2027 - EIQ - Quoted field not terminated @ rec 1 pos 3
+6
Checked test.csv with csv-check 1.5 using Text::CSV_XS 1.01
test.csv line 3/36 - 2027 - EIQ - Quoted field not terminated
Depending on how well/bad the CSV is formatted, some extra junk may surface
Enjoy, Have FUN! H.Merijn
| [reply] [Watch: Dir/Any] [d/l] |
Thanks very, very much for the pointer to csv-check.pl
| [reply] [Watch: Dir/Any] [d/l] |
G'day spstansbury,
I see Tux has provided you with a better way to tackle your task.
The reason you're getting the results you see with your posted solution is as follows.
Given this test.csv (newlines shown as "$" and tabs as "^I"):
$ cat -vet test.csv
quote at end"$
no quote at end$
quote and space at end" $
quote and tab at end"^I$
Your posted code matches every line:
$ perl -ne 'print $_ if $_ =~ /[^"]$/' < test.csv
quote at end"
no quote at end
quote and space at end"
quote and tab at end"
The reason for this becomes apparent when you look at exactly what $_ contains:
$ perl -ne 'print ">>>$_<<<\n" if $_ =~ /[^"]$/' < test.csv
>>>quote at end"
<<<
>>>no quote at end
<<<
>>>quote and space at end"
<<<
>>>quote and tab at end"
<<<
Each line ends with a newline which, not being a double-quote, means /[^"]$/ matches.
You can get around this most easily but simply using the -l switch:
$ perl -lne 'print $_ if $_ =~ /[^"]$/' < test.csv
no quote at end
quote and space at end"
quote and tab at end"
If you look at perlrun, you'll see that's equivalent to chomping each input line and appending a newline (or whatever the current value of $/ is) to each output line:
$ perl -ne 'chomp; print "$_\n" if $_ =~ /[^"]$/' < test.csv
no quote at end
quote and space at end"
quote and tab at end"
[Note: the -l switch is a little more complicated than my simplistic explanation - see the doco for details.]
| [reply] [Watch: Dir/Any] [d/l] [select] |
Leaving asides the question of whether it is a good idea to use regex for this (and it is probably not), there are a couple of points about your one-liner.
First, you could rewrite a simpler form as follows:
perl -ne 'print if /[^"]$/' test.csv
The idea is that 'perl -ne' assigns each line to $_, which also happens to be the default variable for print and for regex matching. But that does not matter much, it is just a simplification.
The second point is that your regex is plain wrong (although I don't even know what you are really looking for): /[^"]$/ looks for (and your code will print) all lines that do not contain a double quote before a new line. /[^"] is a character class meaning any character but the double quote, and $/ is the input record separator, defaulted to new line if not specified otherwise.<trike>So, basically, your code prints any input line not ending with a double quote. This is very unlikely to be what you wanted.
Update: Stupid mistake on my part. I first typed that the regex was looking for a non double-quote at the end of a line, and then, while still typing, saw the / after the $ sign and thought, "oh, no, it is not end of line but input record separator", at the time I could no longer see the original post, so I made a wrong correction to what I had initially typed. My first impulse was of course the good one.
Having said that, it may not change that much to my conclusion. The regex is probably not doing what is wanted. Given the usual structure of a CSV, a closing quote means, in my mind, a quote closing a quoted string, not a quote ending the line. And the fact that the OP is getting all the lines printed seems to confirm this interpretation.
Now if the sentence "closing quote" meant a quote at the end of the line, then, of course, I am wrong on this, a regex would be an appropriate tool, and the regex might need an optional single or multiple space between the non quote character and the end of line, something like this: /^"\s*$/, and/or an optional end of line character. Well, anyway, difficult to say for sure without seeing a data sample.
| [reply] [Watch: Dir/Any] [d/l] [select] |
You're mistaken, and more than once. First, finding lines ending with other-than-double-quote is exactly what was wanted. Next, the regex doesn't have anything to do with the record separator. The dollar sign represents end-of-string, and the slash closes the regex. Finally, you say the code prints any line not ending with double quote, but it actually prints every line it's being fed, regardless of double quotes, because the input lines aren't chomp'ed and [^"] matches the newlines. Hope that's clear.
| [reply] [Watch: Dir/Any] [d/l] |
Unfortunately, pretty much everything you wrote in that last paragraph is wrong in way or another.
Perhaps you misread something then propagated that misreading throughout.
"/[^"]$/ looks for (and your code will print) all lines that do not contain a double quote before a new line."
Clearly that's not the case. See my response to the OP below, or this specific example which does "contain a double quote before a new line":
$ perl -Mstrict -Mwarnings -E '$_ = qq{"\n}; say if /[^"]$/'
"
"/[^"] is a character class meaning any character but the double quote, ..."
That's the opening delimiter for the regex (/) followed by a character class ([^"]).
I thought that may have been a simple typo, and perhaps I was being pedantic, but as I continue to read that sentence:
"... and $/ is the input record separator, defaulted to new line if not specified otherwise."
No, that's not what $/ is in this context. It's an assertion anchoring the end of the line ($) followed by the closing delimiter for the regex (/).
"So, basically, your code prints any input line not ending with a double quote. This is very unlikely to be what you wanted."
Given the title, "Trying to find missing closing quotes in CSV", code which "prints any input line not ending with a double quote" seems more likely than "very unlikely" in terms of what the OP wants.
| [reply] [Watch: Dir/Any] [d/l] [select] |
good idea to use regex for this
if /[^"]$/ is a regex match this means m// is the operator and [^"]$ is the regex pattern (the regex )
the / and / are delimiters, they're not part of the regex pattern; you use a regex by using m//atch or s///ubstitute operators
m//atch operator s///ubstitution operator tr///ansliteration operator
to m//atch to s///ubstitute to tr///ansliterate
it sinks in quicker when you realize the regex are arguments top m//atch and s///ubstitute, and tr///ansliterate takes chars/char-ranges not regex
Simple matching, Simple substitution
:)
| [reply] [Watch: Dir/Any] [d/l] [select] |
Alternatively, you could be looking for lines with an odd number of quotes:
use strict;
use warnings;
while(<DATA>){
print if tr/"/"/ % 2
}
__DATA__
1,2,3
"1",2,3
1,2,"3
| [reply] [Watch: Dir/Any] [d/l] |
$ cat test.csv
1,"2,","3"04"
2,"newline there ->
and more text for row 2",There
3,,Bah
$ csv-check test.csv
Checked test.csv with csv-check 1.5 using Text::CSV_XS 1.01
OK: rows: 3, columns: 3
sep = <,>, quo = <">, bin = <1>, eol = <"\n">
$
Parsing CSV with regular expressions and/or on perl default readline method is a dangerous path to walk.
Enjoy, Have FUN! H.Merijn
| [reply] [Watch: Dir/Any] [d/l] [select] |