Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Trying to find missing closing quotes in CSV

by spstansbury (Scribe)
on Sep 16, 2013 at 16:10 UTC ( #1054307=perlquestion: print w/ replies, xml ) Need Help??
spstansbury has asked for the wisdom of the Perl Monks concerning the following question:

I'm hoping that I'm just overlooking something simple.

I'm trying to clean up some badly formatted CSV files that have missing closing quotes. I've tried the following:

perl -ne 'print $_ if $_ =~ /[^"]$/' < test.csv

but that is returning everything. Any thoughts would be appreciated!

Comment on Trying to find missing closing quotes in CSV
Download Code
Re: Trying to find missing closing quotes in CSV
by Tux (Monsignor) on Sep 16, 2013 at 16:22 UTC

    Install (and use) Text::CSV_XS and its utilities:

    $ csv-check test.csv # CSV_XS ERROR: 2027 - EIQ - Quoted field not terminated @ rec 1 pos 3 +6 Checked test.csv with csv-check 1.5 using Text::CSV_XS 1.01 test.csv line 3/36 - 2027 - EIQ - Quoted field not terminated

    Depending on how well/bad the CSV is formatted, some extra junk may surface


    Enjoy, Have FUN! H.Merijn
      Thanks very, very much for the pointer to csv-check.pl
Re: Trying to find missing closing quotes in CSV
by Laurent_R (Parson) on Sep 16, 2013 at 19:04 UTC

    Leaving asides the question of whether it is a good idea to use regex for this (and it is probably not), there are a couple of points about your one-liner.

    First, you could rewrite a simpler form as follows:

    perl -ne 'print if /[^"]$/' test.csv

    The idea is that 'perl -ne' assigns each line to $_, which also happens to be the default variable for print and for regex matching. But that does not matter much, it is just a simplification.

    The second point is that your regex is plain wrong (although I don't even know what you are really looking for): /[^"]$/ looks for (and your code will print) all lines that do not contain a double quote before a new line. /[^"] is a character class meaning any character but the double quote, and $/ is the input record separator, defaulted to new line if not specified otherwise.<trike>So, basically, your code prints any input line not ending with a double quote. This is very unlikely to be what you wanted.

    Update: Stupid mistake on my part. I first typed that the regex was looking for a non double-quote at the end of a line, and then, while still typing, saw the / after the $ sign and thought, "oh, no, it is not end of line but input record separator", at the time I could no longer see the original post, so I made a wrong correction to what I had initially typed. My first impulse was of course the good one.

    Having said that, it may not change that much to my conclusion. The regex is probably not doing what is wanted. Given the usual structure of a CSV, a closing quote means, in my mind, a quote closing a quoted string, not a quote ending the line. And the fact that the OP is getting all the lines printed seems to confirm this interpretation.

    Now if the sentence "closing quote" meant a quote at the end of the line, then, of course, I am wrong on this, a regex would be an appropriate tool, and the regex might need an optional single or multiple space between the non quote character and the end of line, something like this: /^"\s*$/, and/or an optional end of line character. Well, anyway, difficult to say for sure without seeing a data sample.

      You're mistaken, and more than once. First, finding lines ending with other-than-double-quote is exactly what was wanted. Next, the regex doesn't have anything to do with the record separator. The dollar sign represents end-of-string, and the slash closes the regex. Finally, you say the code prints any line not ending with double quote, but it actually prints every line it's being fed, regardless of double quotes, because the input lines aren't chomp'ed and [^"] matches the newlines.

      Hope that's clear.

      Unfortunately, pretty much everything you wrote in that last paragraph is wrong in way or another. Perhaps you misread something then propagated that misreading throughout.

      "/[^"]$/ looks for (and your code will print) all lines that do not contain a double quote before a new line."

      Clearly that's not the case. See my response to the OP below, or this specific example which does "contain a double quote before a new line":

      $ perl -Mstrict -Mwarnings -E '$_ = qq{"\n}; say if /[^"]$/' "
      "/[^"] is a character class meaning any character but the double quote, ..."

      That's the opening delimiter for the regex (/) followed by a character class ([^"]). I thought that may have been a simple typo, and perhaps I was being pedantic, but as I continue to read that sentence:

      "... and $/ is the input record separator, defaulted to new line if not specified otherwise."

      No, that's not what $/ is in this context. It's an assertion anchoring the end of the line ($) followed by the closing delimiter for the regex (/).

      "So, basically, your code prints any input line not ending with a double quote. This is very unlikely to be what you wanted."

      Given the title, "Trying to find missing closing quotes in CSV", code which "prints any input line not ending with a double quote" seems more likely than "very unlikely" in terms of what the OP wants.

      -- Ken

      good idea to use regex for this

      if  /[^"]$/ is a regex match this means m// is the operator and  [^"]$ is the regex pattern (the regex )

      the / and / are delimiters, they're not part of the regex pattern; you use a regex by using m//atch or s///ubstitute operators

      m//atch operator s///ubstitution operator tr///ansliteration operator

      to m//atch to s///ubstitute to tr///ansliterate

      it sinks in quicker when you realize the regex are arguments top m//atch and s///ubstitute, and tr///ansliterate takes chars/char-ranges not regex

      Simple matching, Simple substitution

      :)

Re: Trying to find missing closing quotes in CSV
by kcott (Abbot) on Sep 16, 2013 at 23:13 UTC

    G'day spstansbury,

    I see Tux has provided you with a better way to tackle your task. The reason you're getting the results you see with your posted solution is as follows.

    Given this test.csv (newlines shown as "$" and tabs as "^I"):

    $ cat -vet test.csv quote at end"$ no quote at end$ quote and space at end" $ quote and tab at end"^I$

    Your posted code matches every line:

    $ perl -ne 'print $_ if $_ =~ /[^"]$/' < test.csv quote at end" no quote at end quote and space at end" quote and tab at end"

    The reason for this becomes apparent when you look at exactly what $_ contains:

    $ perl -ne 'print ">>>$_<<<\n" if $_ =~ /[^"]$/' < test.csv >>>quote at end" <<< >>>no quote at end <<< >>>quote and space at end" <<< >>>quote and tab at end" <<<

    Each line ends with a newline which, not being a double-quote, means /[^"]$/ matches.

    You can get around this most easily but simply using the -l switch:

    $ perl -lne 'print $_ if $_ =~ /[^"]$/' < test.csv no quote at end quote and space at end" quote and tab at end"

    If you look at perlrun, you'll see that's equivalent to chomping each input line and appending a newline (or whatever the current value of $/ is) to each output line:

    $ perl -ne 'chomp; print "$_\n" if $_ =~ /[^"]$/' < test.csv no quote at end quote and space at end" quote and tab at end"

    [Note: the -l switch is a little more complicated than my simplistic explanation - see the doco for details.]

    -- Ken

Re: Trying to find missing closing quotes in CSV
by hdb (Prior) on Sep 17, 2013 at 08:03 UTC

    Alternatively, you could be looking for lines with an odd number of quotes:

    use strict; use warnings; while(<DATA>){ print if tr/"/"/ % 2 } __DATA__ 1,2,3 "1",2,3 1,2,"3

      Bad idea. In perfectly correct CSV, it is possible to have odd number of " perl line in many different ways. This is valid CSV (all lines with " have an odd number of "):

      $ cat test.csv 1,"2,","3"04" 2,"newline there -> and more text for row 2",There 3,,Bah $ csv-check test.csv Checked test.csv with csv-check 1.5 using Text::CSV_XS 1.01 OK: rows: 3, columns: 3 sep = <,>, quo = <">, bin = <1>, eol = <"\n"> $

      Parsing CSV with regular expressions and/or on perl default readline method is a dangerous path to walk.


      Enjoy, Have FUN! H.Merijn

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1054307]
Front-paged by Arunbear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (5)
As of 2014-12-28 08:13 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (179 votes), past polls