Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Auto correct a csv file

by karthikAk (Initiate)
on Feb 18, 2014 at 09:26 UTC ( [id://1075301]=perlquestion: print w/replies, xml ) Need Help??

karthikAk has asked for the wisdom of the Perl Monks concerning the following question:

Sometimes the csv file will have broken records like below. It needs to be autocorrected.

Sample data 1

abcd,732656,20140211,20140202,

58461.10,C,INR,,,Settlement of purchase.,,1

sample data 2

abcd,732656,20140211,20140202

,58461.10,C,INR,,,Settlement of purchase.,,1

Below is the perl code that i'm using. This woud correct only in cases like in sample data 1. But not in sample data2. How to tweak the code such that it works in both cases?

perl -pi.orig -e 's/,\r+\n//g;s/\n,/,/g' infile

Replies are listed 'Best First'.
Re: Auto correct a csv file
by graff (Chancellor) on Feb 19, 2014 at 01:56 UTC
    It took me a while to see what your problem is. You have some fields in your original csv file that contain embedded line breaks, but there's no quoting or escaping provided, so a "normal" CSV parse won't work very well.

    When a line ends with a comma, you're able to join it with the following line, by removing the line-break(s) after the comma. (But your regex does it wrong: it won't apply at all on LF-style data, it won't handle extra blank lines properly for CRLF-style data, and it removes the comma, which should be kept.)

    When a line begins with a comma, you want to join it to the previous line, but the previous line has already been processed and written to output, so it's too late to fix that.

    So, don't do it one line at a time - process the whole file as a single string:

    perl -e '$/=undef; $_=<>; s/,[\r\n]+/,/g; s/[\r\n]+,/,/g; print' infil +e > infile.fxd
    Those regexes preserve the commas, and handle any number of consecutive line breaks before or after a comma (for both LF and CRLF data).

    (Note that I'm redirecting output to a different file, rather than replacing the original - that makes it easy to "try, try again" for cases like this, where you seldom get it right the first time. Once you get it right, then you can rename the output to replace the input.)

    (Update: P.S.: Welcome to the Monastery!)

    (Updated again to add remarks about LF vs. CRLF data)

      This works perfectly. But it deletes the very first line. How to prevent this?

        … But it deletes the very first line. How to prevent this?

        Can you post an example data file to demonstrate this problem? (For example, just copy the first few lines from the "real" data file in question and put them in a separate file. Then run the command line to produce a modified version of that. If the first line of input is missing from the output, please post the input, the command line you actually used, and the output.

        I suspect that one (or more) of the following could be happening:

        • You aren't using the exact code that I posted.
        • What you think should be "the first line" of your input file is not actually in that file to begin with (that is, it was missing before you ran the script).
        • There's something goofy going on with carriage-return characters in your data, and the first line really is there, but you might not be "seeing" it because maybe it ends with just CR instead of CRLF, which might cause the 2nd line to be printed "on top of" the first one in your display.

        I have no way to answer your question without knowing more about your data.

Re: Auto correct a csv file
by Tux (Canon) on Feb 19, 2014 at 08:15 UTC

    I think graff summarized your problem quite well. "Fixing" CSV with regular expressions is fragile. It is possible for some very well-defined situation but likely to break in another.

    What (usually) works better is to use a well-defined parser like Text::CSV_XS and remember what a sane state is: either scan each line knowing how many fields should be in it or dynamically remember what the first sane line told you. If a line gives you a parsing error or the parsed line does not contain enough fields, back up and join the next line. You can take the beforementioned parser-xs.pl as a framework to start from.

    In your case, that could end up with something like (untested) this:

    use strict; use warnings; use Text::CSV_XS; my $csv = Text::CSV_XS->new ({ binary => 1, blank_is_undef => 1, eol => $/, }); my $csa = Text::CSV_XS->new ({ binary => 1, allow_loose_quotes => 1, blank_is_undef => 1, escape_char => undef, }); my $file = @ARGV ? shift : "test.csv"; open my $fh, "<", $file or die "$file: $!\n"; my %err_eol = map { $_ => 1 } 2010, 2027, 2031, 2032; print STDERR "Reading $file with Text::CSV_XS $Text::CSV_XS::VERSION\n +"; my $nf = 0; my @pv; while (1) { my $row = $csv->getline ($fh); if ($row) { if (@pv) { # previous line ended with embedded newline. $pv[-1] .= "\n" . shift @$row; unshift @$row, @pv; @pv = (); } $nf ||= @$row; if (@$row < $nf) { @pv = @$row; redo; } } else { # Parsing failed # Could be end of file $csv->eof and last; # Diagnose and show what was wrong my @diag = $csv->error_diag; print STDERR "$file line $./$diag[2] - $diag[0] - $diag[1]\n"; my $ep = $diag[2] - 1; # diag[2] is 1-based my $ein = $csv->error_input; # The line scanned so far my $err = $ein . " "; substr $err, $ep + 1, 0, "*"; # Bad character marked between + ** substr $err, $ep, 0, "*"; ($err = substr $err, $ep - 5, 12) =~ s/ +$//; print STDERR " |$err|\n"; REPARSE: { # Now retry with allowed options if ($csa->parse ($ein)) { print STDERR "Accepted in allow mode ...\n"; $row = [ $csa->fields ]; } else { # Still fails my @diag = $csa->error_diag; if (exists $err_eol{$diag[0]}) { # \r or \n inside fie +ld print STDERR " Extending line with next chunk\n"; $ein .= scalar <$fh>; goto REPARSE; } print STDERR " Also could not parse it in allow mode\ +n"; print STDERR " $./$diag[2] - $diag[0] - $diag[1]\n"; print STDERR " Line skipped\n"; next; } } } # Data was fine, print data properly quoted $csv->print (*STDOUT, $row); }

    Enjoy, Have FUN! H.Merijn
Re: Auto correct a csv file
by ww (Archbishop) on Feb 18, 2014 at 12:42 UTC
    It would be easier to help if I knew "what is it that's broken?" rather than needing to interpret your (admittedly broken, or -- at best -- inadequate) use of regex.

    :-(

    Come, let us reason together: Spirit of the Monastery
Re: Auto correct a csv file ( csv-check )
by Anonymous Monk on Feb 18, 2014 at 13:23 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1075301]
Approved by hdb
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (4)
As of 2024-03-19 05:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found