Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical

CSV cleanup problem...

by devnul (Monk)
on Jun 14, 2005 at 01:43 UTC ( #466344=perlquestion: print w/replies, xml ) Need Help??
devnul has asked for the wisdom of the Perl Monks concerning the following question:


I have a CSV file which needs some cleanup in order for Text::CSV_XS to be able to parse it. I know what I need to do, just not sure how to do it.

First a sample:
"title", "Some Name, "some wierd title"", "555-555-5555"

This breaks up into 3 fields:
#1. title #2. Some Name, "some wierd title" #3. 555-555-5555

I can get the above to work if I change the above to:
"title","Some Name,""some wierd title""","555-555-5555"

This required two changes:
#1. Changing /", "/ to /",/
#2. Double-quoting some wierd title

I can do #1 with a regex with no difficulty, but have no idea how to do #2.
Any ideas?

- dEvNuL

Replies are listed 'Best First'.
Re: CSV cleanup problem...
by kaif (Friar) on Jun 14, 2005 at 01:57 UTC
    The following code works at least in this case. It should also work in others, but I don't know how messed up your input CSV is:
    $_ = q("title", "Some Name, "some weird title"", "555-555-5555"); s/", "/","/g; s/(?<=[^,])"(?=[^,])/""/g; print "$_\n"; __END__ "title","Some Name, ""some weird title""","555-555-5555"

    Hope this helps!

    Update: Oh yeah, the logic in the regex: any quote that isn't preceded or succeeded by a comma is probably embedded. However, this relies on there not being any properly embedded quotes, but this is necessary. After all, does "foo""bar" contain one or two quotes? Given that the input is not properly embedded, it is not clear.

Re: CSV cleanup problem...
by GrandFather (Sage) on Jun 14, 2005 at 02:27 UTC

    if you know the input is of the form:

    "title", "messy stuff", "number"

    you can use this:

    /^"(.*?)", "(.*)", "(.*?)"/g to give title in $1, messy stuff in $2 and number in $3.

    Note that fairly fixed formatting is assumed. Use \s* as required in place of the space in '", "'.

    Perl is Huffman encoded by design.
Re: CSV cleanup problem...
by jZed (Prior) on Jun 14, 2005 at 04:48 UTC
    I guess my recommendation would be to try a regex similar to kaif's and then feed the result to Text::CSV_XS and check a) for parse errors and b) to see if you have the correct number of fields. Then write the lines that fail a or b into a separate file and deal with them by hand. (I know, yuck, but, garbage in, garbage out).
Re: CSV cleanup problem...
by BrowserUk (Pope) on Jun 14, 2005 at 05:40 UTC

    I was going to suggest that you transform

    "title", "Some Name, "some wierd title"", "555-555-5555"
    'title'|'Some Name, "some wierd title"'|'555-555-5555'
    $line =~ s[^"|"$][']g; $line =~ s[",\s"]['|']g;

    And then set

    my $csv = Text::CSV_XS->new({ 'quote_char' => "'", 'sep_char' => '|', });

    But for some reason I do not understand, Text::CSV_XS fails to parse using these parameters.

    As far as I can tell from the docs this combination ought to work, but it doesn't, and I can't find any way to determine what it is that it doesn't like.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
      You need to add escape_char => q{'} or somesuch, otherwise the quote marks are treated as escapes. In the next release I intend to have the escape_char default to whatever the sep_char is set to rather than to " as it is now. This parses your sample correctly for me:
      #!perl -w use strict; use Text::CSV_XS; use IO::Scalar; my $str = q{'title'|'Some Name, "some wierd title"'|'555-555-5555'}; my $csv = Text::CSV_XS->new( { binary=>1, quote_char=>q{'}, sep_char=>q{|}, escape_char=>q{'}, } ); my $fh = IO::Scalar->new(\$str); while (my $cols = $csv->getline($fh)) { last unless @$cols; printf "%s\n", join "\n",@$cols; }
Re: CSV cleanup problem...
by tlm (Prior) on Jun 14, 2005 at 01:51 UTC

    This doesn't directly answer your question, but is this link of any use to you? (This link, BTW, asserts that Text::CSV_XS can handle embedded commas, so I'm a bit puzzled by your question.)

    Update: Thanks to jZed for the cluebrick. FWIW,

    split /(?<=")\s*,\s*(?=")/, $string
    splits the OP's sample input into
    0 '"title"' 1 '"Some Name, "some wierd title""' 2 '"555-555-5555"'
    and it is a trivial matter to remove the leading and trailing double quotes from each field; e.g.:
    my @fields = map { s/^"(.*)"\z/$1/; $_ } split /(?<=")\s*,\s*(?=")/, $string;

    the lowliest monk

      The problem in devnul's data isn't embedded commas, or embedded quote marks (both of which Text::CSV_XS handles fine as long as the embedded quotes are escaped (by your choice of escape character). The problem with the data is un-escaped embedded quotes marks. My guess is that unless there are some rather arbitrary known things about the embedded quote marks, it will be darn tough to deal with them.
Re: CSV cleanup problem...
by ikegami (Pope) on Jun 14, 2005 at 02:14 UTC

    The best definition of an embedded quote I've thought up for this situation is: If the quote isn't followed by a comma, assume it's an embedded quote.

    # Fix embedded quotes. s/"(?![",])/""/g;

    To remove the spaces before the commas, just do it after using Text::CSV_XS.

    That's what I get for trying to answer a question on my way out.

      I would watch out there: applying your regex to the sample input results in the following:

      ""title", ""Some Name, ""some weird title"", ""555-555-5555""

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://466344]
Approved by kaif
[karlgoethebier]: Hi mario
[marioroy]: Hi karlgoethebier.
[karlgoethebier]: "Long time No See" ;-)
[marioroy]: Yes. I've been working on MCE and MCE::Shared. These are completed and released. MCE 1.830 and MCE::Shared 1.831. Saw the LWP::Simple and P::FM thread and thought to help out.

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (8)
As of 2017-09-23 15:52 GMT
Find Nodes?
    Voting Booth?
    During the recent solar eclipse, I:

    Results (272 votes). Check out past polls.