Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Parsing CSV

by kumoparris (Initiate)
on Apr 03, 2014 at 16:49 UTC ( #1080972=perlquestion: print w/ replies, xml ) Need Help??
kumoparris has asked for the wisdom of the Perl Monks concerning the following question:

I am struggling with parsing a csv file. My csv file unfortunately has some double quotes and commas in the column data. Below is some sample data. Columns 3, 4 and 5 contain names in the format (last, first middle). To add to the complexity one or more of the name columns might be empty, as in the second, third and fourth sample lines.

Examples

column1,column2,column3,column4,column5,column6,column7

data,data,"lname, fname mi","lname, fname mi","lname fname mi",data,data

data,data,"lname, fname mi","lname, fname mi",,data,data

data,data,"lname, fname mi",,,data,data

data,data,,,,data,data

Comment on Parsing CSV
Re: Parsing CSV
by choroba (Abbot) on Apr 03, 2014 at 17:01 UTC
    What did you try? Do you use Text::CSV or Text::CSV_XS?
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Reaped: Re: Parsing CSV
by NodeReaper (Curate) on Apr 03, 2014 at 17:01 UTC
Re: Parsing CSV
by davido (Archbishop) on Apr 03, 2014 at 17:12 UTC

    Wow, I wonder if anybody has ever had to deal with CSV that has been made non-trivial by quotes and embedded commas? I bet they have... Perl's been around a long time, after all. Maybe I should search CPAN to see if one of the 15,000 distributions there can be useful to me: Text::CSV, Text::CSV_XS.

    Sample code...

    use Text::CSV; my $csv = Text::CSV->new or die Text::CSV->error_diag; while( my $row = $csv->getline(\*DATA) ) { print scalar @$row, " columns: (", join( ') (', @$row ), ")\n"; } __DATA__ column1,column2,column3,column4,column5,column6,column7 data,data,"lname, fname mi","lname, fname mi","lname fname mi",data,da +ta data,data,"lname, fname mi","lname, fname mi",,data,data data,data,"lname, fname mi",,,data,data data,data,,,,data,data

    Sample output:

    $ ./mytest.pl 7 columns: (column1) (column2) (column3) (column4) (column5) (column6) + (column7) 7 columns: (data) (data) (lname, fname mi) (lname, fname mi) (lname fn +ame mi) (data) (data) 7 columns: (data) (data) (lname, fname mi) (lname, fname mi) () (data) + (data) 7 columns: (data) (data) (lname, fname mi) () () (data) (data) 7 columns: (data) (data) () () () (data) (data)

    (Adapted from the SYNOPSIS section in Text::CSV's POD.)


    Dave

      Dave

      Thanks for the response. I am a bit new here and did look at Text::CSV, but couldn't figure out how to solve the entire issue. I was hoping to also remove the comma in the names in the same pass.

        Text::CSV won't do it directly, but if you're ok with just blindly removing commas, you can do this:

        use Text::CSV; my $csv = Text::CSV->new or die Text::CSV->error_diag; while( my $row = $csv->getline(\*DATA) ) { tr/,//d for @$row; print scalar @$row, " columns: (", join( ') (', @$row ), ")\n"; } __DATA__ column1,column2,column3,column4,column5,column6,column7 data,data,"lname, fname mi","lname, fname mi","lname fname mi",data,da +ta data,data,"lname, fname mi","lname, fname mi",,data,data data,data,"lname, fname mi",,,data,data data,data,,,,data,data

        ...which produces...

        s$ ./mytest.pl 7 columns: (column1) (column2) (column3) (column4) (column5) (column6) + (column7) 7 columns: (data) (data) (lname fname mi) (lname fname mi) (lname fnam +e mi) (data) (data) 7 columns: (data) (data) (lname fname mi) (lname fname mi) () (data) ( +data) 7 columns: (data) (data) (lname fname mi) () () (data) (data) 7 columns: (data) (data) () () () (data) (data)

        That seems like it's almost there. But if you need to reverse the order of "words" within the fields that have embedded commas, you could do this:

        use Text::CSV; my $csv = Text::CSV->new or die Text::CSV->error_diag; while( my $row = $csv->getline(\*DATA) ) { s/([^,]+),\s*([^,]+)/$2 $1/ for @$row; print scalar @$row, " columns: (", join( ') (', @$row ), ")\n"; } __DATA__ column1,column2,column3,column4,column5,column6,column7 data,data,"lname, fname mi","lname, fname mi","lname fname mi",data,da +ta data,data,"lname, fname mi","lname, fname mi",,data,data data,data,"lname, fname mi",,,data,data data,data,,,,data,data

        ...which puts "lname, fname mi" into "fname mi lname" order, providing the following results:

        $ ./mytest.pl 7 columns: (column1) (column2) (column3) (column4) (column5) (column6) + (column7) 7 columns: (data) (data) (fname mi lname) (fname mi lname) (lname fnam +e mi) (data) (data) 7 columns: (data) (data) (fname mi lname) (fname mi lname) () (data) ( +data) 7 columns: (data) (data) (fname mi lname) () () (data) (data) 7 columns: (data) (data) () () () (data) (data)

        FWIW, I am happy to deal with questions where a commonly used CPAN solution is falling short of what you need, but the original question made no mention of what solutions had been explored, or how they failed to meet the need. You tend to get warmer and far more useful responses if you provide us with details on what you tried, how it failed to fill the requirement, a small self-contained code example, sample input (as you did), and sample output.

        Often, in composing an effective question, the research that goes into its composition leads you to the answer without even asking. But in those cases where the answer is still elusive, you'll be able to demonstrate exactly what small part of the bigger picture is holding you up so that our answers hit the mark on the first try, rather than after a round or two of follow-ups (as has been the case here).

        I like giving good answers. But sometimes I simply can't devote the time because it's obvious that it's going to take several rounds of asking for additional information before we get to the meat of the problem. That's an unfortunate situation for the person asking, because his question doesn't get the quality of answers he needs, when it could have gotten a quick and accurate answer had it only provided more context.

        Anyway, welcome to the Monastery, and I hope you find it both helpful and encouraging to your efforts to learn Perl.


        Dave

Re: Parsing CSV
by kennethk (Monsignor) on Apr 03, 2014 at 17:12 UTC
    The right answer, as choroba and davido say, is don't roll your own: Text::CSV is well tested and already handles standard escapes.

    Hypothetically, if you wanted to roll your own, you'd probably want to write a state machine that parsed your strings character by character. You would track if you have passed an even number of double quotes, and if so, empty the buffer onto your array. You'd probably also want to be aware that "" is sometimes used to represent the " literal in the context of a string; in general, you need to know the escaping rules followed for any formatted text object you parse.

    Or, you know, you could download code from CPAN.


    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1080972]
Approved by toolic
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (10)
As of 2014-08-22 05:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (147 votes), past polls