Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Regex with malformed CSV files

by ickyb0d (Monk)
on Jan 10, 2006 at 19:33 UTC ( [id://522293]=perlquestion: print w/replies, xml ) Need Help??

ickyb0d has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I'm currently making a program to convert CSV files from mail programs to a list of vCards. As each email program is different, each one requires a different work around for the exported CSV address book. I've figured out most myself, but i just can't figure out how to handle outlooks malformed CSV files.
First Name,Last Name,Middle Name,Name,Nickname,E-mail Address,Home Str +eet,Home City,Home Postal Code,Home State,Home Country/Region,Home Ph +one,Home Fax,Mobile Phone,Personal Web Page,Business Street,Business +City,Business Postal Code,Business State,Business Country/Region,Busi +ness Web Page,Business Phone,Business Fax,Pager,Company,Job Title,Dep +artment,Office Location,Notes Joe,Schmoe,L,Joe L Schmoe,"Joe, "the shark" Smith" ,joe@here.com,"270 +E. Willbur Ave Apt 378",Atlanta,30823,GA,USA,HOMEPHONE,FAXPHONE,CELLPHONE,http://perl +monks.com,"233 N. Ocean Drive Suite 300",Pheonix,73829,AZ,USA,http://joeswork.com,WORKPHONE,WORKFAX,WORKPA +GER,Joes Company,Joes Title,IT Technologies,Office,here are some note +s "Smith,",Tony,,"Smith, Tony",,tony@smith.com,,,,,,,,,,,,,,,,,,,,,,,


Notice how the \r\n at the end of each line normally separate entries in a CSV file, however, here they can be part of the actual field value. Outlook doesn't handle quotations very well either, since it doesn't mark them. You can see this in Joe's nickname as "Joe, "The Shark" Smith". any other CSV gererator would make it as "Joe, ""The Shark"" Smith" as opposed to the one shown above.

One would normally use Text::ParseWords::parse_line to get the quoted values, but that doesn't work in this case, and with the broken lines, it is unable to find two instances of quotes for some fields. I've also considered counting the number of fields just by splitting the commas, but even then i'll end up getting an innaccurate count due to quotes and possible embedded commas within the quotes... Right now i'm just wondering if anyone might have any insight onto how to handle this... or perhaps some guidance as to what i would even begin to start with for a RegEx. Thanks!

Replies are listed 'Best First'.
Re: Regex with malformed CSV files
by jZed (Prior) on Jan 10, 2006 at 19:39 UTC
    Newlines and commas embedded in double quotes are not a problem, Text::CSV_XS (with binary=>1) and Text::xSV will handle them fine. As for unescaped quotes inside quotes, I can't imagine there's a way to automate that. Personally, I'd parse the file with Text::CSV_XS and check for parse sucess, writing successfull parses to one file and unsuccessful ones to another and then escape the quotes in the bad file by hand :-(.
      Thanks for the info! Here's what i ended up doing using Text::CSV_XS

      while(!($csv->parse($line)) || $csv->fields < $COLUMNS) { $line =~ s/\r//g; $line =~ s/\n$/=0D=0A/; $line .= <IN>; }


      So while it's an invalid line (csv->parse) or the number of fields (csv->fields) isn't equal to the number of colums... it just keeps grabbing lines.

      Thanks again!
Re: Regex with malformed CSV files
by Old_Gray_Bear (Bishop) on Jan 10, 2006 at 19:42 UTC
    Take a look at the Text::CSV* modules in CPAN. I have had very good luck using Text::CSV_XS to go back and forth between MS Excel spread-sheets and compound Perl data structures that I can feed into XML.

    Update: Fixed typo, CVS should have been CSV. (thank you Tanktalus)

    ----
    I Go Back to Sleep, Now.

    OGB

Re: Regex with malformed CSV files
by bluto (Curate) on Jan 10, 2006 at 20:33 UTC
    I'd probably just document the fact that Outlook is broken and not supported, or do what jZed suggests. You may be able to do a better job if you parse the problem lines yourself and examine individual fields for clues on where you are at in a "line".

    For example, most fields will probably not have embedded newlines in them; a zip code will probably not have embedded quotes or commas and will probably be short; email addresses will tend to have '@' in them; quoted fields will probably not strech 100's of characters; etc.

    If you enforce some rules like this, your parser may be able to determine most of the time where it is. Of course, you could go stark raving mad in a futile effort trying to figure out the perfect ruleset...

      well so far the above solution solved the malformed portion of outlook regarding newlines. to solve the problem of the embedded quotes i'm using...
      $line =~ s/(?<!,)"{1,2}(?!,)/""/g;
      basically just adds two quotes if it finds one, or two quotes that are not preceeded or followed by a (,) comma. This still won't solve a rule if there's one such as

      "my address is ",320 main street" virginia". but in all honesty... how often is it going to happen? :) thanks for the help everyone.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://522293]
Approved by Old_Gray_Bear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (2)
As of 2024-04-19 20:19 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found