Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

Recommendations for parsing invalid CSV

by markjugg (Curate)
on Apr 21, 2008 at 13:58 UTC ( #681919=perlquestion: print w/ replies, xml ) Need Help??
markjugg has asked for the wisdom of the Perl Monks concerning the following question:

I need to parse invalid CSV in a situation where it's possible to get the source to generate valid CSV. The problem is that quoting isn't handled right, so this may appear in the file:

  "call from "friend""

My best idea for handling it is to try to pre-parse it to make it valid, and then hand the result to Text::CSV_XS to finish the job. The logic for "fixing" it might be like this:

- If a quote character appears at the beginning or end of a line, or next to a comma, consider it a part of the file delimiter. Otherwise, consider it an internal quote and escape it properly.

I realize there are edge cases that wouldn't be handled by this logic, but it does seem like a case that difficult to correct perfectly automatically.

My hope is that someone might now of a CSV parsing tool that tolerates this particular kind of broken-ness.

Thanks!

Comment on Recommendations for parsing invalid CSV
Re: Recommendations for parsing invalid CSV
by Corion (Pope) on Apr 21, 2008 at 14:06 UTC

    I'm not aware of any tool that does that, but it isn't hard to find the problematic lines, as these will have more delimiters in them than the number of columns. All other lines are trivially decodable and pose no problem, as the quotes are unnecessary there.

    For the problematic lines, I would parse the complete line and aim for ," to start a quoted string and ", to end said string. I would also log all these problematic lines for later human inspection. You could also consider to ask a human about these problematic lines instead of spilling one interpretation into the DB.

      Corion,

      I think the suggestion to focus on the "problem lines" is a good one. If I have to pre-parse the file by hand, I think I'll do that.

Re: Recommendations for parsing invalid CSV
by Tux (Monsignor) on Apr 21, 2008 at 14:20 UTC

    Are you by any chance working together with Michael Fowler? He asks almost the same thing in this RT entry.

    See my comments there.


    Enjoy, Have FUN! H.Merijn
      I am not. I'm working with an export of an older version of the "TimeMatters" software.

      Regarding what Text::CSV_XS can do, in summary are you saying that one of the "allow_loose_quotes" or "allow_loose_escapes" option can or will be able to handle this?

      That would certainly be the easiest path for me. Thanks so much for the fast and helpful response!

        Yes. You'd have to play with allow_loose_quotes, allow_loose_escapes and/or setting escape_char to undef to find your best combination. The upcoming 0.43 has a new util examples/csv-check, which will tell you line number and position in-line of where the first failure happens


        Enjoy, Have FUN! H.Merijn
Re: Recommendations for parsing invalid CSV
by mscharrer (Hermit) on Apr 21, 2008 at 14:54 UTC
    You could use look-behind (?<= ) and look-ahead (?= ) expressions to look only for quotes which are not beside a comma or at end and start of the line:
    s/(?<=.)(?<!,)"(?!,|$)/\\"/g A simple test in the command line brings me:
    user@machine$ perl -pe 's/(?<=.)(?<!,)"(?!,|$)/\\"/g' "call from "friend"","call from "friend"","call from "friend"" "call from \"friend\"","call from \"friend\"","call from \"friend\""
    The 2nd line is input, the 3rd output.

    Looks good for me. Some special cases with escaped commas inside the strings might not match correctly. You should check this.

      mscharrer,

      I didn't mean for anyone else to do my work for me, but I certainly don't mind the help! Thank you! That's the kind of code I would need to write if the new release of Text::CSV_XS turns out not to work for some reason.

Re: Recommendations for parsing invalid CSV
by Tux (Monsignor) on Apr 21, 2008 at 22:48 UTC

    OK, I nailed it. C<allow_loose_quotes> now also works inside quoted fields, under the condition that the C<escape_char> is not equal to the C<quote_char>. Code like that is very unlikely to have some valid escape sequence anyway. From 0.43 onward, you can parse those lines with

    my $csv = Text::CSV_XS->new ({ binary => 1, allow_loose_quotes => 1, escape_char => undef, });

    Enjoy, Have FUN! H.Merijn
Re: Recommendations for parsing invalid CSV
by Bloodnok (Vicar) on Apr 22, 2008 at 13:51 UTC
    Hi ,

    Why not try Text::xSV ?

    ... then you don't have to bother about such things.

    From experience, the only thing that can't handle is a CSV export by Excel where a field is extraordinarily large (such that the field is split ... by Excel) ?

    At last, a user level that best describes my experience :-))

      There might be several reasons for choosing the right   module. See here for a more thorrough discussion of the differences between the available CSV parsing modules.

      Note that this initial post is somewhat outdated, and some of the +/-'s might have changed. e.g. Text::CSV_XS not can deal very well with NULL (undef) values.


      Enjoy, Have FUN! H.Merijn
Re: Recommendations for parsing invalid CSV
by Tux (Monsignor) on Apr 23, 2008 at 12:18 UTC

    Code speaks loader than words, and I got tired of explaining over and over and over again why bad CSV is bad and how to 'try to' parse invalid data-streams anyway, so I now have included a script that parses CSV streams, and reports lines that are wrong and then re-parses that line in `allow-mode'.

    I'll release shortly.


    Enjoy, Have FUN! H.Merijn

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://681919]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (14)
As of 2014-07-30 19:03 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (239 votes), past polls