Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

parsing malformed CSV with per column quote chars

by bulk88 (Priest)
on Aug 07, 2014 at 16:25 UTC ( [id://1096662]=perlquestion: print w/replies, xml ) Need Help??

bulk88 has asked for the wisdom of the Perl Monks concerning the following question:

I have some interestingly formatted CSV I need to get into a Perl hash. One of the columns uses "s as the quote char, all other columns use 's. The " column has 's in it. See the "JOHNS FLYING DOG" row. The last time this data format was used, I wrote text editor regexes to turn the CSV data into a literal perl hash of array refs, then processes that data structure into a hash of hashes. I'd like a better solution than the previous one (I also lost the previous solution). How do I get this CSV into a Perl hash? I am not sure Text::CSV has the options to do it, and I couldn't easily find any other CSV parser engines on CPAN (everything seems to be wrappers around Text::CSV or CSV_XS).
use Text::CSV::Hashify; use Data::Dumper; my $filename = 'bad.csv'; $obj = Text::CSV::Hashify->new( { file => $filename, format => 'hoh', key => "PRODUCT CODE", quote_char => "'", escape_char => "|", #wont be found in the data, turn off +escaping auto_diag => 1, } ); print Dumper($obj->all);
The CSV file
'PRODUCT CODE','CATEGORY','CATEGORY DESCRIPTION','CODE DESCRIPTION','O +PTIONAL CATEGORY','OPTIONAL CATEGORY DESCRIPTION' ' ','0 ','No Item',"INVALID CODE IN USER SUPPLIED DATA",' ',' ' '00100','1 ','Cat',"ORANGE CAT",' ',' ' '82131','94 ','Dog',"GREEN DOG",' ',' ' '82132','94 ','Dog',"'JOHNS' FLYING' DOG (Start 2001)",' ',' ' '82133','94 ','Dog',"MAGENTA DOG (End 2009)",' ',' '
The error message
# CSV_XS ERROR: 2034 - EIF - Loose unescaped quote @ rec 4 pos 24 $VAR1 = { ' ' => { 'CODE DESCRIPTION' => '"INVALID CODE IN USER SU +PPLIED DATA"', 'OPTIONAL CATEGORY DESCRIPTION' => ' ', 'CATEGORY' => '0 ', 'PRODUCT CODE' => ' ', 'CATEGORY DESCRIPTION' => 'No Item', 'OPTIONAL CATEGORY' => ' ' }, '82131' => { 'CODE DESCRIPTION' => '"GREEN DOG"', 'CATEGORY' => '94 ', 'OPTIONAL CATEGORY DESCRIPTION' => ' ', 'PRODUCT CODE' => '82131', 'OPTIONAL CATEGORY' => ' ', 'CATEGORY DESCRIPTION' => 'Dog' }, '00100' => { 'OPTIONAL CATEGORY' => ' ', 'CATEGORY DESCRIPTION' => 'Cat', 'CODE DESCRIPTION' => '"ORANGE CAT"', 'CATEGORY' => '1 ', 'OPTIONAL CATEGORY DESCRIPTION' => ' ', 'PRODUCT CODE' => '00100' } };

Replies are listed 'Best First'.
Re: parsing malformed CSV with per column quote chars
by Tux (Canon) on Aug 07, 2014 at 18:04 UTC

    Update. Can reproduce

    $ cat test.csv 'PRODUCT CODE','CATEGORY','CATEGORY DESCRIPTION','CODE DESCRIPTION','O +PTIONAL CATEGORY','OPTIONAL CATEGORY DESCRIPTION' ' ','0 ','No Item',"INVALID CODE IN USER SUPPLIED DATA",' ',' ' '00100','1 ','Cat',"ORANGE CAT",' ',' ' '82131','94 ','Dog',"GREEN DOG",' ',' ' '82132','94 ','Dog',"'JOHNS' FLYING' DOG (Start 2001)",' ',' ' '82133','94 ','Dog',"MAGENTA DOG (End 2009)",' ',' ' $ perldoc -m CSV | head -7 use Text::CSV_XS qw( csv ); use Data::Peek; BEGIN { *CSV:: = \%Text::CSV_XS::; } $VERSION = "0.02"; 1; $ perl -MCSV -e'DDumper csv(in=>"test.csv",diag_verbose=>9,quote_char= +>"\x27",escape_char=>undef)' # CSV_XS ERROR: 2034 - EIF - Loose unescaped quote @ rec 4 pos 24 '82132','94 ','Dog',"'JOHNS' FLYING' DOG (Start 2001)",' ',' ' ' ^ [ [ 'PRODUCT CODE', 'CATEGORY', 'CATEGORY DESCRIPTION', 'CODE DESCRIPTION', 'OPTIONAL CATEGORY', 'OPTIONAL CATEGORY DESCRIPTION' ], [ ' ', '0 ', 'No Item', '"INVALID CODE IN USER SUPPLIED DATA"', ' ', ' ' ], [ '00100', '1 ', 'Cat', '"ORANGE CAT"', ' ', ' ' ], [ 82131, '94 ', 'Dog', '"GREEN DOG"', ' ', ' ' ] ]

    You need allow_loose_quotes:

    perl -MCSV -e'DDumper csv(in=>"test.csv",diag_verbose=>9,quote_char=>" +\x27",escape_char=>undef,allow_loose_quotes=>1)' [ [ 'PRODUCT CODE', 'CATEGORY', 'CATEGORY DESCRIPTION', 'CODE DESCRIPTION', 'OPTIONAL CATEGORY', 'OPTIONAL CATEGORY DESCRIPTION' ], [ ' ', '0 ', 'No Item', '"INVALID CODE IN USER SUPPLIED DATA"', ' ', ' ' ], [ '00100', '1 ', 'Cat', '"ORANGE CAT"', ' ', ' ' ], [ 82131, '94 ', 'Dog', '"GREEN DOG"', ' ', ' ' ], [ 82132, '94 ', 'Dog', '"\'JOHNS\' FLYING\' DOG (Start 2001)"', ' ', ' ' ], [ 82133, '94 ', 'Dog', '"MAGENTA DOG (End 2009)"', ' ', ' ' ] ]

    To get (with Text::CSV_XS) hashes, you need headers and a map:

    $ perl -MCSV -e'DDumper{map{$_->{"PRODUCT CODE"}=>$_}@{csv(in=>"test.c +sv",diag_verbose=>9,quote_char=>"\x27",escape_char=>undef,allow_loose +_quotes=>1,headers=>"auto")}}' { ' ' => { CATEGORY => '0 ', 'CATEGORY DESCRIPTION' => 'No Item', 'CODE DESCRIPTION' => '"INVALID CODE IN USER SUPPLIED DATA"', 'OPTIONAL CATEGORY' => ' ', 'OPTIONAL CATEGORY DESCRIPTION' => ' ', 'PRODUCT CODE' => ' ' }, '00100' => { CATEGORY => '1 ', 'CATEGORY DESCRIPTION' => 'Cat', 'CODE DESCRIPTION' => '"ORANGE CAT"', 'OPTIONAL CATEGORY' => ' ', 'OPTIONAL CATEGORY DESCRIPTION' => ' ', 'PRODUCT CODE' => '00100' }, 82131 => { CATEGORY => '94 ', 'CATEGORY DESCRIPTION' => 'Dog', 'CODE DESCRIPTION' => '"GREEN DOG"', 'OPTIONAL CATEGORY' => ' ', 'OPTIONAL CATEGORY DESCRIPTION' => ' ', 'PRODUCT CODE' => 82131 }, 82132 => { CATEGORY => '94 ', 'CATEGORY DESCRIPTION' => 'Dog', 'CODE DESCRIPTION' => '"\'JOHNS\' FLYING\' DOG (Start 2001)"', 'OPTIONAL CATEGORY' => ' ', 'OPTIONAL CATEGORY DESCRIPTION' => ' ', 'PRODUCT CODE' => 82132 }, 82133 => { CATEGORY => '94 ', 'CATEGORY DESCRIPTION' => 'Dog', 'CODE DESCRIPTION' => '"MAGENTA DOG (End 2009)"', 'OPTIONAL CATEGORY' => ' ', 'OPTIONAL CATEGORY DESCRIPTION' => ' ', 'PRODUCT CODE' => 82133 } }

    Enjoy, Have FUN! H.Merijn

      You'll need to add a post-processing step to remove the double quotes. And you'll have to hope that none of the in-double-quotes fields contain any comma characters.

      - tye        

        Yes, the usual drawbacks. I was merely answering on the required options to make that work.


        Enjoy, Have FUN! H.Merijn
Re: parsing malformed CSV with per column quote chars (SMoP)
by tye (Sage) on Aug 07, 2014 at 18:28 UTC

    It isn't CSV. So just write a parser for it. This isn't rocket surgery.

    #!/usr/bin/perl -w use strict; my @data; while( <DATA> ) { my @row; while( /\G(?=.)/gc ) { my $val = undef; /\G\s*/gc; if( /\G'/gc ) { my $p = pos(); /\G(?:[^']+|'')*/gc; $val = substr( $_, $p, pos()-$p ); die "Unclosed '\n" if ! /\G'/gc; } elsif( /\G"/gc ) { my $p = pos(); /\G(?:[^"]+|"")*/gc; $val = substr( $_, $p, pos()-$p ); die "Unclosed \"\n" if ! /\G"/gc; } else { my $p = pos(); $val = substr( $_, $p, pos()-$p ) if /\G[^'",]+/gc; $val =~ s/\s*$//; } /\G\s*/gc; die "Bad data\n" if ! /\G(,|$)/gc; push @row, $val; } push @data, \@row; } __END__ 'PRODUCT CODE','CATEGORY','CATEGORY DESCRIPTION','CODE DESCRIPTION','O +PTIONAL CATEGORY','OPTIONAL CATEGORY DESCRIPTION' ' ','0 ','No Item',"INVALID CODE IN USER SUPPLIED DATA",' ',' ' '00100','1 ','Cat',"ORANGE CAT",' ',' ' '82131','94 ','Dog',"GREEN DOG",' ',' ' '82132','94 ','Dog',"'JOHNS' FLYING' DOG (Start 2001)",' ',' ' '82133','94 ','Dog',"MAGENTA DOG (End 2009)",' ',' '

    Yep, not hard; worked the first try. Took a few minutes to write.

    (Update: I neglected to post-process escaped quotation marks. Of course, no provision for escaped quotation marks was given in the original problem so I just implemented the simplest version, which might not be appropriate.)

    - tye        

Re: parsing malformed CSV with per column quote chars
by dasgar (Priest) on Aug 07, 2014 at 18:19 UTC

    Just did a quick search on metacpan for "tie hash csv". It looks like Tie::Handle::CSV might be doing something along the lines of what you want. After doing a quick scan through its documentation, it looks like it shouldn't be too difficult to create a hash using this module.

    If I understand your previously used method correctly, Tie::Array::CSV can easily parse out a CSV file into an AoA structure - or as I like to think of it, a two dimensional array. It also allows you to pass parameters to Tie::File and Text::CSV, which it uses to do all of the real work.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1096662]
Approved by AppleFritter
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others learning in the Monastery: (6)
As of 2024-04-19 03:08 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found