Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

parsing CSV file with embedded commas (fortunately, fixed-width) - is unpack the solution?

by amelinda (Friar)
on Oct 31, 2001 at 22:35 UTC ( #122392=perlquestion: print w/replies, xml ) Need Help??

amelinda has asked for the wisdom of the Perl Monks concerning the following question:

So, I have this CSV file of data. Some numbers, some text. The first few lines look like:
'12342248','BAR ','12JUNK ','6532918 12" JUNK ON BAR-709/8709/811 ', +' ','A',01,'N',' ','01',000000,+000030.00,+000150.00,'I',' + ' '13121751','FOO ','13FUNK ','6715500 FOO FUNK-331/334 ', +' ','A',03,'N',' ','99',000000,+000134.00,+000275.00,'I',' + '
So, really, I can just strip off the 's and extra spaces and then split on the commas, right? That's what I currently have going. All seemed to be ok... But I am horribly wrong.

Further down, there are lines like...

'18592038','BL ','17BLE ','6715500 17" BLETCH BLE-333,334 ', +' ','A',01,'N',' ','00',000000,+000007.00,+000200.00,'R',' + '
Notice how one of the text fields has a few entries with an embedded comma. In some, other records, one of the other text fields has an embedded '. Now what?

Unpack seems to be the answer, but the documentation in the Camel (and what I've supersearched here) is pretty sketchy. Can anyone help? I'd really appreciate explanations that include how and why, so that I might be converted to the happy use of unpack.

  • Comment on parsing CSV file with embedded commas (fortunately, fixed-width) - is unpack the solution?
  • Select or Download Code

Replies are listed 'Best First'.
(tye)Re: parsing CSV file with embedded commas (fortunately, fixed-width) - is unpack the solution?
by tye (Sage) on Oct 31, 2001 at 22:47 UTC

    Something close to

    @fields= unpack "xA8x3A5x3A8x3A37x3A1x3A1x2A2x2A1x3A2x3A2x2A6x1" . "A10x1A10x2A1x3A16", $record;
    should work.

            - tye (but my friends call me "Tye")
      That makes some sense... A seems to be for text (ascii?)... though I'm not sure what the difference tween A and a is. But what're the x's for? Single chars? Punctuation? !text?

        "a$n" pulls out the next $n characters of text as a string. "A$n" does the same thing but strips spaces and "\0"s from the end. "x$n" skips $n bytes.

                - tye (but my friends call me "Tye")
Re (tilly) 1: parsing CSV file with embedded commas (fortunately, fixed-width) - is unpack the solution?
by tilly (Archbishop) on Oct 31, 2001 at 22:52 UTC
    Also note that Text::xSV will solve the problem without assuming fixed width.

    That solution also handles embedded returns correctly. (In case anyone ever pulls that on you.)

      Note that Text::xSV can't handle (quite reasonably) the unescaped embedded quote characters inside of quoted fields that amelinda's data appear to have.

              - tye (but my friends call me "Tye")
        It could be made to though. That could even be made configurable, though I would prefer it to default to not for the simple reason that I don't like it.

        A bigger problem is that I only consider " a quote character. I will need to check whether ' also qualifies for Microsoft products. If it does I should make it handle that as well. (They never bother writing it, so I didn't either.) If not, then I could make that configurable as well...

      Actually, I'm more interested in how it could handle the unescaped embedded comma in one of the fields.

      If only splitting (or Text::xSVing) would handle that extraneous comma, i could worry about stripping off the quote marks later. It was the combination the unescaped embedded comma, the unescaped embedded quote, and the fact that only some fields were marked off in quotes that led me down the road to unpack.

        Embedded commas aren't a problem for CSV. The whole point of putting quotes around fields in the CSV is so that embedded commas can be dealt with. Embedded quotes aren't a problem if they are escaped (by doubling them).

        If you are curious how this is done, then I encourage you to download Text::CSV and/or Text::xSV (or just look at Text::xSV locally) and look for yourself. (:

                - tye (but my friends call me "Tye")
Re: parsing CSV file with embedded commas (fortunately, fixed-width) - is unpack the solution?
by amelinda (Friar) on Oct 31, 2001 at 22:44 UTC
    Bother. I should have mentioned an additional quirk.

    One reason I went with split is for the following construction:
    push (@items, [split(/,/, $line)]); where @items is a LoL.

    If I have to go with unpack, how should I go about pushing them into @items?

    Did I mention that this is not going to be on my server, so I can't rely on the presence of modules? :sigh:

    Update: go go gadget unpack. It works now.

    Thanks go especially to tye. I appreciate the other answers too (I'm definitely going to look into Text::ParseWords and Text::xSV for this in the future with cleaner data), but most of them had issues with either the the embedded comma or the embedded single quote.

    As for those who berated me (again) for not wanting to use modules, I submit that I looked to unpack because a) it looked like the only solution that would cope with the embedded delineators issue and b) I didn't really understand how unpack works and I knew I needed to learn about it... especially if i am ever to grok obfuscated code. :)

    Update 2: To clarify, these would be unescaped commas and quotemarks. Also, I didn't provide a line of data here that had an unescaped, embedded quotemark, but I do have some.

Re: parsing CSV file with embedded commas (fortunately, fixed-width) - is unpack the solution?
by MZSanford (Curate) on Oct 31, 2001 at 22:44 UTC
    You could use unpack similar to the following (as long as the fields are fixed width) :
    my $data = qq!'a','b, and c','d '!; my ($a,undef,$b,undef,$c) = unpack("a3a1a10a1a3",$data);

    If the fields are not fixed width, unpack may not be the bext way to go.
    i had a memory leak once, and it ruined my favorite shirt.
Re: parsing CSV file with embedded commas (fortunately, fixed-width) - is unpack the solution?
by davorg (Chancellor) on Oct 31, 2001 at 23:39 UTC

    Sounds to me like Text::ParseWords will do the trick.

    use Text::ParseWords; my @items; while (<>) { push @items, [ quotewords(',', 0, $_) ]; }

    It's part of the standard Perl distribution.

    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you don't talk about Perl club."

      Note that this will have problems (quite reasonably) dealing with the unescaped single quotes inside of single-quoted fields that amelinda's data appears to have.

              - tye (but my friends call me "Tye")

        Hmm... well you're right that it would break on unescaped single quotes, but looking at amelinda's data I couldn't see any. Double quotes yes, but not single quotes. That's why I sugggested it.

        --
        <http://www.dave.org.uk>

        "The first rule of Perl club is you don't talk about Perl club."

Re: parsing CSV file with embedded commas (fortunately, fixed-width) - is unpack the solution?
by Zaxo (Archbishop) on Oct 31, 2001 at 22:55 UTC

    chomp $line; my @record = split /','/, $line; substr($record[0],0,1) = ''; chop $record[-1];
    I threw in a variety of functions to enjoy.

    After Compline,
    Zaxo

      Hm. Close. I thought of that too. But that breaks on at least one field which is not enclosed with 's.
Re: parsing CSV file with embedded commas (fortunately, fixed-width) - is unpack the solution?
by shotgunefx (Parson) on Nov 01, 2001 at 01:47 UTC
    I work with CSV files a LOT. I'd go with Text::CSV or Text::xSV personally.

    I don't know where the data is coming from, but it should be getting exported correctly in the first place. Even if you hack a solution with split or unpack, it's probably only a matter of time before a change in data causes your workaround to start failing again.

    -Lee

    "To be civilized is to deny one's nature."

Re: parsing CSV file with embedded commas (fortunately, fixed-width) - is unpack the solution?
by runrig (Abbot) on Nov 01, 2001 at 05:03 UTC
    unpack is a good answer as long as it's a Fixed Width format file and nothing else seems to work quite right since its not quite actual CSV. You might also want to try Parse::FixedLength, it might go something like this:
    use Parse::FixedLength: # Set field names and lengths # (you might want to have more meaningful names) :-) my $parser = Parse::FixedLength->new([ field1=>8, field2=>10, ...etc. ]); my @names = @{$parser->names}; while (<>) { my $data = $parser->parse($data); # Strip leading and trailing quotes and spaces? s/^'?\s+|\s+'?,$//g for values %$data; # Output real csv, e.g. s/"/""/g for values %$data; print join(",", map qq("$_"), @$data{@names}),"\n"; }
      Hi, I have tried your new method but it's some error message here : Can't locate object method "new" via package "Parse::FixLength" (perhaps you forgot to load "Parse::FixLength"?) at ./scal.pl line 18. Do you know what that's mean ?
        Can't locate object method "new" via package "Parse::FixLength"
        Did you install the module? If you installed it, you must have spelled it correctly once ('Parsed::FixedLength'). You just need to spell it correctly again.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://122392]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (3)
As of 2020-10-20 00:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My favourite web site is:












    Results (208 votes). Check out past polls.

    Notices?