Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

swapping PIPE for comma in CSV file

by dwhite20899 (Friar)
on Jun 26, 2007 at 13:00 UTC ( #623388=perlquestion: print w/ replies, xml ) Need Help??
dwhite20899 has asked for the wisdom of the Perl Monks concerning the following question:

I have some CSV data, and one field has embedded commas. I'd like to convert it to PIPE (|) separated data. I've come up with something that works, without needing to install extra modules, but there's got to be a better way.

Given the example data below, is there some concise way to do this? (not requiring module installation is a plus, but it's not out of the question)

"00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3 +C9786A","05E566DF","J0180794.JPG",32768,3290,"WIN","" "000005EE5E3F6961B78CE4549270DE5D05CBC0CB","8D025B6AE1994A40FCBB5AEC2E +F273F9","5E8D7D42","WabIab, and more.bor",4760,4616,"WIN","" "0000085FC602CD8AD4793A874A47D286DACB0F6A","8BA8BC04896C421A704282E9B8 +7B5520","8D89A85D","fpSDt,Finder,Link.gif",1161,2988,"Solaris","" "00000FF9D0ED9A6B53BC6A9364C07074DE1565F3","A5D49D6DA9D78FD1E7C32D58BC +7A46FB","2D729A1E","cmnres,pdb.dll",76800,1550,"WIN",""

Here's what I'm running now. I know that only the 4th field ($p [ 3 ]) may contain embedded commas. I split on commas; if none are embedded I join with PIPE. If some (possibly multiple) commas are embedded, I join the three leading fields with PIPE, the four trailing fields with PIPE, rejoin the 4th field with commas and print that.

perl -n -e '@p=split(/,/,$_); if (scalar(@p) == 8) { \ print join("|",@p);} else { $a = join("|",$p[0],$p[1],$p[2]); \ shift(@p); shift(@p); shift(@p); $c = join("|", $p[$#p-3],$p[$#p-2], + \ $p[$#p-1],$p[$#p]); pop(@p); pop(@p); pop(@p); pop(@p); \ print "$a|",join(",",@p),"|$c";}' A/NSRLFile.txt > /tmp/A-File.psv

Comment on swapping PIPE for comma in CSV file
Select or Download Code
Re: swapping PIPE for comma in CSV file
by pseudomonas (Monk) on Jun 26, 2007 at 13:24 UTC
    If your data is always going to have that number of fields, you can just work from the front and the back and comma-join all the bits in the middle.
    perl -F, -lane'chomp;print join("|",@F[0..2], join(",",  @F[3..$#F-4]),@F[-3..-1])' A/NSRLFile.txt
      Yeah - that's what I do now, in a VERY long-winded way. I didn't know about @F . Cool!
Re: swapping PIPE for comma in CSV file
by Tux (Monsignor) on Jun 26, 2007 at 13:31 UTC

    You could use Text::CSV_XS with two objects, and do it safe (untested):

    use strict; use warnings; use IO::Handle; use Text::CSV_XS; my $csv_in = Text::CSV_XS->new ({ binary => 1, }); my $csv_out = Text::CSV_XS->new ({ binary => 1, sep_char => "|", escape_char => "\\", eol => "\n", }); while (my $row = $csv_in->getline (*ARGV)) { $csv_out->print (*STDOUT, $row); }

    Enjoy, Have FUN! H.Merijn

      For the OP just to emphasize the importance of binary in the case you have a def. of CSV that permits embedded newlines.

      For Merijn.

      I was going to answer more or less the same to the OT yesterday, but came across a few problems, that made me reinstall the latest versions...

    • One problem was that I used IO::Wrap objects for stdin and stdout and they don't work with the pure perl version, I am not sure why. Maybe it would be better to load IO::Handle directly and have something for those who want efficiency. In this thread I wanted to test the pure perl version as installing an XS module could have been problematic for thew OP. I think that keeping in sync both versions is important...
    • for some reason search.cpan.org gives the version 0.29 Text::CSV_XS but perl -MCPAN -e install qw(Text::CSV_XS)' installs 0.30 the right one I believe if I remember your post on p5p or pm.

      % steph@apexPDell2 (/home/stephan) % % cat conv_comma2pipe_xs.px #!/usr/bin/perl use strict; use warnings; $|++; #use IO::Handle; use IO::Wrap; use Text::CSV_XS; # use DDS; # my $in = IO::Wrap::wraphandle(\*STDIN) or die; # my $out = IO::Wrap::wraphandle(\*STDOUT) or die; # Dump\($in, $out); my $csv_in = Text::CSV_XS->new({ binary => 1, }) or die; my $csv_out = Text::CSV_XS->new({ binary => 1, sep_char => q{|}, eol => qq{\n}, }) or die; while (defined (my $rec = $csv_in->getline(\*STDIN)) ) { { my @fields = @$rec; local $"=q{][}; print {\*STDERR} ".rec [@fields]\n"; } $csv_out->print(\*STDOUT, $rec); } __END__ % steph@apexPDell2 (/home/stephan) % % cat hi1.csv | perl+ -w conv_comma2pipe_xs.px .rec [a][b][c] a|b|c .rec [a][okay, comma][c] a|"okay, comma"|c .rec [a][long line, indeed][end] a|"long line, indeed"|end % steph@apexPDell2 (/home/stephan) % % cat hi1.csv a,b,c a,"okay, comma",c a,"long line, indeed",end
      cheers --stephan p.s I tested on cygwin with perl 5.8.7 and 5.8.8 update: oops forgot the code...
        One problem was that I used IO::Wrap objects for stdin and stdout and they don't work with the pure perl version, I am not sure why. Maybe it would be better to load IO::Handle directly and have something for those who want efficiency. In this thread I wanted to test the pure perl version as installing an XS module could have been problematic for thew OP. I think that keeping in sync both versions is important...

        The maintainer of Text::CSV_PP is doing a real nice job in trying to keep it in sync with Text::CSV_XS and we do have (a lot) of contact about that. I already had a look at version 1.06, and it passed all tests for 0.30, except the diagnostics tests, which is logical and explainable.

        That maintainer also got the maintainership for the very old Text::CSV, which will be a wrapper around Text::CSV_XS and Text::CSV_PP and choose the one available, based on a method used in DBI::PurePerl: the environment variable TEXT_CSV_XS, and will default to the fastest method available.

        I have been thinking about the use of IO::Handle, making it either default, or use'd automatically, but everything I came up with so far will imply a slowdown, which is IMHO unacceptable. I't a bit of a shame that this is a relative expensive module to load (14 kb of source code).

        for some reason search.cpan.org gives the version 0.29 Text::CSV_XS but perl -MCPAN -e install qw(Text::CSV_XS)' installs 0.30 the right one I believe if I remember your post on p5p or pm.

        Maybe I've been working too hard lately on this module, and uploaded too many versions :) Give CPAN some time to sync around the world.

        I'll have a look at the IO::Wrap thingy


        Enjoy, Have FUN! H.Merijn
Re: swapping PIPE for comma in CSV file
by pseudomonas (Monk) on Jun 26, 2007 at 13:37 UTC
    Alternatively, you can always use regexes. This is being silly, though.
    perl -lane'sub f{s/^([^,]+),/$1|/} sub b{s/,([^,]+)$/|$1/} f;f;f;b;b;b;b; print' A/NSRLFile.txt
    Like another commenter suggested, though, I'd use a module unless you really have a good reason not to.
Re: swapping PIPE for comma in CSV file
by tirwhan (Abbot) on Jun 26, 2007 at 13:45 UTC

    Don't reinvent the wheel, use Text::CSV_XS:

    #!/usr/bin/perl use warnings; use strict; use Text::CSV_XS; my $csv = Text::CSV_XS->new(); while (<DATA>) { my $line = $csv->parse($_); print join("|",$csv->fields)."\n"; } __DATA__ "00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3 +C9786A","05E566DF","J0180794.JPG",32768,3290,"WIN","" "000005EE5E3F6961B78CE4549270DE5D05CBC0CB","8D025B6AE1994A40FCBB5AEC2E +F273F9","5E8D7D42","WabIab, and more.bor",4760,4616,"WIN","" "0000085FC602CD8AD4793A874A47D286DACB0F6A","8BA8BC04896C421A704282E9B8 +7B5520","8D89A85D","fpSDt,Finder,Link.gif",1161,2988,"Solaris","" "00000FF9D0ED9A6B53BC6A9364C07074DE1565F3","A5D49D6DA9D78FD1E7C32D58BC +7A46FB","2D729A1E","cmnres,pdb.dll",76800,1550,"WIN",""

    OTOH, why do this at all? Commas within quotation marks are common enough in CSV files and (as you've found out) are easy enough to handle.


    All dogma is stupid.
      Nice. I'll have to dig into CSV_XS.

      Yeah, I can handle it as is, but I'm getting asked to convert it for others who can't. It's due to a legacy spec that wasn't the greatest, but whatcha gonna do? (I got them to change the spec to put the three static width columns first, and paid for that with a lot of blood - not gonna ask for more)

Re: swapping PIPE for comma in CSV file
by Anonymous Monk on Jun 26, 2007 at 14:12 UTC
    I generally don't have the luxury of using anything not mentioned on perlfunc. This works for me without any special functions:
    #!/usr/bin/perl -w use strict; { print join ('|', (/(^"[^"]*"),("[^"]*"),("[^"]*"),("[^"]*"),([0-9] +*),([0-9]*),("[^"]*")/), "\"\"\n") while (<DATA>); } __DATA__ "00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3 +C9786A","05E566DF","J0180794.JPG",32768,3290,"WIN","" "000005EE5E3F6961B78CE4549270DE5D05CBC0CB","8D025B6AE1994A40FCBB5AEC2E +F273F9","5E8D7D42","WabIab, and more.bor",4760,4616,"WIN","" "0000085FC602CD8AD4793A874A47D286DACB0F6A","8BA8BC04896C421A704282E9B8 +7B5520","8D89A85D","fpSDt,Finder,Link.gif",1161,2988,"Solaris","" "00000FF9D0ED9A6B53BC6A9364C07074DE1565F3","A5D49D6DA9D78FD1E7C32D58BC +7A46FB","2D729A1E","cmnres,pdb.dll",76800,1550,"WIN",""
    It marks everything in a field, with the fourth field (third if zero based) being the variable-length one. I'm sure you could break it down even further using only perlfuncs but it's a working solution based on the provided input. This is my first post to perlmonks btw but I'm a longtime perl user.
Re: swapping PIPE for comma in CSV file
by jwkrahn (Monsignor) on Jun 26, 2007 at 14:14 UTC
    $ echo '"00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D629 +96392FE3C9786A","05E566DF","J0180794.JPG",32768,3290,"WIN","" "000005EE5E3F6961B78CE4549270DE5D05CBC0CB","8D025B6AE1994A40FCBB5AEC2E +F273F9","5E8D7D42","WabIab, and more.bor",4760,4616,"WIN","" "0000085FC602CD8AD4793A874A47D286DACB0F6A","8BA8BC04896C421A704282E9B8 +7B5520","8D89A85D","fpSDt,Finder,Link.gif",1161,2988,"Solaris","" "00000FF9D0ED9A6B53BC6A9364C07074DE1565F3","A5D49D6DA9D78FD1E7C32D58BC +7A46FB","2D729A1E","cmnres,pdb.dll",76800,1550,"WIN",""' | \ perl -nle'print join q/"|"/, split /","/' "00000142988AFA836117B1B572FAE4713F200567"|"9B3702B0E788C6D62996392FE3 +C9786A"|"05E566DF"|"J0180794.JPG",32768,3290,"WIN"|"" "000005EE5E3F6961B78CE4549270DE5D05CBC0CB"|"8D025B6AE1994A40FCBB5AEC2E +F273F9"|"5E8D7D42"|"WabIab, and more.bor",4760,4616,"WIN"|"" "0000085FC602CD8AD4793A874A47D286DACB0F6A"|"8BA8BC04896C421A704282E9B8 +7B5520"|"8D89A85D"|"fpSDt,Finder,Link.gif",1161,2988,"Solaris"|"" "00000FF9D0ED9A6B53BC6A9364C07074DE1565F3"|"A5D49D6DA9D78FD1E7C32D58BC +7A46FB"|"2D729A1E"|"cmnres,pdb.dll",76800,1550,"WIN"|""
      You forgot what I initially forgot, and that's the numbers that don't have quotes. I think the end one line should look like:
      "|"WabIab, and more.bor"|4760|4616|"WIN"|""
      instead of:
      "|"WabIab, and more.bor",4760,4616,"WIN"|""
      But somebody can feel free to correct me if I'm wrong.
Re: swapping PIPE for comma in CSV file
by johngg (Abbot) on Jun 26, 2007 at 14:59 UTC
    Using split with a limit along with reverse seems to be a good way to isolate the fields in this particular case. However, I would agree with some other Monks recommendations to use a module in most circumstances. I have cut down the length of the fields to avoid too much line wrapping.

    use strict; use warnings; while ( <DATA> ) { my @f012 = split m{,}, $_, 4; my $rest = pop @f012; my @f7654 = map { my $r = reverse } split m{,}, reverse($rest), 5; my $f3 = pop @f7654; print join q{|}, @f012, $f3, reverse @f7654; } __END__ "13F2","E3C9","05E5","J0180794.JPG",32768,3290,"WIN","" "D05C","2EF2","5E8D","WabIab, and more.bor",4760,4616,"WIN","" "6DAC","B87B","8D89","fpSDt,Finder,Link.gif",1161,2988,"Solaris","" "4DE1","BC7A","2D72","cmnres,pdb.dll",76800,1550,"WIN",""

    The output.

    "13F2"|"E3C9"|"05E5"|"J0180794.JPG"|32768|3290|"WIN"|"" "D05C"|"2EF2"|"5E8D"|"WabIab, and more.bor"|4760|4616|"WIN"|"" "6DAC"|"B87B"|"8D89"|"fpSDt,Finder,Link.gif"|1161|2988|"Solaris"|"" "4DE1"|"BC7A"|"2D72"|"cmnres,pdb.dll"|76800|1550|"WIN"|""

    I hope this is of interest.

    Cheers,

    JohnGG

    Update: There's no need for the lexical in the map, $_ will do, and there's no need to pop off $f3 as we are not doing anything with it. Code becomes

    while ( <DATA> ) { my @f012 = split m{,}, $_, 4; my $rest = pop @f012; my @f76543 = map { $_ = reverse } split m{,}, reverse($rest), 5; print join q{|}, @f012, reverse @f76543; }

    Update 2: Using scalars for the first split means the pop can be avoided and incorporate all the rest in the print statement. Code becomes

    while ( <DATA> ) { my ($f0, $f1, $f2, $rest) = split m{,}, $_, 4; print join q{|}, $f0, $f1, $f2, reverse map { $_ = reverse } split m{,}, reverse($rest), 5; }
Re: swapping PIPE for comma in CSV file
by Ploux (Acolyte) on Jun 26, 2007 at 15:05 UTC
    Why not this?
    s/([0-9"]),([0-9"])/$1\|$2/g
      An embedded comma might sit between two numerals or at the beginning or end of the field next to the quotes.
        Somehow I knew it couldn't be that easy.
Re: swapping PIPE for comma in CSV file
by girarde (Friar) on Jun 26, 2007 at 15:15 UTC
    I dunno, assuming that the quotes aren't still desired I'd split on /","/, join on |, and trim the trailing and leading quotes.

    Maybe I'm missing something here.

      A few people missed this (myself included). You have to split on a pair of double quotes or no quotes at all. There's the field with commas in it which you can't replace with a pipe so a simple regex or split/join won't work (at least without a massive regex on the split).
        Right, but I'm splitting on the 'quote, comma, quote' pattern, so embedded commas sail through. You can split on a pattern, although join wants a string.
Re: swapping PIPE for comma in CSV file
by Anonymous Monk on Jun 26, 2007 at 20:49 UTC
    all you need is not_quote-comma-quote, quote-comma-quote, and quote-comma-not_quote:

    echo <DATA> | perl -F'(?<=[^"]),(?=")|(?<="),(?=")|(?<="),(?=[^"])' -lane 'print join("|",@F)'

    though i could be wrong, i just learned about look-ahead/behind today
    --jd
Re: swapping PIPE for comma in CSV file
by Anonymous Monk on Jun 27, 2007 at 09:20 UTC

    i would strongly agree with others who have recommended use of a standard, well tested csv-parsing module.

    however, if you really have to go commando (and do it on the command line to boot), the following (tested) code should work.

    note that:

    • NSRLFile.csv contains the example data from the op
    • NSRLFile.psv contains the processed, pipe character-separated data
    • the code makes no assumptions about the absence or presence whitespace around the value-separating commas or at the beginning or end of records
    • double-quoted strings can contain anything, even backslash-escaped double-quotes (or any other backslash-escaped character)
    • the only value supported other than a double-quoted string is a simple numeric with NO sign, decimal point or embedded commas
    • the code makes no assumptions about the order of the fields in a record
    • had to use \042 in place of a double-quote character because otherwise, the xp command-line interpreter gets confused and thinks the first pipe character it sees (in a regex) is an actual command-line pipe operator
    perl -wlpe "use strict; my $quo = qq(\042); my $esc = qq(\\\\); my $body = qr/[^$quo$esc]/; my $escaped = qr/$esc ./x; my $sep = qr/ \s* , \s* | \s* \Z /x; my $quoted = qr( $quo $body* (?: $escaped $body* )* $quo )x; my $number = qr(\d+); $_ = join '|', m/( $quoted | $number ) (?= $sep )/gx" NSRLFile.csv > NSRLfile.psv
Re: swapping PIPE for comma in CSV file
by Moron (Curate) on Jun 27, 2007 at 10:23 UTC
    I don't see why you can't just s/\",\"/\"|\"/g - embedded commas won't mess that up - even a single "," field won't mess up because the "," match at separator preceding it will take precedence and s/ will subsequently position after the executed substition at ,",", causing the next match to occur also at the correct place. (Update: Tested it now on "fred",",","bert" and it worked).

    I am not going to jump on the "always use a module" bandwagon, because only the OPer can possibly know what the issues are surrounding that - I seem to have to keep saying it in here: it is unsafe to draw conclusions when potentially missing crucial information.

    __________________________________________________________________________________

    ^M Free your mind!

      I don't see why you can't just s/\",\"/\"|\"/g

      Perhaps you should have taken a better look at the data in the original node then, or read the several answers that detail why that won't work in this case! (Hint: there are unquoted numbers in the data). Honestly, posting an answer to a day-old thread without even reading the preceding answers seems a complete waste of everyones time.


      All dogma is stupid.
Re: swapping PIPE for comma in CSV file
by LighthouseJ (Sexton) on Jun 27, 2007 at 14:24 UTC
    I posted the solution above with only using perlfunc. I just registered this login today.

    I found yet another solution, which is radically different from anything else I've seen.

    I got to thinking about my other comment correcting someone else that we should be concerned less with the commas and more with the quotes because they provide a more concrete "rule" for delimiting the items. I was writing an array definition for another file and noticed the definition was identical to the provided data, so I wrote a program which simply evals the data into an array, then prints it out in the right format. I'm sure somebody can write something to clean up the output and make it more elegant but the eval does a bang-up job of reading the data properly.

    #!/usr/bin/perl -w use strict; { while (<DATA>) { my @i = (); eval "\@i = ($_);"; print "\"$i[0]\"|\"$i[1]\"|\"$i[2]\"|\"$i[3]\"|$i[4]|$i[5]|\"$i[6] +\"|\"$i[7]\"\n"; } } __DATA__ "00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3 +C9786A","05E566DF","J0180794.JPG",32768,3290,"WIN","" "000005EE5E3F6961B78CE4549270DE5D05CBC0CB","8D025B6AE1994A40FCBB5AEC2E +F273F9","5E8D7D42","WabIab, and more.bor",4760,4616,"WIN","" "0000085FC602CD8AD4793A874A47D286DACB0F6A","8BA8BC04896C421A704282E9B8 +7B5520","8D89A85D","fpSDt,Finder,Link.gif",1161,2988,"Solaris","" "00000FF9D0ED9A6B53BC6A9364C07074DE1565F3","A5D49D6DA9D78FD1E7C32D58BC +7A46FB","2D729A1E","cmnres,pdb.dll",76800,1550,"WIN",""
    Hopefully this is useful to somebody.

    "The three principal virtues of a programmer are Laziness, Impatience, and Hubris. See the Camel Book for why." -- `man perl`
      oooo, oooo...

      perl -wlpe "s/(\042[^\042]*?\042)/'$1'/g; $_ = join '|', eval qq/($_)/" NSRLFile.csv > NSRLfile.psv
      see previous post for definitions of NSRLFile.* files.
      \042 is a " char.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://623388]
Approved by Corion
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (17)
As of 2014-09-30 15:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (376 votes), past polls