Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Modifying CSV File

by roho (Bishop)
on Jun 16, 2015 at 16:07 UTC ( [id://1130628]=perlquestion: print w/replies, xml ) Need Help??

roho has asked for the wisdom of the Perl Monks concerning the following question:

In the code below, I am replacing the comma with a dash in the quoted string in each input line. The code works, but I'm wondering if there is a way to do this in one statement. I tried a few variations but the stumbling block was always that the "capture" variable $1 is read-only.

BTW, before someone mentions using Text::CSV_XS, I have a version that uses the module but it takes over 4 times longer to process. The real files I will be processing will be huge and I need all the speed I can squeeze out of this.

FYI: The reason I am doing this is because SQL Server BULK INSERT does not respect quoted values. It sees the comma in the quoted string and treats it as a field terminator. :(

#!/usr/bin/perl use strict; use warnings; while (<DATA>) { chomp; m/"([^"]+)"/; # Extract quoted value. my $val = $1; # Save value in variable. $val =~ s/,/-/; # Change comma to dash. s/"[^"]+"/$val/; # Replace quoted value. print $_, "\n"; } __DATA__ 65722417,"1193,1",7980,1133566,4169735,035,FEDERAL UNIVERSAL SERVICE F +UND,0.12998 65722417,"1193,1",1012,1132900,4150053,C2,Carrier Cost Recovery Fee,0. +0273

"Its not how hard you work, its how much you get done."

Replies are listed 'Best First'.
Re: Modifying CSV File
by AnomalousMonk (Archbishop) on Jun 16, 2015 at 16:16 UTC
    c:\@Work\Perl\monks>perl -wMstrict -le "my $s = '65722417,\"1193,1\",1012,\"9,8,7\",C2,Carrier Cost Recovery +Fee,0.0273'; print qq{'$s'}; ;; $s =~ s{ (\" [^^\x22]+ \") }{ (my $one = $1) =~ s{,}{-}xmsg; $one; } +xmsge; print qq{'$s'}; " '65722417,"1193,1",1012,"9,8,7",C2,Carrier Cost Recovery Fee,0.0273' '65722417,"1193-1",1012,"9-8-7",C2,Carrier Cost Recovery Fee,0.0273'
    If you have Perl 5.14+, there is a slightly simpler version of the  s{,}{-}xmsg substitution that uses the  /r modifier. See  s/PATTERN/REPLACEMENT/msixpodualgcer in the Regexp Quote-Like Operators section of perlop.

    Update: I use  [^\x22] in the char class in the  (\" [^^\x22]+ \") capture of the substitution only because my REPL doesn't like unbalanced double-quote characters! Note also that in the Windows command line, embedded double-quotes must be escaped, and that certain other characters such as ^ must be conditionally escaped. So the true version of the regex capture sub-expression is  (" [^\x22]+ ") or more simply  (" [^"]+ ") instead. Also note that this regex does not handle escaped characters within the quoted string! Also note that  tr{,}{-} (or  tr{,}{-}r with 5.14+) might be a bit quicker.


    Give a man a fish:  <%-(-(-(-<

      Thanks! That worked great. I added a second substitute in the regex to remove the double quotes and I have the desired result.

      "Its not how hard you work, its how much you get done."

Re: Modifying CSV File
by Tux (Canon) on Jun 17, 2015 at 10:09 UTC

    I also am very curious what you did to make Text::CSV_XS be 4 time slower than your code.

    For what it is worth, I tried to come up with the most efficient Text::CSV_XS code to attack your situation, only to find there was a bug in the module, which is now fixed in version 1.19. The code would be something like this:

    use 5.20.0; use warnings; use Text::CSV_XS; my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, keep_meta_info => 11, }); my $row = $csv->getline (*DATA); # get the first line to count number +of fields my @row = @$row; $csv->bind_columns (\(@row)); # speed up all remaining fetches do { $row[$_] =~ tr{,}{-} for grep { $csv->is_quoted ($_) } 0..$#row; $csv->say (*STDOUT, \@row); } while ($csv->getline (*DATA)); __END__ 65722417,"1193,1",7980,1133566,4169735,035,FEDERAL UNIVERSAL SERVICE F +UND,0.12998 65722417,"1193,1",1012,1132900,4150053,C2,Carrier Cost Recovery Fee,0. +0273
    $ perl test.pl 65722417,"1193-1",7980,1133566,4169735,035,"FEDERAL UNIVERSAL SERVICE +FUND",0.12998 65722417,"1193-1",1012,1132900,4150053,C2,"Carrier Cost Recovery Fee", +0.0273

    Not being dynamic, not forcing fields with a space to be quoted, and using a file instead of data, that would be:

    use 5.20.0; use warnings; use Text::CSV_XS; my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, quote_space => 0, keep_meta_info => 11, }); open my $io, "<", "test.csv"; my @row = ("") x 8; # The CSV has 8 columns $csv->bind_columns (\(@row)); while ($csv->getline ($io)) { $row[$_] =~ tr{,}{-} for grep { $csv->is_quoted ($_) } 0..$#row; $csv->say (*STDOUT, \@row); }

    I'd be very interested in how much worse that performs on big datasets over your code.


    Enjoy, Have FUN! H.Merijn
      I used your code and it took 82 seconds to process app 1.4 million records. My code processes the same records in 17 seconds. Also, since I am using Perl 5.8.9 I replaced "say" with "print", which worked, however the output did not have newlines (i.e., the output was one long record). I tried adding "\n" as a third argument to $csv->print but it gave me an error, so I'm not sure how to get newlines in my output. Is that related to the bug you found?

      "Its not how hard you work, its how much you get done."

        the say method I used in the code is the say method of Text::CSV_XS and does not require a recent perl to do what it is supposed to do. It acts as if you have a second CSV objects that has eol => "\r\n". This will work back to perl-5.6.1. If both your input and output documents have the same line ending, set it in the constructor and use print instead.

        Can you provide us with sample data and the simple code you use? If your data is indeed (very) simple and has no embedded newlines or escapes, then your code could be optimized to skip all situations that Text::CSV_XS has to consider. In that case, Text::CSV::Easy_XS or Text::CSV::Easy_PP might be of interest, as they do not allow altering the settings like separation character or quotation. They however have no means to check if the original field was quoted, which was what you stated as a precondition for your request. Text::CSV_XS might also be faster if you drop that requirement and/or if you know in advance which columns are object to the substitution.

        If you are not interested in keeping the original quotation, and you only want to replace the comma in column 2 you can gain a lot. As an example test, I created a 1_400_000 line file with just the two lines from your data example and have two tests. The first keeps original quotation and replaces , to - in all quoted columns. The second ignores all quotation requirements and replaces all , with - just in the second column, still generationg correct CSV data:

        use 5.20.0; use warnings; use Text::CSV_XS; my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, quote_space => 0, keep_meta_info => 11, }); open my $io, "<", "test.csv"; open my $oh, ">", "out.csv"; my @row = ("") x 8; # The CSV has 8 columns $csv->bind_columns (\(@row)); while ($csv->getline ($io)) { $row[$_] =~ tr{,}{-} for grep { $csv->is_quoted ($_) } 0..$#row; $csv->say ($oh, \@row); }
        $ time perl test.pl 20.120u 0.156s 0:20.48 98.9% 0+0k 13888+214656io 0pf+0w
        use 5.20.0; use warnings; use Text::CSV_XS; my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, eol => "\n", }); open my $io, "<", "test.csv"; open my $oh, ">", "out.csv"; my @row = ("") x 8; $csv->bind_columns (\(@row)); while ($csv->getline ($io)) { $row[1] =~ tr{,}{-}; $csv->print ($oh, \@row); }
        $ time perl test2.pl 5.564u 0.144s 0:05.90 96.6% 0+0k 16+214656io 0pf+0w

        That is 3.6 times faster. YMMV


        Enjoy, Have FUN! H.Merijn
Re: Modifying CSV File
by Anonymous Monk on Jun 16, 2015 at 22:58 UTC

    BTW, before someone mentions using Text::CSV_XS, I have a version that uses the module but it takes over 4 times longer to process. The real files I will be processing will be huge and I need all the speed I can squeeze out of this.

    Oh really? I'd like to see that ;)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1130628]
Approved by hippo
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (3)
As of 2024-03-28 17:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found