
Re^2: Modifying CSV File

by roho (Canon)
on Jun 17, 2015 at 19:53 UTC ( #1130875=note )

in reply to Re: Modifying CSV File
in thread Modifying CSV File

I used your code and it took 82 seconds to process approximately 1.4 million records; my code processes the same records in 17 seconds. Also, since I am using Perl 5.8.9, I replaced "say" with "print". That worked, but the output had no newlines (i.e., the output was one long record). I tried adding "\n" as a third argument to $csv->print, but that gave an error, so I'm not sure how to get newlines into my output. Is that related to the bug you found?

"It's not how hard you work, it's how much you get done."

Replies are listed 'Best First'.
Re^3: Modifying CSV File
by Tux (Abbot) on Jun 18, 2015 at 06:46 UTC

    The say method I used in the code is the say method of Text::CSV_XS and does not require a recent perl to do what it is supposed to do. It acts as if you had a second CSV object with eol => "\r\n". This works back to perl-5.6.1. If both your input and output documents have the same line ending, set it in the constructor and use print instead.

    Can you provide us with sample data and the simple code you use? If your data is indeed (very) simple and has no embedded newlines or escapes, your code could be optimized to skip all the situations that Text::CSV_XS has to consider. In that case, Text::CSV::Easy_XS or Text::CSV::Easy_PP might be of interest, as they do not allow altering settings like the separation character or quotation. They have no means, however, to check whether the original field was quoted, which is what you stated as a precondition for your request. Text::CSV_XS might also be faster if you drop that requirement and/or if you know in advance which columns are subject to the substitution.

    If you are not interested in keeping the original quotation and you only want to replace the comma in column 2, you can gain a lot. As an example test, I created a 1_400_000 line file from just the two lines in your data example and ran two tests. The first keeps the original quotation and replaces , with - in all quoted columns. The second ignores all quotation requirements and replaces all , with - in the second column only, while still generating correct CSV data:

    use 5.20.0;
    use warnings;
    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new ({
        binary         => 1,
        auto_diag      => 1,
        quote_space    => 0,
        keep_meta_info => 11,
        });
    open my $io, "<", "test.csv";
    open my $oh, ">", "out.csv";
    my @row = ("") x 8;   # The CSV has 8 columns
    $csv->bind_columns (\(@row));
    while ($csv->getline ($io)) {
        $row[$_] =~ tr{,}{-} for grep { $csv->is_quoted ($_) } 0..$#row;
        $csv->say ($oh, \@row);
        }
    $ time perl
    20.120u 0.156s 0:20.48 98.9%  0+0k 13888+214656io 0pf+0w
    use 5.20.0;
    use warnings;
    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new ({
        binary    => 1,
        auto_diag => 1,
        eol       => "\n",
        });
    open my $io, "<", "test.csv";
    open my $oh, ">", "out.csv";
    my @row = ("") x 8;
    $csv->bind_columns (\(@row));
    while ($csv->getline ($io)) {
        $row[1] =~ tr{,}{-};
        $csv->print ($oh, \@row);
        }
    $ time perl
    5.564u 0.144s 0:05.90 96.6%  0+0k 16+214656io 0pf+0w

    That is 3.6 times faster. YMMV.

    Enjoy, Have FUN! H.Merijn
      Here is the code I am using (with thanks to AnomalousMonk for the regex):

      while (<$fh1>) {
          chomp;
          next if $_ eq '';
          s{ ("[^"]+") }{
              (my $one = $1) =~ s{,}{-}xmsg;
              $one =~ s{"}{}g;
              $one;
          }xmsge;
          print $_, "\n";
      }

      The test file you made should be sufficient because the only thing I am changing is the comma to a dash and removing the quotes from the one column in question.
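      For reference, here is a self-contained version of that loop with the file handles opened. The filenames test.csv and out.csv are assumed, matching the benchmark scripts elsewhere in this thread, and the one-line sample input is hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Create a tiny sample input first (hypothetical data; the real
# test.csv in this thread is 1.4 million lines).
open my $t, '>', 'test.csv' or die "test.csv: $!";
print $t qq{ab,"c,d",ef\n};
close $t;

# The loop from the post above, unchanged apart from the opens:
open my $fh1, '<', 'test.csv' or die "test.csv: $!";
open my $out, '>', 'out.csv'  or die "out.csv: $!";
while (<$fh1>) {
    chomp;
    next if $_ eq '';
    s{ ("[^"]+") }{
        (my $one = $1) =~ s{,}{-}xmsg;   # comma -> dash inside the quotes
        $one =~ s{"}{}g;                 # drop the quotes themselves
        $one;
    }xmsge;
    print $out $_, "\n";
}
close $out;
```

      With the sample line above, out.csv contains ab,c-d,ef.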

      "It's not how hard you work, it's how much you get done."

        That ran in 4.194 seconds on my dataset, which can be reduced by simplifying the regex even further.

        open my $io, "<", "test.csv";
        open my $oh, ">", "out.csv";
        while (<$io>) {
            s{ ("[^"]+") }{
                (my $one = $1) =~ tr{,}{-};
                $one =~ tr{"}{}d;
                $one;
            }xge;
            print $oh $_;
        }

        runs in 3.229 seconds. All regex-based scripts will fail if

        • the first field is quoted;
        • the second field has an embedded double-quote (or an escaped character, with the default " as escape);
        • any record anywhere in the dataset has an embedded newline and the data after the newline contains a double-quote.

        As long as you are absolutely certain that the CSV data is uniformly and consistently laid out as in these two lines, you are safe.

        I would personally never take that risk, unless those two seconds are a problem. Five seconds for 1.4 million records is pretty fast, knowing it is always safe.
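        The second failure mode above can be shown in a few lines of pure Perl. The sample line is hypothetical: its second field is quoted and contains an escaped double-quote ("" inside a quoted field means one literal " in CSV):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: a CSV parser sees field 2 as: c,d "x"
my $line = 'ab,"c,d ""x""",ef';

# The regex approach from above, applied naively:
(my $out = $line) =~ s{ ("[^"]+") }{
    (my $one = $1) =~ tr{,}{-};   # comma -> dash inside the quotes
    $one =~ tr{"}{}d;             # strip the quotes
    $one;
}xge;

# The regex stops at the first inner quote, so the field is mangled
# and stray quote characters are left behind:
print "$out\n";   # prints: ab,c-d x"",ef
```

        A real CSV parser such as Text::CSV_XS would instead produce c-d "x" for that field.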

        Enjoy, Have FUN! H.Merijn
