http://www.perlmonks.org?node_id=448559

sk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

This topic branches out from one of my other posts Efficient way to sum columns in a file. Since this topic is slightly different from the earlier one I am starting a new thread.

I tested two ways to cut columns from a delimited file. The first one was UNIX cut and the other was a simple Perl script. Unfortunately the Perl script performed poorly against the cut utility. I ran the tests a few times to make sure the results are statistically significant.

Here are the timed test results:

[sk]% time cut -d, -f"1-15" numbers.csv > out.csv
5.670u 0.340s 0:06.27 95.8%
[sk]% time perl -lanF, -e 'print join ",", @F[0..14];' numbers.csv > out.csv
31.950u 0.200s 0:32.26 99.6%

The above test was done with 500,000 rows and 25 columns. The cut operation was performed to get the first 15 columns. The link above has code to generate random data (thanks to Random Walk).

As you can see, my Perl script is not as good as UNIX cut. I have two questions here:

1. Can this script be improved so that its performance is comparable to the UNIX cut command? If the Perl script can finish in 10 seconds that would be great (a 50% drop in performance)! I am happy to take this performance hit because it keeps the script clean and portable (typically I work on UNIX machines, so this is not a huge requirement).

2. If that is not possible, would you typically consider piping output from cut when the script does not require all the columns for processing? i.e. say the script only needs 3 columns out of a possible 200 columns -- would you pipe the 3-column output from cut instead of splitting all 200 columns in Perl and keeping only the 3 that are required?

I typically work with large files (~a few million rows by 500-800 columns).

Thanks in advance for your thoughts!

cheers

SK

Re: cut vs split (suggestions)
by BrowserUk (Patriarch) on Apr 17, 2005 at 02:34 UTC

    For your perl test, try using:

    perl -lanF, -e 'BEGIN{ $,="," } print @F[0..14];' numbers.csv > out.csv

    It may be that at least a part of the difference is the process of joining the values from the slice before printing them. By setting $, = ',', you achieve the same effect without forcing perl to build a single concatenated string of the values before passing them to print.

    I don't have a feel for how big a difference it will make, but it's worth trying.
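
    For anyone who wants to measure that in isolation, here is a minimal Benchmark sketch comparing the two print styles on a single pre-split row (my own sketch, not from the thread; the row data is made up and /dev/null assumes a Unix-ish system, so only the relative ranking matters):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # A hypothetical 25-column row, like the test data.
    my @F = map { int rand 1000 } 1 .. 25;

    open my $null, '>', '/dev/null' or die "open: $!";

    cmpthese( -3, {
        # Build one joined string, then print it.
        join_first => sub {
            print {$null} join( ",", @F[0..14] ), "\n";
        },
        # Let print insert the separator via $, instead.
        ofs_print  => sub {
            local $, = ",";
            local $\ = "\n";
            print {$null} @F[0..14];
        },
    });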


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco.
    Rule 1 has a caveat! -- Who broke the cabal?
      Some of the slowness appears to be related to the split, rather than the join. On my system, this takes about 9s:
      $ time perl -lanF, -e 'print join ",", @F[0..14];' numbers.csv > /dev/null
      real    0m9.880s
      user    0m9.716s
      sys     0m0.034s
      and your version takes a little less:
      $ time perl -lanF, -e 'BEGIN{ $,=","} print @F[0..14];' numbers.csv > /dev/null
      real    0m8.974s
      user    0m8.772s
      sys     0m0.042s
      but this one avoiding both -a and join only takes about 3.4s:
      $ time perl -ln -e 'print $1 if /((?:[^,]+,){14}[^,]+)/' numbers.csv > /dev/null
      real    0m3.412s
      user    0m3.370s
      sys     0m0.031s

        I don't think it is the split (which also uses the regex engine) so much as the assignment to the (global) array.

        Avoiding that gets the time down from 37 seconds to just under 7 on my system.

        [ 6:55:11.07] P:\test>perl -lne"BEGIN{$,=','} print+(split',',<>)[0..14] " data\25x500000.csv >junk
        [ 6:55:17.90] P:\test>

        Of course, that's only really useful if you only want to print them straight out again, but I guess it gets closer to being comparable with what cut does.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco.
        Rule 1 has a caveat! -- Who broke the cabal?

        If your numbers.csv file is the same size as mine, then your perl is running 3X faster than mine. What's your configuration? Mine is perl v5.8.4 running on a Pentium 4 2.0GHz laptop with 768MB of RAM.

        the lowliest monk

      Here are the numbers on my machine (the first command generates the input used):

      % perl -le 'BEGIN{$,=","} print map int rand 1000, 1..25 for 1..500_000' \
      > numbers.csv
      % time cut -d, -f"1-15" numbers.csv > /dev/null
      0.80s user 0.05s system 98% cpu 0.859 total
      % time perl -lanF, -e 'print join ",", @F[0..14];' numbers.csv > /dev/null
      31.54s user 0.06s system 97% cpu 32.462 total
      % time perl -lanF, -e 'BEGIN{ $,=","} print @F[0..14];' numbers.csv > /dev/null
      31.14s user 0.05s system 99% cpu 31.463 total
      (I guess I have a much faster cut than sk's...)

      the lowliest monk

        Faster than mine also:

        [ 4:40:33.95] P:\test>cut -d, -f 1-15 data\25x500000.csv >nul
        [ 4:41:13.59] P:\test>
        [ 4:42:48.34] P:\test>perl -lanF, -e "BEGIN{ $,=','} print @F[0..14];" data\25x500000.csv >nul
        [ 4:43:25.60] P:\test>

        40 seconds for cut versus 37 for Perl.

        That said, the time for your cut seems almost too good to be true. Are you sure that cut can't somehow detect that it is writing to the null device and simply skip the output--the way perl's sort detects void context and skips the sort?

        It's probably just a very well optimised, time-honed Unix utility versus a bad Win32 emulation, but 0.80s for 500,000 records is remarkable enough to make me check.

        I just remembered something I discovered a long time ago. The Win32 nul device is slower than writing to a file!?

        [ 4:53:24.51] P:\test>cut -d, -f 1-15 data\25x500000.csv >junk
        [ 4:53:38.01] P:\test>

        Actually writing the file cuts the 40 seconds to 14. Go figure.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco.
        Rule 1 has a caveat! -- Who broke the cabal?
Re: cut vs split (suggestions)
by graff (Chancellor) on Apr 17, 2005 at 03:33 UTC
    Can this script be improved so that its performance is comparable to the UNIX cut command? If the Perl script can finish in 10 seconds that would be great (a 50% drop in performance)! I am happy to take this performance hit because it keeps the script clean and portable (typically I work on UNIX machines, so this is not a huge requirement)
    I don't think there's any way to speed up the perl approach. (I tried BrowserUK's idea -- not a rigorous benchmark, but no evidence that it made any difference.) I just have two reactions to your comments:

    (1) The unix-style "cut" is portable -- you can find free ports of unix command line utils for ms-windows, and macosx is unix, and "cut" behaves the same everywhere. What more portability do you need?

    (2) The reason to choose a perl approach over a common, compiled utility would be that the perl approach makes it a lot easier to provide a lot more flexibility, and the performance hit is a small price to pay for the extra power. I wrote my own perl version of cut years ago and use it all the time (as well as using the original "cut" when it seems quicker), because with perl I get to use a regex for the split, and output the columns in whatever order I choose, and have the input field separator be different from the output field separator (e.g. using "\n" to output one field per line), and insert arbitrary quoted strings between columns when this is convenient, and ... anything else I feel like doing, because perl makes it easy to do. Compared to the time it would take to work around the limitations of standard "cut", perl makes things really efficient.
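
    Graff's script isn't posted here, but a minimal sketch of that kind of "flexible cut" might look like the following (the column list, separators, and behaviour are my own assumptions, not graff's actual code):

    #!/usr/bin/perl
    # A bare-bones "flexible cut" sketch: regex input separator, arbitrary
    # output separator, and columns in any order. Hard-wired for brevity.
    use strict;
    use warnings;

    my $in_sep  = qr/,/;        # input separator can be any regex
    my $out_sep = "\n";         # output separator need not match the input one
    my @cols    = ( 2, 0, 1 );  # zero-based columns, reordered at will

    while (<>) {
        chomp;
        my @f = split $in_sep, $_;
        print join( $out_sep, @f[@cols] ), "\n";
    }

    Run it as perl flexcut.pl numbers.csv (the script name is hypothetical); a real version would of course take the separators and column list from the command line, as graff notes below for the "cut"-piping variant.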

    ... would you typically consider piping output from cut when the script does not require all the columns for processing? i.e. say the script only needs 3 columns out of a possible 200 columns -- would you pipe the 3-column output from cut instead of splitting all 200 columns in Perl and keeping only the 3 that are required?
    Would a bear typically consider defecating in its natural habitat? If processing 3 columns out of 200 were something I intended to do with any regularity, I would probably write and save a perl script that does something like:
    open( IN, "cut -d, -f13-15 numbers.csv |" );
    while (<IN>) {
        chomp;
        @row = split /,/;   # this is only cols 13, 14, 15 from numbers.csv
        # and now, do something
    }
    (update: naturally, I would have this perl script accept command-line options to specify the field separator and column selections for running the "cut" command, assuming this sort of flexibility were useful.)
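
    A sketch of what that option handling might look like, wrapping the piped "cut" as above (the -d/-f option names simply mirror cut and are my assumption, not graff's actual interface):

    use strict;
    use warnings;
    use Getopt::Long;

    my $delim  = ',';
    my $fields = '1-3';
    GetOptions( 'd=s' => \$delim, 'f=s' => \$fields )
        or die "usage: $0 [-d delim] [-f fields] file\n";
    my $file = shift or die "usage: $0 [-d delim] [-f fields] file\n";

    # Let the external cut do the column slicing, then process its output.
    open my $in, '-|', 'cut', "-d$delim", "-f$fields", $file
        or die "can't run cut: $!";
    while (<$in>) {
        chomp;
        my @row = split /\Q$delim\E/;
        # ... do something with the selected columns ...
    }
    close $in;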
      Thanks Everyone!

      Sorry I was not very clear on a couple of things. When I mentioned piping, yes, I had the pipe inside an open in mind. I did not realize that I had used a command-line example where the pipe reads as  cut -d, ... | perl ...  :)

      To graff's comment... I used the word portability very loosely and that is my fault... I sometimes write small utilities for my co-workers and want these utilities to work on other OSes... even though I could find a version of cut for Windows (actually I did not know about this until after reading your post), I will not be able to install it on other people's machines for policy reasons :(

      I did not think about the print idea (even though it does not make a difference in the runtime here); I feel such small improvements can help!!!

      I agree with you all that it is definitely worth the effort to have the task done in Perl, as it gives an enormous amount of flexibility.

      Thanks again all for your wonderful comments!

      cheers

      SK

        ... even though I could find a version of cut for Windows ... I will not be able to install it on other people's machines for policy reasons :(
        Um, do you mean that you can give your friends perl scripts to run, but you can't give them a "cut.exe" file? That's some weird policy.
      The unix-style "cut" is portable -- you can find free ports of unix command line utils for ms-windows, and macosx is unix, and "cut" behaves the same everywhere. What more portability do you need?
      I just had a silly idea: what about using something like Inline::C to integrate the part of the 'cut' utility you need into the script?
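
      I haven't benchmarked it, but a bare-bones sketch of that idea might look like this (my own sketch; it assumes Inline::C and a C compiler are available, and the field count is hard-coded to keep it short):

      use strict;
      use warnings;
      use Inline C => q{
          /* Return everything up to (but not including) the n-th comma,
             or the whole line if it has fewer commas than that. */
          SV* cut_fields(char* line, int n) {
              char* p    = line;
              int   seen = 0;
              while (*p) {
                  if (*p == ',' && ++seen == n) break;
                  p++;
              }
              return newSVpvn(line, p - line);
          }
      };

      while (<>) {
          chomp;
          print cut_fields($_, 15), "\n";
      }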
Re: cut vs split (suggestions)
by jZed (Prior) on Apr 17, 2005 at 05:47 UTC
    Well, there's always that middle ground between C and perl -- XS modules. cut probably still beats it, but Text::CSV_XS appears to be faster than split for files of that size. It also has the added advantage of supporting embedded newlines and embedded delimiters.

    For a 1-million-row table of 100 columns, summing the second column:

                 s/iter        Split Text::CSV_XS
    Split          49.8           --          -8%
    Text::CSV_XS   45.9           8%           --
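
    For reference, the Text::CSV_XS side of such a comparison would look roughly like this (my own sketch of the idea, not the actual benchmark code; it sums the second column of numbers.csv):

    use strict;
    use warnings;
    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new( { binary => 1 } )
        or die "Cannot use Text::CSV_XS: " . Text::CSV_XS->error_diag();

    open my $fh, '<', 'numbers.csv' or die "numbers.csv: $!";

    my $sum = 0;
    while ( my $row = $csv->getline($fh) ) {
        $sum += $row->[1];    # second column
    }
    close $fh;
    print "$sum\n";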
    
Re: cut vs split (suggestions)
by tlm (Prior) on Apr 17, 2005 at 02:51 UTC

    I'd be very surprised if a pure-Perl script beat a native utility like cut.

    If cutting columns out of these large files is a bottleneck, your options include:

    • As you mentioned, piping the output of cut to your script;
    • Running cut as a pipe from within your script:
      open my $cutter, '-|', 'cut', '-d,', '-f1-15', $mongo
          or die "Fork failed: $!\n";
      go_to_town( $_ ) while <$cutter>;
      close $cutter;
      See perlipc.
    • Write a Perl extension module (XS, Inline::C, SWIG) to extract the columns.

    I like the second option best.

    the lowliest monk

      Once you read each line of output from cut via the piped open, you are still going to have to split it into an array in order to use the fields, so I think most if not all of the performance advantage of using cut will be lost, though splitting 15 fields cut from 200 rather than all 200 may help.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco.
      Rule 1 has a caveat! -- Who broke the cabal?

        The internal pipe approach is about 1.5X faster than the pure Perl approach (though still a far cry from cut):

        % time perl -le 'open IN, q( cut -d, -f"1-15" numbers.csv| ); \
        print join ",", ( chomp and @F = split /,/ ) while <IN>' > /dev/null
        19.49s user 0.00s system 96% cpu 20.289 total

        Update: But keep in mind that the numbers above are for a relatively fast cut command. The improvement with sk's cut will be more modest; it'd be interesting to see the actual numbers.

        the lowliest monk

Re: cut vs split (suggestions)
by demerphq (Chancellor) on Apr 17, 2005 at 10:25 UTC

    I think you could speed up the perl code somewhat by not using the autosplit function.

    perl -lne 'print join ",", (split /,/,$_,16)[0..14];' numbers.csv

    Should be slightly faster as it need not split the additional 10 fields out before rejoining.
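
    To make the effect of the limit concrete, a tiny sketch with made-up data (not from the thread):

    # With a limit of 16, split stops after 16 fields; the 16th field holds
    # the untouched remainder of the line, which the slice simply ignores.
    my $line  = join ",", 1 .. 25;
    my @first = ( split /,/, $line, 16 )[ 0 .. 14 ];   # at most 16 fields built
    my @all   = ( split /,/, $line     )[ 0 .. 14 ];   # all 25 fields built first
    print "@first\n@all\n";    # same output, less work in the first case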

    ---
    demerphq

Re: cut vs split (suggestions)
by Mabooka-Mabooka (Sexton) on Apr 17, 2005 at 19:53 UTC
    Mabooka % head -1 numbers.csv
    462,393,252,996,663,603,344,439,139,259,879,766,545,192,477,986,317,77,611,303,79,742,190,556,538
    Mabooka % wc -l numbers.csv
    500000 numbers.csv
    Mabooka %
    Mabooka % time perl -lanF, -e 'print join ",", @F[0..4];' numbers.csv > f1
    27.820u 0.100s 0:27.92 100.0% 0+0k 0+0io 320pf+0w
    Mabooka % time cut -d, -f"1-5" numbers.csv > f2
    1.860u 0.100s 0:01.96 100.0% 0+0k 0+0io 100pf+0w
    Mabooka % diff f1 f2
    Mabooka %
    So it's clear what to use (if it's a bottleneck problem rather than an academic dispute).

    Now, back to the original problem (summing up columns): on my system, for 500,000 rows the difference is negligible, so I tried with 5,000,000 rows x 25 cols:
    Mabooka % time perl -nle'my($a,$b,$c,$d,$e)=split /,/;$ta+=$a, $tb+=$b, $tc+=$c, $td+=$d, $te+=$e;END{print join " ", $ta,$tb,$tc,$td,$te}' numbers.csv
    2499084140 2499188390 2500073650 2497725180 2495867770
    45.270u 0.200s 0:45.44 100.0% 0+0k 0+0io 322pf+0w
    Mabooka %
    Mabooka % time sum5.cut_n_awk
    2499084140 2499188390 2500073650 2497725180 2495867770
    18.520u 0.490s 0:12.52 151.8% 0+0k 0+0io 575pf+0w
    Mabooka %
    where:
    Mabooka % cat sum5.cut_n_awk
    #
    cat numbers.csv | cut -f1,2,3,4,5 -d, | awk -F, '{s1 += $1; s2 += $2; s3 += $3; s4 += $4; s5 += $5} END {printf ("%.0f %.0f %.0f %.0f %.0f\n", s1, s2, s3, s4, s5)}'

    A 3-4 times difference isn't bad. Maybe this would help...
        LOL :-). Thanks for the article: I hadn't seen it.

        Actually, the main purpose of cat for me is not having to memorize all of grep's switches. E.g.:
        Mabooka % grep -w awk *awk* | sort | uniq | wc -l
        10
        Mabooka % cat *awk* | grep -w awk | sort | uniq | wc -l
        8
        Mabooka %
        It's also good for reusing the previous command. As for the performance implication: it's minimal (somebody has to do the disk I/O anyway).
Re: cut vs split (suggestions)
by cees (Curate) on Apr 18, 2005 at 03:51 UTC

    Lots of examples in this thread, but everyone seems to be using a split and join of some sort. I doubt cut will do this much work just to print out the first 15 columns.

    When looking for speed, the first thing I usually look at is removing the need for any regexps where index and substr will do just as well! So why not just look for the 15th ',' and print everything before it?

    while (<>) {
        $col = 15;
        $index = index($_, ',', $index+1) while ($col--);
        print substr($_, 0, $index), $/;
        $index = 0;
    }

    This doesn't do any unnecessary string manipulation, and avoids expensive regexps as well. It could easily be extended to not start at the first column, as sketched below.
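
    For example, skipping leading columns just means finding the starting comma the same way first. A quick sketch (my own, not timed against the version above) that prints columns 3 through 15:

    # Hypothetical extension: print columns 3..15 (1-based) using only
    # index/substr. Assumes every line has at least 15 columns.
    while (<>) {
        # Find the comma that ends column 2; column 3 starts right after it.
        my $skip  = 2;
        my $start = 0;
        $start = index( $_, ',', $start ) + 1 while $skip--;

        # From there, find the comma that ends column 15 (13 more fields).
        my $col = 13;
        my $end = $start - 1;
        $end = index( $_, ',', $end + 1 ) while $col--;

        print substr( $_, $start, $end - $start ), $/;
    }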

    This proved faster than some of the examples above that I tried out (I didn't try them all).

    $ time perl cut.pl numbers.csv > /dev/null
    real    0m5.577s
    user    0m4.792s
    sys     0m0.055s
    $ time cut -d, -f"1-15" numbers.csv > /dev/null
    real    0m1.081s
    user    0m0.866s
    sys     0m0.042s

    - Cees

      I am Perl-illiterate, but:

      how is 5 seconds faster than 1 second?
      And how could "index"/"substr" possibly be faster than something like:
      char* cut(char* pStr, int col, char* delim=",")
      {
          // not real-life code, all good assumptions here:
          int n = 0;
          while (*pStr++) {
              if (*pStr == *delim) {
                  if (++n == col) {
                      return until_next_one(++pStr, delim);
                  }
              }
          }
      }
      --?
      Sorry if I missed the point.

        I wasn't implying that my perl version was faster than the C version. I think it is safe to say that for every perl program, there is a C program that can perform the same task in less time.

        I was stating that using index and substr where you can is almost always faster than using a regexp, so my index and substr version was faster than the join and split versions listed above. So I was comparing different perl implementations.

        I guess I probably should have timed one of the split/join examples and included that, but I didn't know which one was the fastest. So I included the timing based on the C version of cut, which gives a baseline for anyone to compare against.
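
        For anyone who wants to fill that gap, a quick Benchmark sketch of such a head-to-head might look like this (my own sketch, reusing the thread's 15-of-25-column example; absolute numbers will vary widely by machine):

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        # Read the test file once so only the extraction is being timed.
        open my $fh, '<', 'numbers.csv' or die "numbers.csv: $!";
        my @lines = <$fh>;
        close $fh;

        cmpthese( -3, {
            split_join => sub {
                for (@lines) {
                    my $out = join ",", ( split /,/ )[ 0 .. 14 ];
                }
            },
            index_substr => sub {
                for (@lines) {
                    my $col   = 15;
                    my $index = 0;
                    $index = index( $_, ',', $index + 1 ) while $col--;
                    my $out = substr( $_, 0, $index );
                }
            },
        });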

        - Cees