http://www.perlmonks.org?node_id=448560


in reply to cut vs split (suggestions)

For your perl test, try using:

perl -lanF, -e 'BEGIN{ $,="," } print @F[0..14];' numbers.csv > out.csv

It may be that at least part of the difference is the process of joining the values from the slice before printing them. By setting $, = ',', you achieve the same effect without forcing perl to build a single concatenated string from the values before passing them to print.

I don't have a feel for how big a difference it will make, but it's worth trying.
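For illustration, a minimal standalone sketch of the idea (the sample values are arbitrary): with $, set, print emits the separator between its arguments itself, so the code never builds the joined string explicitly.

#!/usr/bin/perl
use strict;
use warnings;

my @fields = ( 169, 970, 983, 721, 411 );

# Explicit join: an intermediate comma-separated string is built first.
print join( ',', @fields ), "\n";

# Output field separator: print inserts the separator itself.
{
    local $, = ',';     # output field separator
    local $\ = "\n";    # output record separator, like -l's line ending
    print @fields;      # prints 169,970,983,721,411
}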


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.
Rule 1 has a caveat! -- Who broke the cabal?

Re^2: cut vs split (suggestions)
by dave0 (Friar) on Apr 17, 2005 at 05:47 UTC
    Some of the slowness appears to be related to the split, rather than the join. On my system, this takes about 9s:
    $ time perl -lanF, -e 'print join ",", @F[0..14];' numbers.csv > /dev/null
    real    0m9.880s
    user    0m9.716s
    sys     0m0.034s
    and your version takes a little less:
    $ time perl -lanF, -e 'BEGIN{ $,=","} print @F[0..14];' numbers.csv > /dev/null
    real    0m8.974s
    user    0m8.772s
    sys     0m0.042s
    but this one, avoiding both -a and join, only takes about 3.4s:
    $ time perl -ln -e 'print $1 if /((?:[^,]+,){14}[^,]+)/' numbers.csv > /dev/null
    real    0m3.412s
    user    0m3.370s
    sys     0m0.031s
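    For readability, here is a commented long-hand version of that regex one-liner (same pattern and file name; the while loop stands in for -n and the chomp for -l), purely as an illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $in, '<', 'numbers.csv' or die "numbers.csv: $!";
    while ( my $line = <$in> ) {
        chomp $line;    # what -l does in the one-liner
        # Capture the first 15 comma-separated fields in one match:
        # 14 repetitions of "field plus comma", then one final field.
        print "$1\n" if $line =~ /((?:[^,]+,){14}[^,]+)/;
    }
    close $in;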

      I don't think it is split (which also uses the regex engine) so much as the assignment to the (global) array @F.

      Avoiding that gets the time down from 37 seconds to just under 7 on my system.

      [ 6:55:11.07] P:\test>perl -lne"BEGIN{$,=','} print+(split',',<>)[0..14] " data\25x500000.csv >junk
      [ 6:55:17.90] P:\test>

      Of course, that's only really useful if you just want to print them straight out again, but I guess it gets closer to being comparable with what cut does.
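      A rough Benchmark sketch of that difference, for illustration only (the sample line is made up, and a lexical array stands in for the global @F that -a assigns to):

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Benchmark qw(cmpthese);

      # One made-up record of 25 random integer fields.
      my $line = join ',', map { int rand 1000 } 1 .. 25;

      cmpthese( -3, {
          # Assign split's result to an array, then slice it (the -a style).
          assign_then_slice => sub {
              my @f      = split /,/, $line;
              my @wanted = @f[ 0 .. 14 ];
          },
          # Slice the list returned by split directly, with no array in between.
          slice_split => sub {
              my @wanted = ( split /,/, $line )[ 0 .. 14 ];
          },
      } );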


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco.
      Rule 1 has a caveat! -- Who broke the cabal?
        Very interesting! Thanks for the new idea!

        I lost my server connection for some reason, so I tested this on my laptop, and I do see a very good improvement with your modification.

        Update: corrected <> to $_ per pijll's post.

        C:\>perl -lne "BEGIN{$,=','} print+(split',',$_)[0..14] " numbers.csv > junk
        this finishes in about 14 seconds (corrected timing).

        C:\>perl -lanF, -e "BEGIN{ $,=\",\"} print @F[0..14];" numbers.csv > junk
        this takes about 18 seconds

        I don't have a timing utility in Windows so the times are just wallclock times.

        I guess Windows is faster because the process runs at 100% CPU (or whatever is required, I guess?). On the UNIX servers the process might be more time-shared?

        My laptop is a 1.6GHz Centrino with 1GB RAM, running perl v5.6.1.

        cheers

        SK Update: Thanks pijll, the time it takes to run your version of the code is almost the same as the one that uses -n.

        Now it is my machine that is the slowpoke; still, for me, the internal pipe continues to fare best:

        % time perl -lne 'BEGIN{ $,=","}; print+(split ",")[0..14]' numbers.csv \
            > /dev/null
        12.65s user 0.01s system 98% cpu 12.890 total
        % time perl -le 'open IN, q(cut -d, -f"1-15" numbers.csv|); \
            print join ",", ( chomp and split /,/ ) while <IN>' > /dev/null
        8.17s user 0.01s system 90% cpu 9.070 total
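        For readability, a standalone sketch of that internal-pipe idea (it assumes the list form of open, available in perl 5.8 and later; file name and field range are as above):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Let cut extract the columns and read its output through a pipe.
        open my $cut, '-|', 'cut', '-d,', '-f', '1-15', 'numbers.csv'
            or die "cannot start cut: $!";

        while ( my $line = <$cut> ) {
            chomp $line;
            # Split and re-join as in the one-liner, so the Perl side does
            # comparable work to the pure-Perl versions above.
            print join( ',', split /,/, $line ), "\n";
        }
        close $cut or warn "cut exited with status $?";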

        the lowliest monk

      If your numbers.csv file is the same size as mine, then your perl is running 3X faster than mine. What's your configuration? Mine is perl v5.8.4 running on a Pentium 4 2.0GHz laptop with 768MB of RAM.

      the lowliest monk

        Mine's perl v5.8.4 (from Debian unstable) running on an Athlon64 3000+ (1.8GHz) with 1GB of memory.
Re^2: cut vs split (suggestions)
by tlm (Prior) on Apr 17, 2005 at 03:33 UTC

    Here are the numbers on my machine (the first command generates the input used):

    % perl -le 'BEGIN{$,=","} print map int rand 1000, 1..25 for 1..500_000' \
        > numbers.csv
    % time cut -d, -f"1-15" numbers.csv > /dev/null
    0.80s user 0.05s system 98% cpu 0.859 total
    % time perl -lanF, -e 'print join ",", @F[0..14];' numbers.csv > /dev/null
    31.54s user 0.06s system 97% cpu 32.462 total
    % time perl -lanF, -e 'BEGIN{ $,=","} print @F[0..14];' numbers.csv > /dev/null
    31.14s user 0.05s system 99% cpu 31.463 total
    (I guess I have a much faster cut than sk's...)
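    For reference, a long-hand sketch of that data generator (the one-liner writes to stdout and redirects; this version writes numbers.csv directly):

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $out, '>', 'numbers.csv' or die "numbers.csv: $!";
    # 500,000 lines of 25 random integers in 0..999, comma-separated.
    for ( 1 .. 500_000 ) {
        print {$out} join( ',', map { int rand 1000 } 1 .. 25 ), "\n";
    }
    close $out;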

    the lowliest monk

      Faster than mine also:

      [ 4:40:33.95] P:\test>cut -d, -f 1-15 data\25x500000.csv >nul
      [ 4:41:13.59] P:\test>
      [ 4:42:48.34] P:\test>perl -lanF, -e "BEGIN{ $,=','} print @F[0..14];" data\25x500000.csv >nul
      [ 4:43:25.60] P:\test>

      40 seconds for cut versus 37 for Perl.

      That said, that time for your cut seems almost too good to be true. Are you sure that cut can't somehow detect that it is writing to the null device and simply skip it--like perl's sort detects a void context and skips?

      It's probably just a very well optimised, time-honed Unix utility versus a bad Win32 emulation, but 0.80s for 500,000 records is remarkable enough to make me check.

      I just remembered something I discovered a long time ago. The Win32 nul device is slower than writing to a file!?

      [ 4:53:24.51] P:\test>cut -d, -f 1-15 data\25x500000.csv >junk
      [ 4:53:38.01] P:\test>

      Actually writing the file cuts the 40 seconds to 14. Go figure.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco.
      Rule 1 has a caveat! -- Who broke the cabal?

        That said, that time for your cut seems almost too good to be true. Are you sure that cut can't somehow detect that it is writing to the null device and simply skip it--like perl's sort detects a void context and skips?

        As it happens, in the very first run of cut I tried, I sent the output to a file; and yes, I was pleasantly surprised to see how fast this cut was. But what benefit could there be to the optimization you describe? If there is one, I sure can't think of it. And why would a no-op still take 0.9s?

        Anyway, FWIW:

        % time cut -d, -f"1-15" numbers.csv > out.csv
        0.80s user 0.11s system 100% cpu 0.906 total
        % wc out.csv
         500000  500000 29174488 out.csv
        % head -1 out.csv
        169,970,983,721,411,426,262,255,484,174,389,651,175,975,763
        % tail -1 out.csv
        936,347,232,520,436,359,208,737,788,226,731,497,755,746,812

        the lowliest monk

        I was curious about the auto-detect in cut. So I tested it on my machine again:

        [sk]% time cut -d, -f"1-15" numbers.csv > junk
        5.630u 0.260s 0:06.12 96.2%
        [sk]% time cut -d, -f"1-15" numbers.csv > /dev/null
        5.620u 0.030s 0:05.65 100.0%
        I guess Pustular must be on a really fast machine :) -SK