http://www.perlmonks.org?node_id=448594


in reply to Re: cut vs split (suggestions)
in thread cut vs split (suggestions)

Some of the slowness appears to be related to the split, rather than the join. On my system, this takes about 9s:
$ time perl -lanF, -e 'print join ",", @F[0..14];' numbers.csv > /dev/null

real    0m9.880s
user    0m9.716s
sys     0m0.034s
and your version takes a little less:
$ time perl -lanF, -e 'BEGIN{ $,=","} print @F[0..14];' numbers.csv > /dev/null

real    0m8.974s
user    0m8.772s
sys     0m0.042s
but this one, which avoids both -a and join, takes only about 3.4s:
$ time perl -ln -e 'print $1 if /((?:[^,]+,){14}[^,]+)/' numbers.csv > /dev/null

real    0m3.412s
user    0m3.370s
sys     0m0.031s

Re^3: cut vs split (suggestions)
by BrowserUk (Patriarch) on Apr 17, 2005 at 06:01 UTC

    I don't think it is split (which also uses the regex engine) so much as the assignment to the (global) array.

    Avoiding that gets the time down from 37 seconds to just under 7 on my system.

    [ 6:55:11.07] P:\test>perl -lne"BEGIN{$,=','} print+(split',',<>)[0..14] " data\25x500000.csv >junk
    [ 6:55:17.90] P:\test>

    Of course, that's only really useful if you want to print them straight out again, but I guess it gets closer to being comparable with what cut does.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco.
    Rule 1 has a caveat! -- Who broke the cabal?
      Very interesting! Thanks for the new idea!

      I lost my server connection for some reason, so I tested this on my laptop, and I do see a very good improvement with your modification.

      Update: corrected <> to $_ per pijll's post.

      C:\>perl -lne "BEGIN{$,=','} print+(split',',$_)[0..14] " > junk
      This finishes in about 14 seconds (corrected timing).

      C:\>perl -lanF, -e "BEGIN{ $,=\",\"} print @F[0..14];" numbers.csv > junk
      This takes about 18 seconds.

      I don't have a timing utility in Windows so the times are just wallclock times.

      I guess Windows is faster because the process runs at 100% CPU (or whatever is required, I guess?). On the UNIX servers the process might be more time-shared?

      My laptop is a 1.6GHz Centrino with 1GB RAM, running perl v5.6.1.

      cheers

      SK Update: Thanks pijll, the time it takes to run your version of the code is almost the same as the one that uses -n.

        You are using both the -n switch and <> in the first line! This means you lose half of your lines...

        Anyway: -n does an unnecessary chomp on every line, so remove that; and use a limit on split: it doesn't actually need to split all 25 fields:

        perl -le 'BEGIN{$,=","} print+(split",",$_,16)[0..14]for <>' numbers.csv
        Update: But for <> reads all lines in at once; you may not want that with large files, so use while (<>) instead.

      Now it is my machine that is the slowpoke; still, for me, the internal pipe continues to fare best:

      % time perl -lne 'BEGIN{ $,=","}; print+(split ",")[0..14]' numbers.csv > /dev/null
      12.65s user 0.01s system 98% cpu 12.890 total
      % time perl -le 'open IN, q(cut -d, -f"1-15" numbers.csv|); print join ",", ( chomp and split /,/ ) while <IN>' > /dev/null
      8.17s user 0.01s system 90% cpu 9.070 total

      the lowliest monk

Re^3: cut vs split (suggestions)
by tlm (Prior) on Apr 17, 2005 at 06:37 UTC

    If your numbers.csv file is the same size as mine, then your perl is running 3X faster than mine. What's your configuration? Mine is perl v5.8.4 running on a Pentium 4 2.0GHz 768MB laptop.

    the lowliest monk

      Mine's perl v5.8.4 (from Debian unstable) running on an Athlon64 3000+ (1.8GHz) with 1GB of memory.