Re^2: cut vs split (suggestions)
by dave0 (Friar) on Apr 17, 2005 at 05:47 UTC
Some of the slowness appears to be related to the split, rather than the join. On my system, this takes about 9s:
$ time perl -lanF, -e 'print join ",", @F[0..14];' numbers.csv > /dev/
+null
real 0m9.880s
user 0m9.716s
sys 0m0.034s
and your version takes a little less:
$ time perl -lanF, -e 'BEGIN{ $,=","} print @F[0..14];' numbers.csv >
+/dev/null
real 0m8.974s
user 0m8.772s
sys 0m0.042s
but this one avoiding both -a and join only takes about 3.4s:
$ time perl -ln -e 'print $1 if /((?:[^,]+,){14}[^,]+)/' numbers.csv >
+ /dev/null
real 0m3.412s
user 0m3.370s
sys 0m0.031s
I don't think it is split (which also uses the regex engine) so much as the assignment to the (global) array.
Avoiding that gets the time down from 37 seconds to just under 7 on my system.
[ 6:55:11.07] P:\test>perl -lne"BEGIN{$,=','} print+(split',',<>)[0..1
+4] " data\25x500000.csv >junk
[ 6:55:17.90] P:\test>
Of course, that's only really useful if you just want to print them straight out again, but I guess it gets closer to being comparable with what cut does.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.
Rule 1 has a caveat! -- Who broke the cabal?
C:\>perl -lne "BEGIN{$,=','} print+(split',',$_)[0..14] " > junk
This finishes in about 14 seconds (corrected timing).
C:\>perl -lanF, -e "BEGIN{ $,=\",\"} print @F[0..14];" numbers.csv > j
+unk
this takes about 18 seconds
I don't have a timing utility on Windows, so the times are just wall-clock times.
I guess Windows is faster because the process runs at 100% CPU (or whatever is required). On the Unix servers the process might be more time-shared?
My laptop is a 1.6GHz Centrino with 1GB RAM, running perl v5.6.1.
cheers
SK
Update: Thanks pijll, the time it takes to run your version of the code is almost the same as the one that uses -n.
% time perl -lne 'BEGIN{ $,=","}; print+(split ",")[0..14]' numbers.cs
+v \
> /dev/null
12.65s user 0.01s system 98% cpu 12.890 total
% time perl -le 'open IN, q(cut -d, -f"1-15" numbers.csv|); \
print join ",", ( chomp and split /,/ ) while <IN>' > /dev/null
8.17s user 0.01s system 90% cpu 9.070 total
Mine's perl v5.8.4 (from Debian unstable) running on an Athlon64 3000+ (1.8GHz) with 1GB of memory.
Re^2: cut vs split (suggestions)
by tlm (Prior) on Apr 17, 2005 at 03:33 UTC
% perl -le 'BEGIN{$,=","} print map int rand 1000, 1..25 for 1..500_00
+0' \
> numbers.csv
% time cut -d, -f"1-15" numbers.csv > /dev/null
0.80s user 0.05s system 98% cpu 0.859 total
% time perl -lanF, -e 'print join ",", @F[0..14];' numbers.csv > /dev/
+null
31.54s user 0.06s system 97% cpu 32.462 total
% time perl -lanF, -e 'BEGIN{ $,=","} print @F[0..14];' numbers.csv >
+/dev/null
31.14s user 0.05s system 99% cpu 31.463 total
(I guess I have much faster cut than sk's...)
[ 4:40:33.95] P:\test>cut -d, -f 1-15 data\25x500000.csv >nul
[ 4:41:13.59] P:\test>
[ 4:42:48.34] P:\test>perl -lanF, -e "BEGIN{ $,=','} print @F[0..14];"
+ data\25x500000.csv >nul
[ 4:43:25.60] P:\test>
40 seconds for cut versus 37 for Perl.
That said, the time for your cut seems almost too good to be true. Are you sure that cut can't somehow detect that it is writing to the null device and simply skip it--like perl's sort detects void context and skips the sort?
It's probably just a very well optimised, time-honed Unix utility versus a bad Win32 emulation, but 0.80s for 500,000 records is remarkable enough to make me check.
I just remembered something I discovered a long time ago. The Win32 nul device is slower than writing to a file!?
[ 4:53:24.51] P:\test>cut -d, -f 1-15 data\25x500000.csv >junk
[ 4:53:38.01] P:\test>
Actually writing the file cuts the 40 seconds to 14. Go figure.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.
Rule 1 has a caveat! -- Who broke the cabal?
That said, the time for your cut seems almost too good to be true. Are you sure that cut can't somehow detect that it is writing to the null device and simply skip it--like perl's sort detects void context and skips the sort?
As it happens, in the very first run of cut I tried, I sent the output to a file, and yes, I was pleasantly surprised to see how fast this cut was. But what use would the optimization you describe be? If there is one, I sure can't think of it. And why should a no-op take 0.9s?
Anyway, FWIW:
% time cut -d, -f"1-15" numbers.csv > out.csv
0.80s user 0.11s system 100% cpu 0.906 total
% wc out.csv
500000 500000 29174488 out.csv
% head -1 out.csv
169,970,983,721,411,426,262,255,484,174,389,651,175,975,763
% tail -1 out.csv
936,347,232,520,436,359,208,737,788,226,731,497,755,746,812
I was curious about the auto-detect in cut, so I tested it on my machine again:
[sk]% time cut -d, -f"1-15" numbers.csv > junk
5.630u 0.260s 0:06.12 96.2%
[sk]% time cut -d, -f"1-15" numbers.csv > /dev/null
5.620u 0.030s 0:05.65 100.0%
I guess Pustular must be on a really fast machine :)
-SK