It turns out that BrowserUK's approach was an order of magnitude quicker than the other approaches presented here: just under 2 minutes, versus 21 to 22 minutes for everything else. I took the liberty of coding up sundialsvc4's suggestion of buffering a few hundred thousand lines' worth of data before printing them out to a file. I used array refs since there was no reason to use hash refs.
I used a PowerShell script to time the different approaches.
echo 'BrowserUK 1024' > bench.txt
Measure-Command { .\buk.bat } >> bench.txt
echo 'BrowserUK 2048' >> bench.txt
Measure-Command { .\buk2048.bat } >> bench.txt
echo 'Sundial approach by Lotus1' >> bench.txt
Measure-Command { perl sundial.pl test.csv > dialout.csv} >> bench.txt
echo spacebar >> bench.txt
Measure-Command { .\sed -n "s/^\([^\t]*\t\)[^\t]*\t\([^\t]*\t\)[^\t]*\t[^\t]*\t\([^\t]*\)\t.*/\2\1\3/p" test.csv > spaceout.csv} >> bench.txt
echo 'kenosis/choroba' >> bench.txt
Measure-Command { perl kc.pl test.csv > kcout.csv} >> bench.txt
echo 'mildside -- wrong order but testing anyway' >> bench.txt
Measure-Command { .\cut '-f1,3,6' test.csv > cutout.csv} >> bench.txt
Here are the results using a 1 GB test file on an idle server with 16 cores and 8 GB RAM (update: Windows Server 2008 R2):
BrowserUK 1024
Minutes : 1
Seconds : 54
Milliseconds : 103
Ticks : 1141033211
BrowserUK 2048
Minutes : 1
Seconds : 55
Milliseconds : 124
Ticks : 1151241665
Sundial approach by Lotus1
Minutes : 21
Seconds : 53
Milliseconds : 28
Ticks : 13130283183
spacebar
Minutes : 21
Seconds : 24
Milliseconds : 215
Ticks : 12842154788
kenosis/choroba
Minutes : 22
Seconds : 4
Milliseconds : 865
Ticks : 13248658887
mildside -- wrong order but testing anyway
Minutes : 22
Seconds : 19
Milliseconds : 295
Ticks : 13392954883
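To put a number on "an order of magnitude", here is a small sketch that turns the Ticks values above into seconds and a ratio against the fastest run. It relies on the fact that a Measure-Command tick is 100 nanoseconds; the helper name ticks_to_seconds is mine and purely illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tick counts copied from the Measure-Command output above.
# One tick is 100 nanoseconds, so 10_000_000 ticks make one second.
my @results = (
    [ 'BrowserUK 1024'             => 1141033211 ],
    [ 'BrowserUK 2048'             => 1151241665 ],
    [ 'Sundial approach by Lotus1' => 13130283183 ],
    [ 'spacebar'                   => 12842154788 ],
    [ 'kenosis/choroba'            => 13248658887 ],
    [ 'mildside'                   => 13392954883 ],
);

# Illustrative helper (not part of any script in this thread).
sub ticks_to_seconds { return $_[0] / 10_000_000 }

# Use the first entry (BrowserUK 1024, the fastest run) as the baseline.
my $fastest = ticks_to_seconds( $results[0][1] );

for my $r (@results) {
    my $secs = ticks_to_seconds( $r->[1] );
    printf "%-28s %8.1f s  %5.1fx\n", $r->[0], $secs, $secs / $fastest;
}
```

By this measure the four slower approaches all land at roughly 11 to 12 times the BrowserUK time, and the 1024 and 2048 variants are within about 1% of each other.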
Here is the sundialsvc4 approach that I put together for the test:
#! perl -sw
use strict;

my $count = 0;
my @lines;

while( <> ) {
    $count++;
    # Keep fields 1, 3 and 6; the limit of 7 stops split after the sixth tab.
    my @f = ( split /\t/, $_, 7 )[ 0, 2, 5 ];
    push @lines, \@f;
    if( $count >= 300000 ) { #size of buffer
        $count = 0;
        print_buffer( \@lines );
    }
}
# Flush whatever is left after the last full buffer.
print_buffer( \@lines );

sub print_buffer {
    my ($aref) = @_;
    foreach (@$aref) {
        print join( "\t", @$_ ) . "\n";
    }
    splice( @$aref );    # empty the buffer in place
}
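To make the flush behaviour easy to check without a 1 GB file, here is a minimal sketch of the same buffering idea wrapped in a function and run against in-memory filehandles. The names buffered_copy, $in, $out and $buffer_size are hypothetical, and unlike the script above it chomps each line first so that short lines cannot carry a newline into the output fields.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the buffering idea; not the script that was timed.
# Reads tab-separated lines from $in, keeps fields 0, 2 and 5, and writes them
# to $out in batches of $buffer_size lines. Returns the number of flushes.
sub buffered_copy {
    my ( $in, $out, $buffer_size ) = @_;
    my @lines;
    my $flushes = 0;

    my $flush = sub {
        return unless @lines;
        print {$out} join( "\t", @$_ ), "\n" for @lines;
        @lines = ();
        $flushes++;
    };

    while ( my $line = <$in> ) {
        chomp $line;
        push @lines, [ ( split /\t/, $line, 7 )[ 0, 2, 5 ] ];
        $flush->() if @lines >= $buffer_size;
    }
    $flush->();    # write whatever is left after the last full buffer
    return $flushes;
}

# Five sample rows with a buffer of two lines.
my $input = '';
$input .= join( "\t", "r$_", qw( b c d e f g ) ) . "\n" for 1 .. 5;

open my $in, '<', \$input or die "cannot open in-memory input: $!";
my $output = '';
open my $out, '>', \$output or die "cannot open in-memory output: $!";

my $flushes = buffered_copy( $in, $out, 2 );
close $out;

print "flushes: $flushes\n", $output;
```

With five rows and a buffer of two, the loop flushes twice and the final call writes the one remaining row, which is why the trailing print_buffer call in the script above matters.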
Here is the Kenosis/choroba approach.
#! perl -sw
use strict;

#kenosis/choroba approach
while( <> ) {
    print join( "\t", ( split /\t/, $_, 7 )[ 0, 2, 5 ]), "\n";
}
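For anyone unsure what the split-and-slice expression in both scripts does, here is a small sketch on a single sample line. The helper name pick_fields and the sample data are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep fields 0, 2 and 5 of one tab-separated line.
sub pick_fields {
    my ($line) = @_;
    chomp $line;
    # With a limit of 7, split stops after the sixth tab and leaves the rest
    # of the line in the seventh element, which the slice then ignores.
    return join "\t", ( split /\t/, $line, 7 )[ 0, 2, 5 ];
}

my $sample = join( "\t", qw( a b c d e f g h i ) ) . "\n";
print pick_fields($sample), "\n";    # prints a, c and f separated by tabs
```

The limit saves split from breaking up columns that are never used, which helps a little on wide rows.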