It turns out that BrowserUK's approach was roughly an order of magnitude faster than the other approaches presented here: about 2 minutes versus 21 to 22 minutes. I took the liberty of coding up sundialsvc4's suggestion of buffering a few hundred thousand lines' worth of data before printing them to a file. I used array refs since there was no reason to use hash refs.
I used a PowerShell script to time the different approaches:
echo 'BrowserUK 1024' > bench.txt
Measure-Command { .\buk.bat } >> bench.txt
echo 'BrowserUK 2048' >> bench.txt
Measure-Command { .\buk2048.bat } >> bench.txt
echo 'Sundial approach by Lotus1' >> bench.txt
Measure-Command { perl sundial.pl test.csv > dialout.csv} >> bench.txt
echo spacebar >> bench.txt
Measure-Command { .\sed -n "s/^\([^\t]*\t\)[^\t]*\t\([^\t]*\t\)[^\t]*\t[^\t]*\t\([^\t]*\)\t.*/\2\1\3/p" test.csv > spaceout.csv} >> bench.txt
echo 'kenosis/choroba' >> bench.txt
Measure-Command { perl kc.pl test.csv > kcout.csv} >> bench.txt
echo 'mildside -- wrong order but testing anyway' >> bench.txt
Measure-Command { .\cut '-f1,3,6' test.csv > cutout.csv} >> bench.txt
Here are the results using a 1 GB test file on an idle server (Windows Server 2008 R2) with 16 cores and 8 GB RAM:
BrowserUK 1024
Minutes : 1
Seconds : 54
Milliseconds : 103
Ticks : 1141033211
BrowserUK 2048
Minutes : 1
Seconds : 55
Milliseconds : 124
Ticks : 1151241665
Sundial approach by Lotus1
Minutes : 21
Seconds : 53
Milliseconds : 28
Ticks : 13130283183
spacebar
Minutes : 21
Seconds : 24
Milliseconds : 215
Ticks : 12842154788
kenosis/choroba
Minutes : 22
Seconds : 4
Milliseconds : 865
Ticks : 13248658887
mildside -- wrong order but testing anyway
Minutes : 22
Seconds : 19
Milliseconds : 295
Ticks : 13392954883
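To make the comparison easier to read, here is a small sketch that converts the Ticks values above into seconds and into a ratio against the fastest run. It assumes the usual .NET TimeSpan definition of a tick (100 nanoseconds); the helper names `ticks_to_seconds` and `fastest_ticks` are my own, made up for this illustration.

```perl
use strict;
use warnings;

# Ticks values copied from the Measure-Command output above.
# Assumption: a TimeSpan tick is 100 ns, so seconds = ticks / 1e7.
my @results = (
    [ 'BrowserUK 1024',             1141033211 ],
    [ 'BrowserUK 2048',             1151241665 ],
    [ 'Sundial approach by Lotus1', 13130283183 ],
    [ 'spacebar',                   12842154788 ],
    [ 'kenosis/choroba',            13248658887 ],
    [ 'mildside',                   13392954883 ],
);

# Convert a tick count to seconds.
sub ticks_to_seconds { return $_[0] / 1e7 }

# Return the smallest tick count among [ label, ticks ] pairs.
sub fastest_ticks {
    my ($min) = sort { $a <=> $b } map { $_->[1] } @_;
    return $min;
}

my $best = fastest_ticks(@results);
for my $r (@results) {
    # Columns: label, elapsed seconds, slowdown relative to the fastest run.
    printf "%-28s %8.1f s  %5.1fx\n",
        $r->[0], ticks_to_seconds( $r->[1] ), $r->[1] / $best;
}
```

The last column should come out between 11 and 12 for every approach other than BrowserUK's, which is where the "order of magnitude" figure comes from.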
Here is the sundialsvc4 approach that I put together for the test:
#! perl -sw
use strict;

my $count = 0;
my @lines;
while( <> ) {
    $count++;
    my @f = ( split /\t/, $_, 7 )[ 0, 2, 5 ];
    push @lines, \@f;
    if( $count >= 300000 ) {    # size of buffer
        $count = 0;
        print_buffer( \@lines );
    }
}
print_buffer( \@lines );

sub print_buffer {
    my ($aref) = @_;
    foreach (@$aref) {
        print join( "\t", @$_ ) . "\n";
    }
    splice( @$aref );
}
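A variation on the same buffering idea, which I have not benchmarked, is to collect the output in one string and flush it by size instead of holding an array of array refs. This is only a sketch: the function name `filter_buffered` and the 1 MB threshold are assumptions of mine, not something from the thread, and the demo reads from an in-memory string rather than test.csv.

```perl
use strict;
use warnings;

# Hypothetical helper: copy fields 0, 2 and 5 of each tab-separated line
# from $in to $out, flushing the string buffer once it reaches $limit bytes.
sub filter_buffered {
    my ( $in, $out, $limit ) = @_;
    my $buffer = '';
    while ( my $line = <$in> ) {
        chomp $line;
        $buffer .= join( "\t", ( split /\t/, $line, 7 )[ 0, 2, 5 ] ) . "\n";
        if ( length($buffer) >= $limit ) {
            print {$out} $buffer;
            $buffer = '';
        }
    }
    print {$out} $buffer;    # flush whatever is left
}

# Demo on a two-line in-memory "file" with seven fields per line.
my $sample = "a\tb\tc\td\te\tf\tg\n1\t2\t3\t4\t5\t6\t7\n";
open my $in, '<', \$sample or die "Cannot open in-memory input: $!";
filter_buffered( $in, \*STDOUT, 1024 * 1024 );
close $in;
```

To try it on the real data, open test.csv for reading and pass that handle in place of the in-memory one. Whether this changes the timings at all is an open question, since Perl's own output buffering may already be doing most of the work.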
Here is the Kenosis/choroba approach.
#! perl -sw
use strict;

#kenosis/choroba approach
while( <> ) {
    print join( "\t", ( split /\t/, $_, 7 )[ 0, 2, 5 ]), "\n";
}