Re^3: selecting columns from a tab-separated-values file

in reply to Re^2: selecting columns from a tab-separated-values file
in thread selecting columns from a tab-separated-values file

It turns out that BrowserUK's approach was an order of magnitude quicker than the other approaches presented here: just under two minutes, versus 21 to 22 minutes for each of the others. I took the liberty of coding up sundialsvc4's suggestion of buffering a few hundred thousand lines' worth of data before printing them out to a file. I used array refs since there was no reason to use hash refs.

I used a PowerShell script to time the different approaches:

echo 'BrowserUK 1024' > bench.txt
Measure-Command { .\buk.bat } >> bench.txt
echo 'BrowserUK 2048' >> bench.txt
Measure-Command { .\buk2048.bat } >> bench.txt
echo 'Sundial approach by Lotus1' >> bench.txt
Measure-Command { perl sundial.pl test.csv > dialout.csv} >> bench.txt
echo spacebar >> bench.txt
Measure-Command { .\sed -n "s/^\([^\t]*\t\)[^\t]*\t\([^\t]*\t\)[^\t]*\t[^\t]*\t\([^\t]*\)\t.*/\2\1\3/p" test.csv > spaceout.csv} >> bench.txt
echo 'kenosis/choroba' >> bench.txt
Measure-Command { perl kc.pl test.csv > kcout.csv} >> bench.txt
echo 'mildside -- wrong order but testing anyway' >> bench.txt
Measure-Command { .\cut '-f1,3,6' test.csv > cutout.csv} >> bench.txt

Here are the results using a 1 GB test file on an idle server with 16 cores and 8 GB RAM (update: Windows Server 2008 R2):

BrowserUK 1024
Minutes           : 1
Seconds           : 54
Milliseconds      : 103
Ticks             : 1141033211

BrowserUK 2048
Minutes           : 1
Seconds           : 55
Milliseconds      : 124
Ticks             : 1151241665

Sundial approach by Lotus1
Minutes           : 21
Seconds           : 53
Milliseconds      : 28
Ticks             : 13130283183

spacebar
Minutes           : 21
Seconds           : 24
Milliseconds      : 215
Ticks             : 12842154788

kenosis/choroba
Minutes           : 22
Seconds           : 4
Milliseconds      : 865
Ticks             : 13248658887

mildside -- wrong order but testing anyway
Minutes           : 22
Seconds           : 19
Milliseconds      : 295
Ticks             : 13392954883
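
To make the comparison easier to read, the tick counts above can be turned into seconds and into a ratio against the fastest run. This is only a small sketch: it relies on Measure-Command reporting a .NET TimeSpan, where one tick is 100 nanoseconds, and the shortened labels and the helper name `ticks_to_seconds` are made up for this illustration.

```perl
use strict;
use warnings;

# Tick counts copied from the Measure-Command output above.
# One .NET tick is 100 nanoseconds, so 10_000_000 ticks make one second.
my @runs = (
    [ 'BrowserUK 1024',   1141033211 ],
    [ 'BrowserUK 2048',   1151241665 ],
    [ 'Sundial approach', 13130283183 ],
    [ 'spacebar (sed)',   12842154788 ],
    [ 'kenosis/choroba',  13248658887 ],
    [ 'mildside (cut)',   13392954883 ],
);

# Hypothetical helper: convert a tick count to seconds.
sub ticks_to_seconds { return $_[0] / 10_000_000 }

# The smallest tick count is the baseline for the ratios.
my $fastest = ( sort { $a <=> $b } map { $_->[1] } @runs )[0];

for my $run (@runs) {
    my ( $label, $ticks ) = @$run;
    printf "%-18s %8.1f s  %5.1fx\n",
        $label, ticks_to_seconds($ticks), $ticks / $fastest;
}
```

Against the 1024 run, the four slower approaches come out at roughly 11 to 12 times the elapsed time, which is where the "order of magnitude" comes from.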

Here is the sundialsvc4 approach that I put together for the test:

#! perl -sw
use strict;

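my $count = 0;   # lines collected since the last flush
my @lines;       # buffer of array refs, one per input line
while( <> ) {
    $count++;
    # Split into at most 7 fields and keep columns 1, 3 and 6.
    my @f = ( split /\t/, $_, 7 )[ 0, 2, 5 ];
    push @lines, \@f;
    
    if( $count >= 300000 ) {  #size of buffer
        $count = 0;
        print_buffer( \@lines );
    }
}
print_buffer( \@lines );   # flush whatever is left after the last full buffer

# Print every buffered row as a tab-separated line, then empty the buffer.
sub print_buffer {
    my ($aref) = @_;
    foreach (@$aref) {
        print join( "\t", @$_ ) . "\n";
    }
    splice( @$aref );   # with no offset or length, splice removes every element
}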

Here is the Kenosis/choroba approach.

#! perl -sw
use strict;

#kenosis/choroba approach

while( <> ) {
    print join( "\t", ( split /\t/, $_, 7 )[ 0, 2, 5 ]), "\n";
}
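
Both scripts lean on the third argument to split, so here is a small sketch of what the limit of 7 does. The nine-column sample record is invented for the illustration and is not taken from the test file.

```perl
use strict;
use warnings;

# Hypothetical sample record with nine tab-separated columns.
my $line = join( "\t", map { "c$_" } 1 .. 9 ) . "\n";

# With a limit of 7, split stops after the sixth tab; everything that
# remains (including the newline) lands unsplit in the seventh element.
my @all = split /\t/, $line, 7;
print scalar(@all), " fields, last one: $all[6]";

# The slice picks columns 1, 3 and 6 (indices 0, 2 and 5).
my @picked = ( split /\t/, $line, 7 )[ 0, 2, 5 ];
print join( "\t", @picked ), "\n";
```

The limit keeps split from doing work on columns that are never used. One caveat, assuming the data could ever contain records with exactly six columns: the sixth field would then keep its trailing newline, so a chomp before the split would be a cheap safeguard.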

Re^4: selecting columns from a tab-separated-values file by mildside (Friar) on Jan 24, 2013 at 22:59 UTC
Great job there, Lotus1. I'm curious about the use of splice to clear the array in your code: `splice( @$aref );` I would probably have used `@$aref = ();` instead. Is splice faster, or better in some other way?
Re^5: selecting columns from a tab-separated-values file by Lotus1 (Vicar) on Jan 25, 2013 at 02:23 UTC
Thanks. I started with assigning the empty list, but I had a bug somewhere, so I stuck the splice in and got it working. I think I forgot the '@' on the first try but put it in with splice. I don't know which is faster, but it is only called a handful of times in this approach anyway.
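
For anyone curious about the question above, here is a rough sketch that checks both forms and times them with the core Benchmark module. The line `$copy = ();` is only a guess at what the forgotten '@' might have looked like, and the array size and the one-second run time are arbitrary choices.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Both forms empty the caller's array through the reference.
my @buffer = ( 1 .. 5 );
my $aref   = \@buffer;
splice( @$aref );
print "after splice: ", scalar(@buffer), " elements\n";

@buffer = ( 1 .. 5 );
@$aref  = ();
print "after assignment: ", scalar(@buffer), " elements\n";

# Without the '@', only the local reference is overwritten (it becomes
# undef) and the caller's array keeps its contents.
@buffer = ( 1 .. 5 );
my $copy = \@buffer;
$copy = ();   # assumed form of the mistyped statement
print "after clobbering the reference: ", scalar(@buffer), " elements\n";

# Rough timing. Filling the array is part of each sub, so the numbers
# only hint at the difference between the two ways of clearing it.
cmpthese( -1, {
    splice => sub { my @a = ( 1 .. 1000 ); splice(@a); },
    assign => sub { my @a = ( 1 .. 1000 ); @a = (); },
} );
```

Either way, the clearing step runs once per 300000 lines in the buffered script, so it is unlikely to matter for the timings above.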
