This won't be the fastest solution in the world, but it will handle an input file of any size, provided you have room in memory for the results set and room on disk for some temporary files. Beyond the results set, it needs very little memory.
It basically does two passes.
- Read the file one line at a time and write each column to a separate file.
- Then read those files in order, and accumulate the required data.
If the results set itself poses a memory problem, then the results could be written as they are accumulated.
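For what it's worth, that write-as-you-go variant might look something like the sketch below. The sub name `stream_columns` and the three separate output handles are my own assumptions, not part of the script in this post. It assumes one readable handle per column with one value per line, and it writes each count, row index, and value as soon as it is known, so nothing accumulates in memory.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): scan per-column
# handles in order and write each result as soon as it is known.
#   $cols    - array ref of readable handles, one per column, one value per line
#   $ptr_out - receives the running non-zero count after each column
#   $idx_out - receives the zero-based row index of each non-zero value
#   $val_out - receives each non-zero value
sub stream_columns {
    my ( $cols, $ptr_out, $idx_out, $val_out ) = @_;
    my $total = 0;
    print {$ptr_out} "0\n";               # count before the first column
    for my $fh (@$cols) {
        my $row = 0;
        while ( my $value = <$fh> ) {
            chomp $value;
            if ( 0 + $value ) {           # keep non-zero entries only
                print {$idx_out} "$row\n";
                print {$val_out} "$value\n";
                ++$total;
            }
            ++$row;
        }
        print {$ptr_out} "$total\n";      # running count at the end of this column
    }
    return $total;
}
```

The three outputs end up in three files rather than three lines, which is the price of never holding more than one value at a time.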
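#! perl -slw
use strict;

# Prefix for the per-column temporary files.
use constant TEMPNAME => 'temp,out.';

# The first line determines the number of columns.
my @row = split ' ', scalar <>;

# Pass 1: open one read/write temp file per column, then write each
# field of every input line to its column's file.
my @fhs;
open $fhs[ $_ ], '+>', TEMPNAME . $_ or die "Cannot open " . TEMPNAME . "$_: $!"
    for 0 .. $#row;
print { $fhs[ $_ ] } $row[ $_ ] for 0 .. $#row;

while( <> ) {
    @row = split;
    print { $fhs[ $_ ] } $row[ $_ ] for 0 .. $#row;
}

# Pass 2: rewind each column file in turn and collect the cumulative
# non-zero counts, the row indices, and the non-zero values.
my( $i, @cCounts, @iRows, @nonZs ) = ( 0, 0 );
for my $fh ( @fhs ) {
    seek $fh, 0, 0;
    my $count = 0;
    while( <$fh> ) {
        chomp;
        next unless 0+$_;          # skip zero entries
        ++$count;
        $iRows[ $i ] = $. - 1;     # zero-based row index
        $nonZs[ $i ] = $_;
        ++$i;
    }
    push @cCounts, $cCounts[ $#cCounts ] + $count;
}

print "@$_" for \( @cCounts, @iRows, @nonZs );

# Clean up the temporary files.
close $_ for @fhs;
unlink TEMPNAME . $_ for 0 .. $#fhs;
__END__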
C:\test>791009 sample.dat
0 2 5 9 10 12
0 1 0 2 4 1 2 3 4 2 1 4
2 3 3 -1 4 4 -3 1 2 2 6 1
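To sanity-check output like the above, the three lines can be expanded back into a dense matrix. The sketch below reads them as cumulative per-column counts, row indices, and non-zero values, which is how the script builds them. The sub name `expand_columns` is made up for illustration, and the row count of 5 is an assumption: the largest row index printed is 4, but any trailing all-zero rows in sample.dat would not show up in the output.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild a dense matrix (array of row array refs)
# from cumulative column counts, row indices, and non-zero values.
sub expand_columns {
    my ( $ptr, $rows, $vals, $n_rows ) = @_;
    my $n_cols = $#$ptr;                        # one more count than columns
    my @dense  = map { [ (0) x $n_cols ] } 1 .. $n_rows;
    for my $c ( 0 .. $n_cols - 1 ) {
        # Entries $ptr->[$c] .. $ptr->[$c+1]-1 belong to column $c.
        for my $k ( $ptr->[$c] .. $ptr->[ $c + 1 ] - 1 ) {
            $dense[ $rows->[$k] ][$c] = $vals->[$k];
        }
    }
    return \@dense;
}

my $dense = expand_columns(
    [ 0, 2, 5, 9, 10, 12 ],
    [ 0, 1, 0, 2, 4, 1, 2, 3, 4, 2, 1, 4 ],
    [ 2, 3, 3, -1, 4, 4, -3, 1, 2, 2, 6, 1 ],
    5,    # assumed row count, see the note above
);
print "@$_\n" for @$dense;
```

With those inputs the rows come out as `2 3 0 0 0`, `3 0 4 0 6`, `0 -1 -3 2 0`, `0 0 1 0 0` and `0 4 2 0 1`.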
The only thing to watch for: if your data contains a really huge number of columns (more than ~4000), some systems may baulk at having that many files open concurrently.
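If you do hit that limit, one possible workaround (my suggestion, not something the script above does) is to split the columns in batches, re-reading the input once per batch so that only a bounded number of temp files is open at any time. The sub names, the `$max_open` parameter, and the file-name prefix below are all hypothetical, and the sketch assumes the input is a named file that can be reopened rather than a pipe.

```perl
use strict;
use warnings;

# Hypothetical helper: write columns $first .. $last of $in_file to
# temp files named "$prefix$column", one value per line.
sub split_column_batch {
    my ( $in_file, $first, $last, $prefix ) = @_;
    my @fhs;
    for my $c ( $first .. $last ) {
        open $fhs[ $c - $first ], '>', $prefix . $c
            or die "Cannot create $prefix$c: $!";
    }
    open my $in, '<', $in_file or die "Cannot read $in_file: $!";
    while ( my $line = <$in> ) {
        my @row = split ' ', $line;
        print { $fhs[ $_ - $first ] } "$row[$_]\n" for $first .. $last;
    }
    close $in;
    for my $fh (@fhs) {
        close $fh or die "Cannot close temp file: $!";
    }
    return;
}

# Walk the columns in batches so that at most $max_open temp files
# are open at once. The input is read once per batch.
sub split_in_batches {
    my ( $in_file, $n_cols, $max_open, $prefix ) = @_;
    for ( my $first = 0; $first < $n_cols; $first += $max_open ) {
        my $last = $first + $max_open - 1;
        $last = $n_cols - 1 if $last > $n_cols - 1;
        split_column_batch( $in_file, $first, $last, $prefix );
    }
    return;
}
```

The second pass would then open the column files one at a time instead of rewinding handles that are already open. The cost is one extra read of the input per batch, so this only makes sense when the handle limit is the actual problem.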
For comparison purposes it took around 4 minutes to process a 1000 column X 10,000 row dataset. (Although the filesystem was still flushing its caches to disc for several minutes after that completed :)