
This won't be the fastest solution in the world, but it will handle any size of input file, provided you have room in memory for the results set and room on disk for some temporary files. Beyond the results set itself, it needs only minimal memory.

It basically does two passes.

Read the file one line at a time and write each column to a separate file.
Then read those files in order, and accumulate the required data.
If the results set itself poses a memory problem, then the results could be written as they are accumulated.
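
A rough sketch of that last variation, covering the second pass only. It is not part of the script below; the sub name stream_second_pass and the .counts/.rows/.vals output names are made up for illustration, and it assumes the per-column files already exist with one value per line.

```perl
use strict;
use warnings;

# Hypothetical helper: read the column files in order and write the
# results out as they are found, so only a few scalars stay in memory.
sub stream_second_pass {
    my( $colFiles, $outPrefix ) = @_;

    open my $cOut, '>', "$outPrefix.counts" or die "$outPrefix.counts: $!";
    open my $rOut, '>', "$outPrefix.rows"   or die "$outPrefix.rows: $!";
    open my $vOut, '>', "$outPrefix.vals"   or die "$outPrefix.vals: $!";

    my $running = 0;
    print {$cOut} "0\n";                    # cumulative counts start at 0

    for my $file ( @$colFiles ) {
        open my $in, '<', $file or die "$file: $!";
        my $row = 0;
        while( my $val = <$in> ) {
            chomp $val;
            if( 0 + $val ) {                # same non-zero test as the script
                print {$rOut} "$row\n";     # 0-based row index
                print {$vOut} "$val\n";
                ++$running;
            }
            ++$row;
        }
        close $in;
        print {$cOut} "$running\n";         # one cumulative count per column
    }
    close $_ or die "close: $!" for $cOut, $rOut, $vOut;
    return $running;                        # total number of non-zeros
}
```

The three output files hold the same three sequences the script prints, one value per line instead of space-separated.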

#! perl -slw
use strict;
use constant TEMPNAME => 'temp,out.';

my @row = split ' ', scalar <>;
my @fhs;
# One read/write temp file per column; die early if any open fails.
open $fhs[ $_ ], '+>', TEMPNAME . $_ or die TEMPNAME . "$_: $!" for 0 .. $#row;

print { $fhs[ $_ ] } $row[ $_ ] for 0 .. $#row;

while( <> ) {
    @row = split;
    print { $fhs[ $_ ] } $row[ $_ ] for 0 .. $#row;
}

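# @cCounts starts as (0) and holds cumulative non-zero counts per column.
# @iRows and @nonZs hold the row index and value of each non-zero.
my( $i, @cCounts, @iRows, @nonZs ) = ( 0, 0 );

for my $fh ( @fhs ) {
    seek $fh, 0, 0;                 # rewind the column file for reading
    my $count = 0;
    while( <$fh> ) {
        chomp;
        next unless 0+$_;           # skip zeros
        ++$count;
        $iRows[ $i ] = $. - 1;      # $. is 1-based; row indexes are 0-based
        $nonZs[ $i ] = $_;
        ++$i;
    }
    push @cCounts, $cCounts[ $#cCounts ] + $count;
}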

print "@$_" for \( @cCounts, @iRows, @nonZs );

close $_ for @fhs;
unlink TEMPNAME . $_ for 0 .. $#fhs;

__END__
C:\test>791009 sample.dat
0 2 5 9 10 12
0 1 0 2 4 1 2 3 4 2 1 4
2 3 3 -1 4 4 -3 1 2 2 6 1
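
For anyone unsure how to read those three lines: the first is the running count of non-zeros at each column boundary, the second is the 0-based row index of each non-zero, and the third is the values themselves, all in column order. A small sketch that expands them back into a dense matrix may make that clearer. The sub name expand is made up, and the row count of 5 is an assumption, since the output does not record how many rows the input had.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild a dense array of rows from the three arrays.
sub expand {
    my( $cCounts, $iRows, $nonZs, $nRows ) = @_;
    my $nCols = $#$cCounts;                 # one fewer than the boundary count
    my @dense = map { [ (0) x $nCols ] } 1 .. $nRows;
    for my $col ( 0 .. $nCols - 1 ) {
        # Entries for this column sit at cCounts[col] .. cCounts[col+1] - 1
        for my $k ( $cCounts->[ $col ] .. $cCounts->[ $col + 1 ] - 1 ) {
            $dense[ $iRows->[ $k ] ][ $col ] = $nonZs->[ $k ];
        }
    }
    return \@dense;
}

my $dense = expand(
    [ 0, 2, 5, 9, 10, 12 ],
    [ 0, 1, 0, 2, 4, 1, 2, 3, 4, 2, 1, 4 ],
    [ 2, 3, 3, -1, 4, 4, -3, 1, 2, 2, 6, 1 ],
    5,      # assumed row count; not recoverable from the output alone
);
print "@$_\n" for @$dense;
# Prints:
# 2 3 0 0 0
# 3 0 4 0 6
# 0 -1 -3 2 0
# 0 0 1 0 0
# 0 4 2 0 1
```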

The only thing to watch for: if your data contains a really huge number of columns (more than ~4000), some systems may baulk at having that many files open concurrently.
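
One possible way around that limit, sketched rather than tested at scale, is to split the columns in batches and re-read the input once per batch. The sub name split_columns_batched and its parameters are made up for illustration. It assumes the input is a real (seekable) file rather than a pipe, and that every row has the same number of fields as the first.

```perl
use strict;
use warnings;

# Hypothetical helper: write each column to its own file, keeping at most
# $maxOpen column files open at once. Costs one extra read of the input
# per batch in exchange for fewer open handles.
sub split_columns_batched {
    my( $inFile, $tempPrefix, $maxOpen ) = @_;

    open my $in, '<', $inFile or die "$inFile: $!";
    my $first  = <$in>;
    my @fields = defined $first ? split( ' ', $first ) : ();
    my $nCols  = @fields;                   # column count from the first row

    for( my $lo = 0; $lo < $nCols; $lo += $maxOpen ) {
        my $hi = $lo + $maxOpen - 1;
        $hi = $nCols - 1 if $hi > $nCols - 1;

        my %fh;
        for my $c ( $lo .. $hi ) {
            open $fh{ $c }, '>', "$tempPrefix$c" or die "$tempPrefix$c: $!";
        }

        seek $in, 0, 0 or die "seek: $!";   # rewind the input for this batch
        while( my $line = <$in> ) {
            my @row = split ' ', $line;
            print { $fh{ $_ } } "$row[ $_ ]\n" for $lo .. $hi;
        }
        close $_ or die "close: $!" for values %fh;
    }
    close $in;
    return $nCols;
}
```

The column files could then be reopened one at a time for the second pass, so the number of simultaneously open handles never exceeds $maxOpen plus the input.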

For comparison purposes, it took around 4 minutes to process a 1000-column x 10,000-row dataset. (Although the filesystem was still flushing its caches to disc for several minutes after that completed :)



In reply to Re: Capturing Non-Zero Elements, Counts and Indexes of Sparse Matrix by BrowserUk
in thread Capturing Non-Zero Elements, Counts and Indexes of Sparse Matrix by neversaint
