neversaint has asked for the wisdom of the Perl Monks concerning the following question:
Dear Masters,
I have a text file that stores MxN (where M=3, N=4) matrix like this:
1 2 0 0
0 3 9 0
0 1 4 0
The actual dataset I have have much larger dimension, e.g.
M = 10^7
N = 10^2
In principle the Sparse Matrix implementation would give
us this result:
A = [ 1 2 3 9 1 4 ]
IA = [ 1 3 5 7 ]
JA = [ 1 2 2 3 2 3 ]
Let NNZ denote the number of nonzero entries of Matrix. The first array is A, which is of length NNZ, and holds all nonzero entries of M in lefttoright toptobottom order. The second array is IA, which is of length m + 1 (i.e., one entry per row, plus one). IA(i) contains the index in A of the first nonzero element of row i. Row i of the original matrix extends from A(IA(i)) to A(IA(i+1)1). The third array, JA, contains the column index of each element of A, so it also is of length NNZ.
Is there any implementation for handling such a large matrix
to give us A, IA and JA, without having have to slurp all the files into RAM?
I have tested slurping it, my 4GB ram fails.
The existing implementation with Math::MatrixSparse or PDL, all memory
consumptive.

neversaint and everlastingly indebted.......
Re: Memory Efficient Sparse Matrix Handling
by roboticus (Chancellor) on Jan 26, 2009 at 15:44 UTC

neversaint:
When I reach problems where the data is no longer willing to fit into the available RAM, I normally consider several alternatives:
 Stuffing the data into a database and express the operations in SQL, and letting a DB server do the hard lifting, or
 Examining the problem to see if it is amenable to having the arrays partitioned, operated upon and then recombined, or
 Examining the problem to see if it can be operated on in a stream fashion. (Example: IIRC, array multiplication can be done in a streaming fashion by scanning the array and tracking the partial results for rows & columns, and generating the results from the row & column partial results.)
...roboticus  [reply] 
Re: Memory Efficient Sparse Matrix Handling
by moritz (Cardinal) on Jan 26, 2009 at 15:51 UTC

Just today I sped up a C++ program significantly by using the ublas sparse matrices.
If memory is an issue, and even sparse matrix implementations in Perl don't help, I think it's not too bold of me to recommend another library and programming language. (Maybe you can even outsource the matrix operations with Inline::CPP, but I've never tried that; the module looks a bit outofdate...).  [reply] 

I personally use Inline::C when I need faster Perl. It runs fine and I haven't had any problems with it. Recently, I converted my Perl Mandelbrot program from doing the math in Perl to doing the math in C and cut the runtime from 35 minutes per run to about 2030 seconds.
The point is, I can vouch that Inline::C works, and it works well.
 [reply] 
Re: Memory Efficient Sparse Matrix Handling
by jethro (Monsignor) on Jan 26, 2009 at 17:25 UTC

As Javafan said there is no reason to read more than one number from the file at a time. To do this you might use (for example) the module File::Stream which allows you to read your array file one number at a time with
use File::Stream;
my $stream = File::Stream>new($filehandle);
$/ = qr/\s+/;
...
$nextnum= <$stream>;
UPDATE: Here is an example implementation that seems to work (but with harcoded n and m)
#!/usr/bin/perl
use warnings;
use strict;
use File::Stream;
my $m=3;
my $n=4;
my $filehandle;
open ($filehandle, '<', "testfile")  die "can't open the testfile";
my $stream = File::Stream>new($filehandle);
$/ = qr/\s+/;
my @a;
my @ia;
my @ja;
my $nnz=1;
my $x=0;
my $y=0;
my $value;
while ($value=<$stream>) {
$value=~s/\s//g; #because chomp doesn't work with File::Stream
$x++;
if ($y==0 or $x>$n) {
$x=1; $y++;
push @ia,$nnz;
}
if ($value) {
push @a, $value;
push @ja, $x;
$nnz++;
}
}
push @ia,$nnz;
if ($y!=$m and $x!=$m) {
print "Wrong number of values in array data\n";
}
print 'A= [ ', join(' ',@a),"]\n";
print 'IA= [ ', join(' ',@ia),"]\n";
print 'JA= [ ', join(' ',@ja),"]\n";
UPDATE2: Check out kennethk's version below. He reads one line at a time. If you can guarantee that n is lower than lets say 10^8 for any matrix you have to process, his version is probably a lot faster. But if you can't rule out matrices with bigger n, something like above has to be done to read even lines in smaller chunks.
 [reply] [d/l] [select] 
Re: Memory Efficient Sparse Matrix Handling
by JavaFan (Canon) on Jan 26, 2009 at 16:23 UTC

Is there any implementation for handling such a large matrix to give us A, IA and JA, without having have to slurp all the files into RAM?
As you describe A, IA and JA, it seems to me that each of them can be created by just scanning the matrix top to bottom; left to right. Since that means never having to slurp in more than the length of an element, I don't see why you would have to slurp in the file into memory.
Now, you still may have so many nonzero elements such that you don't have memory enough for A, IA and JA, but that's a different matter.
BTW, are you really counting starting from 1?
 [reply] 
Re: Memory Efficient Sparse Matrix Handling
by BrowserUk (Pope) on Jan 26, 2009 at 20:07 UTC

package Sparse;
sub new {
return bless [ [], [], [] ], $_[ 0 ];
}
sub populateFromFH {
my( $self, $fh ) = @_;
my( $a, $ia, $ja ) = @$self;
push @{ $ia }, 0;
while( <$fh> ) {
my @n = split;
my $first = 0;
1 until $n[ $first++ ];
my $len = $first;
1 while $n[ $len++ ];
$len;
$len = $first + 1;
push @{ $a }, @n[ $first .. $first + $len ];
push @{ $ja }, $first;
push @{ $ia }, scalar @{ $a };
}
}
sub getMN{
my( $self, $m, $n ) = @_;
my( $a, $ia, $ja ) = @$self;
my $first = $ja>[ $m ];
my $offset = $ia>[ $m ];
my $len = $ia>[ $m + 1 ]  $offset  1;
my @nonZeroN = @{ $a }[ $offset .. $offset + $len ];
return 0 if $n < $first or $n > $first + $len;
return $nonZeroN[ $n  $first ];
}
1;
And proof it works (for some definition ... blah):
#! perl slw
use strict;
use Sparse;
my $sa = Sparse>new;
$sa>populateFromFH( \*DATA );
close DATA;
for my $m ( 0 .. 2 ) {
print map $sa>getMN( $m, $_ ), 0 .. 3;
}
__DATA__
1 2 0 0
0 3 9 0
0 1 4 0
getNM() could be made more efficent by not extracting the whole row when you only want one value, or by returning the whole row suitably padded with leading and trailing zeros depending upon your usage requirements?
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
 [reply] [d/l] [select] 
Re: Memory Efficient Sparse Matrix Handling
by explorer (Chaplain) on Jan 26, 2009 at 17:55 UTC

With mapfraw() function, from PDL::IO::FastRaw module, you can memory map raw files from disk to memory.
Don't forget to set $PDL::BIGPDL variable to 1 to work with pdl larger of 1Gb.
Remember: the pdl mailing list is your friend (seriously).
 [reply] 
Re: Memory Efficient Sparse Matrix Handling
by kennethk (Abbot) on Jan 26, 2009 at 18:01 UTC

As long as this thread has gotten so popular, I'll throw in my 2 cents + TMTOWTDI. No matter what your final intent is, the problem you've presented is just shredding a spacedelimited file (as pointed out by JavaFan and jethro). I present my solution:
use strict;
use warnings;
my (@A, @IA, @JA) = ();
while (<DATA>) {
chomp;
my @elements = split /\s+/;
my $i = 0;
my $new_line = 1;
while (defined(my $element = shift @elements)) {
$i++;
if ($element) {
push @A, 0 + $element;
if ($new_line) {
push @IA, scalar @A;
$new_line = 0;
}
push @JA, $i;
}
}
}
push @IA, 1 + @A;
print('@A = [', join(" ", @A), "]\n");
print('@IA = [', join(" ", @IA), "]\n");
print('@JA = [', join(" ", @JA), "]\n");
__DATA__
1 2 0 0
0 3 9 0
0 1 4 0
As a side note, if you are planning on doing actual math on the results, you should plan on using seriously optimized math libraries (read LAPACK, ATLAS, Intel mkl...)  [reply] [d/l] 
Re: Memory Efficient Sparse Matrix Handling
by zentara (Archbishop) on Jan 26, 2009 at 18:35 UTC

I used to be pretty good at Matrices in school, but that was over 30 years ago and it's rusty but not forgotten. I do remember, you can reduce the rank of a matrix by decomposing it into a sum of smaller matrices multiplied by suitable row/column values, depending on how you break it down. So you might be able to convert your huge matrix into a long matrix equation of lower rank, then only load and solve as much of the equation as you can handle. This is probably the same method that roboticus talks about above.Also as a long shot, since matrices are a set of vectors, maybe Compact and sparse bit vector might be helpful in finding a way? Maybe you can use some sort of bit model to reduce the data?
 [reply] 
Re: Memory Efficient Sparse Matrix Handling
by andye (Curate) on Jan 26, 2009 at 17:50 UTC

You've already tried PDL's 'Sparse' module?
update: this is kind of offthewall, but you might also want to search cpan for RLE (runlengthencoding)  there's a bunch of imagerelated modules which can do rle for you.
update2: Also, see the rle() and rld() functions of pdl::slices.  [reply] 
Re: Memory Efficient Sparse Matrix Handling
by gone2015 (Deacon) on Jan 26, 2009 at 18:20 UTC

You didn't say how sparse the matrix was, or what form the nonzero elements take (integer in some range ? double float ? single float ?), or what operation(s) you want to perform on this matrix ?, or whether it's randomly sparse or something regular (diagonal, triangular, ...) ?
Using Perl arrays or hashes invokes some hefty overheads (particularly on 64bit machine). So, assuming the matrix is sparse enough and the data small enough, I would be tempted to do a little Inline::C to manage the data structure, the form of which would depend on the sparseness and the required operations... I can also testify to Inline::C being reasonably straightforward to use, at least on Linux  though I have recently managed it on Winders !
 [reply] [d/l] 
Re: Memory Efficient Sparse Matrix Handling
by scorpio17 (Canon) on Jan 26, 2009 at 17:16 UTC

I usually reach for
PDL in situations like this.
It allows you to program in perl, but all the hard stuff
is hidden in welloptimized clibraries.
 [reply] 

The raw data for a 10^7 x 10^2 array takes up 1_000_000_000 bytes of data per byte of the data type. For most data types, that's more than a 32bit machine can handle, regardless of whether C, PDL or Perl. That's why a sparse matrix was requested.
PDL could very well have such a data structure, but it's not "well optimized C libraries" that are going to make the difference.
 [reply] 

