Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Memory Efficient Sparse Matrix Handling

by neversaint (Deacon)
on Jan 26, 2009 at 15:31 UTC ( #738959=perlquestion: print w/ replies, xml ) Need Help??
neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
I have a text file that stores MxN (where M=3, N=4) matrix like this:
1 2 0 0 0 3 9 0 0 1 4 0
The actual dataset I have have much larger dimension, e.g.
M = 10^7 N = 10^2
In principle the Sparse Matrix implementation would give us this result:
A = [ 1 2 3 9 1 4 ] IA = [ 1 3 5 7 ] JA = [ 1 2 2 3 2 3 ]

Let NNZ denote the number of nonzero entries of Matrix. The first array is A, which is of length NNZ, and holds all nonzero entries of M in left-to-right top-to-bottom order. The second array is IA, which is of length m + 1 (i.e., one entry per row, plus one). IA(i) contains the index in A of the first nonzero element of row i. Row i of the original matrix extends from A(IA(i)) to A(IA(i+1)-1). The third array, JA, contains the column index of each element of A, so it also is of length NNZ.

Is there any implementation for handling such a large matrix to give us A, IA and JA, without having have to slurp all the files into RAM?

I have tested slurping it, my 4GB ram fails. The existing implementation with Math::MatrixSparse or PDL, all memory consumptive.

---
neversaint and everlastingly indebted.......

Comment on Memory Efficient Sparse Matrix Handling
Select or Download Code
Re: Memory Efficient Sparse Matrix Handling
by roboticus (Canon) on Jan 26, 2009 at 15:44 UTC
    neversaint:

    When I reach problems where the data is no longer willing to fit into the available RAM, I normally consider several alternatives:

    • Stuffing the data into a database and express the operations in SQL, and letting a DB server do the hard lifting, or
    • Examining the problem to see if it is amenable to having the arrays partitioned, operated upon and then recombined, or
    • Examining the problem to see if it can be operated on in a stream fashion. (Example: IIRC, array multiplication can be done in a streaming fashion by scanning the array and tracking the partial results for rows & columns, and generating the results from the row & column partial results.)
    ...roboticus
Re: Memory Efficient Sparse Matrix Handling
by moritz (Cardinal) on Jan 26, 2009 at 15:51 UTC
    Just today I sped up a C++ program significantly by using the ublas sparse matrices.

    If memory is an issue, and even sparse matrix implementations in Perl don't help, I think it's not too bold of me to recommend another library and programming language. (Maybe you can even out-source the matrix operations with Inline::CPP, but I've never tried that; the module looks a bit out-of-date...).

      I personally use Inline::C when I need faster Perl. It runs fine and I haven't had any problems with it. Recently, I converted my Perl Mandelbrot program from doing the math in Perl to doing the math in C and cut the runtime from 3-5 minutes per run to about 20-30 seconds.

      The point is, I can vouch that Inline::C works, and it works well.

Re: Memory Efficient Sparse Matrix Handling
by JavaFan (Canon) on Jan 26, 2009 at 16:23 UTC
    Is there any implementation for handling such a large matrix to give us A, IA and JA, without having have to slurp all the files into RAM?
    As you describe A, IA and JA, it seems to me that each of them can be created by just scanning the matrix top to bottom; left to right. Since that means never having to slurp in more than the length of an element, I don't see why you would have to slurp in the file into memory.

    Now, you still may have so many non-zero elements such that you don't have memory enough for A, IA and JA, but that's a different matter.

    BTW, are you really counting starting from 1?

Re: Memory Efficient Sparse Matrix Handling
by scorpio17 (Monsignor) on Jan 26, 2009 at 17:16 UTC

    I usually reach for PDL in situations like this. It allows you to program in perl, but all the hard stuff is hidden in well-optimized c-libraries.

      The raw data for a 10^7 x 10^2 array takes up 1_000_000_000 bytes of data per byte of the data type. For most data types, that's more than a 32-bit machine can handle, regardless of whether C, PDL or Perl. That's why a sparse matrix was requested.

      PDL could very well have such a data structure, but it's not "well optimized C libraries" that are going to make the difference.

Re: Memory Efficient Sparse Matrix Handling
by jethro (Monsignor) on Jan 26, 2009 at 17:25 UTC
    As Javafan said there is no reason to read more than one number from the file at a time. To do this you might use (for example) the module File::Stream which allows you to read your array file one number at a time with
    use File::Stream; my $stream = File::Stream->new($filehandle); $/ = qr/\s+/; ... $nextnum= <$stream>;

    UPDATE: Here is an example implementation that seems to work (but with harcoded n and m)

    #!/usr/bin/perl use warnings; use strict; use File::Stream; my $m=3; my $n=4; my $filehandle; open ($filehandle, '<', "testfile") || die "can't open the testfile"; my $stream = File::Stream->new($filehandle); $/ = qr/\s+/; my @a; my @ia; my @ja; my $nnz=1; my $x=0; my $y=0; my $value; while ($value=<$stream>) { $value=~s/\s//g; #because chomp doesn't work with File::Stream $x++; if ($y==0 or $x>$n) { $x=1; $y++; push @ia,$nnz; } if ($value) { push @a, $value; push @ja, $x; $nnz++; } } push @ia,$nnz; if ($y!=$m and $x!=$m) { print "Wrong number of values in array data\n"; } print 'A= [ ', join(' ',@a),"]\n"; print 'IA= [ ', join(' ',@ia),"]\n"; print 'JA= [ ', join(' ',@ja),"]\n";

    UPDATE2: Check out kennethk's version below. He reads one line at a time. If you can guarantee that n is lower than lets say 10^8 for any matrix you have to process, his version is probably a lot faster. But if you can't rule out matrices with bigger n, something like above has to be done to read even lines in smaller chunks.

Re: Memory Efficient Sparse Matrix Handling
by andye (Curate) on Jan 26, 2009 at 17:50 UTC
    You've already tried PDL's 'Sparse' module?

    update: this is kind of off-the-wall, but you might also want to search cpan for RLE (run-length-encoding) - there's a bunch of image-related modules which can do rle for you.

    update2: Also, see the rle() and rld() functions of pdl::slices.

Re: Memory Efficient Sparse Matrix Handling
by explorer (Chaplain) on Jan 26, 2009 at 17:55 UTC

    With mapfraw() function, from PDL::IO::FastRaw module, you can memory map raw files from disk to memory.

    Don't forget to set $PDL::BIGPDL variable to 1 to work with pdl larger of 1Gb.

    Remember: the pdl mailing list is your friend (seriously).

Re: Memory Efficient Sparse Matrix Handling
by kennethk (Monsignor) on Jan 26, 2009 at 18:01 UTC

    As long as this thread has gotten so popular, I'll throw in my 2 cents + TMTOWTDI. No matter what your final intent is, the problem you've presented is just shredding a space-delimited file (as pointed out by JavaFan and jethro). I present my solution:

    use strict; use warnings; my (@A, @IA, @JA) = (); while (<DATA>) { chomp; my @elements = split /\s+/; my $i = 0; my $new_line = 1; while (defined(my $element = shift @elements)) { $i++; if ($element) { push @A, 0 + $element; if ($new_line) { push @IA, scalar @A; $new_line = 0; } push @JA, $i; } } } push @IA, 1 + @A; print('@A = [', join(" ", @A), "]\n"); print('@IA = [', join(" ", @IA), "]\n"); print('@JA = [', join(" ", @JA), "]\n"); __DATA__ 1 2 0 0 0 3 9 0 0 1 4 0

    As a side note, if you are planning on doing actual math on the results, you should plan on using seriously optimized math libraries (read LAPACK, ATLAS, Intel mkl...)

Re: Memory Efficient Sparse Matrix Handling
by oshalla (Deacon) on Jan 26, 2009 at 18:20 UTC

    You didn't say how sparse the matrix was, or what form the non-zero elements take (integer in some range ? double float ? single float ?), or what operation(s) you want to perform on this matrix ?, or whether it's randomly sparse or something regular (diagonal, triangular, ...) ?

    Using Perl arrays or hashes invokes some hefty overheads (particularly on 64-bit machine). So, assuming the matrix is sparse enough and the data small enough, I would be tempted to do a little Inline::C to manage the data structure, the form of which would depend on the sparseness and the required operations... I can also testify to Inline::C being reasonably straightforward to use, at least on Linux -- though I have recently managed it on Winders !

Re: Memory Efficient Sparse Matrix Handling
by zentara (Archbishop) on Jan 26, 2009 at 18:35 UTC
    I used to be pretty good at Matrices in school, but that was over 30 years ago and it's rusty but not forgotten. I do remember, you can reduce the rank of a matrix by decomposing it into a sum of smaller matrices multiplied by suitable row/column values, depending on how you break it down. So you might be able to convert your huge matrix into a long matrix equation of lower rank, then only load and solve as much of the equation as you can handle. This is probably the same method that roboticus talks about above.

    Also as a long shot, since matrices are a set of vectors, maybe Compact and sparse bit vector might be helpful in finding a way? Maybe you can use some sort of bit model to reduce the data?


    I'm not really a human, but I play one on earth Remember How Lucky You Are
Re: Memory Efficient Sparse Matrix Handling
by BrowserUk (Pope) on Jan 26, 2009 at 20:07 UTC

    Here's a simple implementation:

    package Sparse; sub new { return bless [ [], [], [] ], $_[ 0 ]; } sub populateFromFH { my( $self, $fh ) = @_; my( $a, $ia, $ja ) = @$self; push @{ $ia }, 0; while( <$fh> ) { my @n = split; my $first = 0; 1 until $n[ $first++ ]; my $len = --$first; 1 while $n[ $len++ ]; --$len; $len -= $first + 1; push @{ $a }, @n[ $first .. $first + $len ]; push @{ $ja }, $first; push @{ $ia }, scalar @{ $a }; } } sub getMN{ my( $self, $m, $n ) = @_; my( $a, $ia, $ja ) = @$self; my $first = $ja->[ $m ]; my $offset = $ia->[ $m ]; my $len = $ia->[ $m + 1 ] - $offset - 1; my @nonZeroN = @{ $a }[ $offset .. $offset + $len ]; return 0 if $n < $first or $n > $first + $len; return $nonZeroN[ $n - $first ]; } 1;

    And proof it works (for some definition ... blah):

    #! perl -slw use strict; use Sparse; my $sa = Sparse->new; $sa->populateFromFH( \*DATA ); close DATA; for my $m ( 0 .. 2 ) { print map $sa->getMN( $m, $_ ), 0 .. 3; } __DATA__ 1 2 0 0 0 3 9 0 0 1 4 0

    getNM() could be made more efficent by not extracting the whole row when you only want one value, or by returning the whole row suitably padded with leading and trailing zeros depending upon your usage requirements?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://738959]
Approved by ikegami
Front-paged by tye
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (14)
As of 2014-08-27 20:33 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (252 votes), past polls