Capturing Non-Zero Elements, Counts and Indexes of Sparse Matrix

by neversaint (Deacon)
 on Aug 25, 2009 at 08:25 UTC Need Help??

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
I have the following sparse matrix A.
```       2   3   0   0   0
3   0   4   0   6
0  -1  -3   2   0
0   0   1   0   0
0   4   2   0   1
Then I would like to capture the following information from there:
1. cumulative count of entries, as matrix is scanned columnwise. Yielding:   Ap = [ 0, 2, 5, 9, 10, 12 ];
2. row indices of entries, as matrix is scanned columnwise. Yielding:  Ai = [0, 1, 0,  2, 4, 1,  2, 3, 4, 2, 1, 4 ];
3. Non-zero matrix entries, as matrix is scanned columnwise. Yielding: Ax = [2, 3, 3, -1, 4, 4, -3, 1, 2, 2, 6, 1];

Since the actual matrix A is potentially very2 large, is there any efficient way in Perl that can capture those elements? Especially without slurping all matrix A into RAM.

I am stuck with the following code. Which doesn't give what I want.
```    use strict;
use warnings;

my (@Ax, @Ai, @Ap) = ();
while (<>) {
chomp;
my @elements = split /\s+/;
my \$i = 0;
my \$new_line = 1;
while (defined(my \$element = shift @elements)) {
\$i++;
if (\$element) {
push @Ax, 0 + \$element;
if (\$new_line) {
push @Ai, scalar @Ax;
\$new_line = 0;
}

push @JA, \$i;
}
}
}
push @Ai, 1 + @Ax;
print('@Ax  = [', join(" ", @Ax), "]\n");
print('@Ai = [', join(" ", @Ai), "]\n");
print('@Ap = [', join(" ", @Ap), "]\n");

---
neversaint and everlastingly indebted.......

Replies are listed 'Best First'.
Re: Capturing Non-Zero Elements, Counts and Indexes of Sparse Matrix
by gone2015 (Deacon) on Aug 25, 2009 at 09:06 UTC

I'm guessing the array is held in rows. I suggest reading your the non-zeros of the matrix into an array of arrays and sorting by column & row number, along these lines (untried, untested):

```  my @sparse = () ;
while (...whatever...) {
my \$value = ... ;  # next value
my \$col   = ... ;  # in this column
my \$row   = ... ;  # and this row
if (\$value) {      # keep only the non-zeros
push @sparse, [\$col, \$row, \$value] ;
} ;
} ;

@sparse = sort { (\$\$a[0] <=> \$\$b[0]) || (\$\$a[1] <=> \$\$b[1]) || die }
+ @sparse ;
The array of arrays can now be scanned to get the cumulative column count Ap, and is in the order you require for Ai and Ax.

If there are too many non-zero values to be held in memory, then you have a serious problem -- but you could partition it by reading the input 'n' times, processing a batch of columns each time.

If 'n' is big, then you'll need to consider ways to reduce the size of the array of arrays... but that's going to require a deeper understanding of the original data.

Update: corrected the sort line and added proof of concept code:

```use strict ;
use warnings ;

my @sparse = () ;
my \$col = 0 ;  # emerges from loop set to # of columns
my \$row = 0 ;  # emerges from loop set to # of rows

while (<DATA>) {
s/\s+/ /g ; s/^ // ; s/ \$// ;
\$col = 0 ;
foreach my \$val (split / /) {
if (\$val) { push @sparse, [\$col, \$row, \$val] ; } ;
++\$col ;
} ;
++\$row ;
} ;

@sparse = sort { (\$\$a[0] <=> \$\$b[0]) || (\$\$a[1] <=> \$\$b[1]) || die } @
+sparse ;

my \$Ap = 0 ;
print "Ap = \$Ap" ;
my \$c = 0 ;
foreach my \$a (@sparse) {
while (\$c < \$\$a[0]) {
print " \$Ap" ;
++\$c ;
} ;
++\$Ap ;
} ;
while (\$c++ < \$col) { print " \$Ap" ; } ;
print "\n" ;

print "Ai =" ;
foreach my \$a (@sparse) {
print " \$\$a[1]" ;
} ;
print "\n" ;

print "Ax =" ;
foreach my \$a (@sparse) {
print " \$\$a[2]" ;
} ;
print "\n" ;

__DATA__
2   3   0   0   0
3   0   4   0   6
0  -1  -3   2   0
0   0   1   0   0
0   4   2   0   1
output:
```Ap = 0 2 5 9 10 12
Ai = 0 1 0 2 4 1 2 3 4 2 1 4
Ax = 2 3 3 -1 4 4 -3 1 2 2 6 1
```

Re: Capturing Non-Zero Elements, Counts and Indexes of Sparse Matrix
by BioLion (Curate) on Aug 25, 2009 at 09:36 UTC

It is easy enough to manipulate matrices once you have the whole thing in memory (Math::Matrix), but i don't know if there is a clever way of reading by rows and calculating column scores on-the-fly like you want.

Here is one way of at least stripping out the non-zero elements, storing them column-wise and then joining them into one array and printing them out. I would have thought it would be easier to keep the values and indices (@Ai and @Ax) as array refs anyways, it would make it easier to reconstitute the matrix later one if you wanted to?

Anyways, here is my attempt at the problem, as usual, i am sure there are more elegant ways, but KISS always make more sense to me...

```#! usr/bin/perl

use strict;
use warnings;
use diagnostics;

# 1.   cumulative count of entries, as matrix is scanned columnwise.
#      Yielding:   Ap = [ 0, 2, 5, 9, 10, 12 ];
# 2. row indices of entries, as matrix is scanned columnwise.
#      Yielding:  Ai = [0, 1, 0,  2, 4, 1,  2, 3, 4, 2, 1, 4 ];
# 3. Non-zero matrix entries, as matrix is scanned columnwise.
#      Yielding: Ax = [2, 3, 3, -1, 4, 4, -3, 1, 2, 2, 6, 1];

my (@Ai, @Ax, @Ap,) = ();
while (<DATA>) {

chomp;
my @elements = split /\s+/;
## process row-wise
for (0 .. \$#elements){
## store non-zero elements by column
if (\$elements[\$_] != 0){
## store values by column
push @{ \$Ax[\$_] }, \$elements[\$_];
## store column index (minus one to make it start at zero)
push @{ \$Ai[\$_] }, (\$. -1);
## increment column totals
++\$Ap[\$_];
}
}
}
## make cumulative totals
unshift @Ap, 0;
for (1 .. \$#Ap){
\$Ap[\$_] +=\$Ap[\$_-1];
}
## join array refs together
@Ai = map{ @\$_ } @Ai;
@Ax = map{ @\$_ } @Ax;

print('@Ax  = [', join(" ", @Ax), "]\n");
print('@Ai = [', join(" ", @Ai), "]\n");
print('@Ap = [', join(" ", @Ap), "]\n");

__DATA__
2   3   0   0   0
3   0   4   0   6
0  -1  -3   2   0
0   0   1   0   0
0   4   2   0   1

Gives :

```~ \$  perl Desktop/test.pl
@Ax  = [2 3 3 -1 4 4 -3 1 2 2 6 1]
@Ai = [0 1 0 2 4 1 2 3 4 2 1 4]
@Ap = [0 2 5 9 10 12]

HTH

Just a something something...
Re: Capturing Non-Zero Elements, Counts and Indexes of Sparse Matrix
by BrowserUk (Pope) on Aug 25, 2009 at 15:19 UTC

This won't be the fastest solution in the world, but it will handle any size of input file provided you have room in memory for the results set. And room on disk for some temporary files. It only requires minimal memory.

It basically does two passes.

1. Read the file one line at a time and write each column to a separate file.
2. Then read those files in order, and accumulates the required data.

If the results set itself poses a memory problem, then the results could be written as they are accumulated.

```#! perl -slw
use strict;
use constant TEMPNAME => 'temp,out.';

my @row = split ' ', scalar <>;
my @fhs;
open \$fhs[ \$_ ], '+>', TEMPNAME . \$_ for 0 .. \$#row;

print { \$fhs[ \$_ ] } \$row[ \$_ ] for 0 .. \$#row;

while( <> ) {
@row = split;
print { \$fhs[ \$_ ] } \$row[ \$_ ] for 0 .. \$#row;
}

my( \$i, @cCounts, @iRows, @nonZs ) = ( 0, 0 );

for my \$fh ( @fhs ) {
seek \$fh, 0, 0;
my \$count = 0;
while( <\$fh> ) {
chomp;
next unless 0+\$_;
++\$count;
\$iRows[ \$i ] = \$. - 1;
\$nonZs[ \$i ] = \$_;
++\$i;
}
push @cCounts, \$cCounts[ \$#cCounts ] + \$count;
}

print "@\$_" for \( @cCounts, @iRows, @nonZs );

close \$_ for @fhs;
unlink TEMPNAME . \$_ for 0 .. \$#fhs;

__END__
C:\test>791009 sample.dat
0 2 5 9 10 12
0 1 0 2 4 1 2 3 4 2 1 4
2 3 3 -1 4 4 -3 1 2 2 6 1

The only thing to watch for is if your data contains really huge numbers of columns--greater than ~4000--then some systems may baulk at having that number of files open concurrently.

For comparison purposes it took around 4 minutes to process a 1000 column X 10,000 row dataset. (Although the filesystem was still flushing its caches to disc for several minutes after that completed :)

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Create A New User
Node Status?
node history
Node Type: perlquestion [id://791009]
Approved by rovf
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (6)
As of 2020-02-18 16:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
What numbers are you going to focus on primarily in 2020?

Results (76 votes). Check out past polls.

Notices?