Task 2: Submatrix Sum
Submitted by: Jorg Sommrey
You are given a NxM matrix A of integers.
Write a script to construct a (N-1)x(M-1) matrix B
having elements that are the sum over the 2x2 submatrices of A,
b[i,k] = a[i,k] + a[i,k+1] + a[i+1,k] + a[i+1,k+1]
Example 1
Input: $a = [
[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]
]
Output: $b = [
[14, 18, 22],
[30, 34, 38]
]
I tried to crank it up for entertainment: to solve not "2x2", but arbitrary "WxH" submatrices of (relatively) large matrices. Sadly, it was only late into many stupid plots and tables that I realized that excessive summation isn't required at all. Sliding (moving) sum (or average, min/max, etc.) is a well-known concept (duh! not to uneducated me, alas). And so maybe "2x2", not "WxH", was selected for the task not only because PWC tries to accommodate new learners. Anyway, I implemented a "sliding submatrix sums" algorithm only at a later stage, in PDL, and didn't "backport" it to pure Perl (nor PDL::PP), because I had already become somewhat fed up with this 248-2, but decided to present sanitized results as a meditation.
(This is a "meditation"; there's no clean, runnable, complete script as such. Everything ended up in a kind of messy draft-like state. The benchmarking subroutine from my previous meditation was tweaked and reused. There is no testing for correct input, e.g. whether the requested submatrix size is larger than the matrix itself. "You are given a matrix of integers" in a PWC task is as unavoidable as "Once upon a time" in a fairy tale, and was ignored: all data below are doubles. That said, I'll gladly provide any details if requested.)
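The sliding-sum idea mentioned above is easy to state in a few lines of Python (an illustration only, not part of the Perl/PDL code below): each new window sum is the previous one, plus the entering element, minus the leaving one, so the whole pass is O(N) regardless of the window width.

```python
def sliding_sums(a, w):
    """Sums of all length-w windows of a, in O(len(a)) time."""
    s = sum(a[:w])            # first window: the only full summation
    out = [s]
    for i in range(w, len(a)):
        s += a[i] - a[i - w]  # add entering element, drop leaving one
        out.append(s)
    return out

print(sliding_sums([1, 2, 3, 4], 2))  # -> [3, 5, 7]
```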
Almost all Perl solutions of "PWC 248-2" utilize 2 nested loops and then sum the 4 terms, as the task is stated. Just a few explicitly expose 4 nested loops, where the 2 innermost loop just over (0, 1) each. In my tests this penalizes performance for these "few" solutions by about a factor of 2 but, looking at their code, it is tempting to use it as a template for WxH submatrix sums. However, it quickly became clear that a nesting depth of 3, not 4, is enough if sums of partial rows are reused, i.e. stored in a temporary array. Consider:
sub sms_WxH_Perl_4loops ( $m, $w, $h ) {
    my ( $W, $H ) = ( scalar $m->[0]->@*, scalar @$m );
    my @ret;
    for my $i ( 0 .. $H - $h ) {
        for my $j ( 0 .. $W - $w ) {
            for my $k ( 0 .. $w - 1 ) {
                for my $l ( 0 .. $h - 1 ) {
                    $ret[ $i ][ $j ] += $m->[ $i + $l ][ $j + $k ];
                }
            }
        }
    }
    return \@ret
}
sub sms_WxH_Perl_3loops ( $m, $w, $h ) {
    my ( $W, $H ) = ( scalar $m->[0]->@*, scalar @$m );
    my ( @tmp, @ret );
    for my $i ( 0 .. $H - 1 ) {
        for my $j ( 0 .. $W - $w ) {
            for my $k ( 0 .. $w - 1 ) {
                $tmp[ $i ][ $j ] += $m->[ $i ][ $j + $k ];
            }
            next if $i < $h - 1;
            for my $k ( 0 .. $h - 1 ) {
                $ret[ $i - $h + 1 ][ $j ] += $tmp[ $i - $k ][ $j ];
            }
        }
    }
    return \@ret
}
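As a quick cross-check of the two routines above, here is a transcription to Python (names and structure mirror the Perl; a sanity sketch, not a benchmark):

```python
def sms_WxH_4loops(m, w, h):
    # brute force: sum the full w*h window for every output cell
    W, H = len(m[0]), len(m)
    return [[sum(m[i + l][j + k] for k in range(w) for l in range(h))
             for j in range(W - w + 1)]
            for i in range(H - h + 1)]

def sms_WxH_3loops(m, w, h):
    W, H = len(m[0]), len(m)
    # tmp[i][j] = sum of m[i][j .. j+w-1], computed once and reused
    tmp = [[sum(row[j:j + w]) for j in range(W - w + 1)] for row in m]
    return [[sum(tmp[i + l][j] for l in range(h))
             for j in range(W - w + 1)]
            for i in range(H - h + 1)]

a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(sms_WxH_4loops(a, 2, 2))  # -> [[14, 18, 22], [30, 34, 38]]
print(sms_WxH_3loops(a, 2, 2))  # -> same
```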
$m = [
    map [
        map rand, 0 .. 149
    ], 0 .. 149
];
__END__
[ASCII plot omitted: Time (s) vs. N (NxN submatrix, 150x150 matrix); log time axis, 0.01 .. 10 s, N = 0 .. 140; legend: A = sms_WxH_Perl_4loops, B = sms_WxH_Perl_3loops. Data in the table below.]
+-----+-------+-------+
|  N  |   A   |   B   |
+-----+-------+-------+
|  10 | 0.250 | 0.053 |
|  20 | 0.809 | 0.091 |
|  30 | 1.528 | 0.116 |
|  40 | 2.288 | 0.134 |
|  50 | 2.988 | 0.144 |
|  60 | 3.512 | 0.147 |
|  70 | 3.828 | 0.144 |
|  80 | 3.862 | 0.134 |
|  90 | 3.628 | 0.122 |
| 100 | 3.122 | 0.103 |
| 110 | 2.453 | 0.088 |
| 120 | 1.675 | 0.066 |
| 130 | 0.900 | 0.044 |
| 140 | 0.284 | 0.022 |
+-----+-------+-------+
Switching to caching and a loop depth of 3 pays off quite handsomely.
###########################################################
###########################################################
###                                                     ###
###   Enough of "pure Perl". Show us PDL as promised!   ###
###                                                     ###
###########################################################
###########################################################
My initial (i.e. 2x2) PDL solution was:
sub sms_2x2_PDL ( $m ) {
    $m->slice( '0:-2', '0:-2' ) +
    $m->slice( '1:-1', '0:-2' ) +
    $m->slice( '0:-2', '1:-1' ) +
    $m->slice( '1:-1', '1:-1' )
}
And, "upgraded" to WxH:
sub sms_WxH_PDL_naive ( $m, $w, $h ) {
    my $ret = 0;
    for my $r ( 0 .. $h - 1 ) {
        my $r_sel = "$r:" . ( $r - $h );
        for my $c ( 0 .. $w - 1 ) {
            $ret += $m->slice( "$c:" . ( $c - $w ), $r_sel )
        }
    }
    return $ret
}
It doesn't perform too well (see further), and looks like a "grass snake and hedgehog crossbreed": neither Perl nor "pure" PDL, which is supposed to be loopless. Then I (sneakily) browsed the PWC repository for 248-2 PDL solutions, if any, to "borrow" inspiration, and found a PDL::PP one, with which I had no previous experience. All the more interesting a challenge to "translate" the "4 loops" and "3 loops" Perl code to PDL::PP:
use Inline Pdlpp => <<'EOPP';

pp_def('sms_WxH_pdlpp_4loops',
    Pars => 'a(m,n); [o]b(p,q);',
    OtherPars => 'PDL_Indx par_sm_w; PDL_Indx par_sm_h;',
    RedoDimsCode => '$SIZE(p) = $SIZE(m) - $COMP(par_sm_w) + 1;
                     $SIZE(q) = $SIZE(n) - $COMP(par_sm_h) + 1;',
    Code => '
        PDL_Indx i, j, k, l, w, h, W, H;
        w = $COMP(par_sm_w);
        h = $COMP(par_sm_h);
        W = $SIZE(m);
        H = $SIZE(n);
        for (i = 0; i < H - h + 1; i++) {
            for (j = 0; j < W - w + 1; j++) {
                for (k = 0; k < w; k++) {
                    for (l = 0; l < h; l++) {
                        $b(p=>j, q=>i) += $a(m=>j+k, n=>i+l);
                    }
                }
            }
        }
    ',
);

pp_def('sms_WxH_pdlpp_3loops',
    Pars => 'a(m,n); [o]b(p,q); [t]t(p,n)',
    OtherPars => 'PDL_Indx par_sm_w; PDL_Indx par_sm_h;',
    RedoDimsCode => '$SIZE(p) = $SIZE(m) - $COMP(par_sm_w) + 1;
                     $SIZE(q) = $SIZE(n) - $COMP(par_sm_h) + 1;',
    Code => '
        PDL_Indx i, j, k, w, h, W, H;
        w = $COMP(par_sm_w);
        h = $COMP(par_sm_h);
        W = $SIZE(m);
        H = $SIZE(n);
        for (i = 0; i < H; i++) {
            for (j = 0; j < W - w + 1; j++) {
                for (k = 0; k < w; k++) {
                    $t(p=>j, n=>i) += $a(m=>j+k, n=>i);
                }
                if (i >= h - 1) {
                    for (k = 0; k < h; k++) {
                        $b(p=>j, q=>i-h+1) += $t(p=>j, n=>i-k);
                    }
                }
            }
        }
    ',
);
EOPP
So, yeah, it was a challenge, but was it fun? It's supposed to translate to XS and C and compile to native binary code, to be the fastest, right? Am I even qualified to see whether I blundered somewhere in it, above?
Not fun, for sure. I wanted to use a high-level array language and finished with a low-level C-like thing. I then looked further through the PDL assortment of standard functions and found lags, which can slice an array into overlapping infixes. It is somewhat odd, because it expects as an argument not the sliding window size but something else from which to compute it (see further), and returns its result in reverse order, hence I have to flip it back. Consider:
sub sms_WxH_PDL_lags ( $m, $w, $h ) {
    my ( $W, $H ) = $m->dims;
    return $m
        ->lags( 0, 1, 1 + $W - $w )
        ->sumover
        ->lags( 1, 1, 1 + $H - $h )
        ->xchg( 0, 1 )
        ->sumover
        ->slice( '-1:0,-1:0' )
}
$m = 1 + sequence 4, 3;

print $m,
    sms_WxH_PDL_lags( $m, 2, 2 ),
    sms_WxH_PDL_lags( $m, 3, 3 ),
    sms_WxH_PDL_lags( $m, 4, 3 ),
    sms_WxH_PDL_lags( $m, 1, 1 );
__END__
[
[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]
]
[
[14 18 22]
[30 34 38]
]
[
[54 63]
]
[
[78]
]
[
[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]
]
Looks like my function behaves as expected.
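As I read it, lags( $dim, 1, $n ) yields the $n overlapping infixes of length size - $n + 1 along $dim, newest-first, which is why the flip is needed. A tiny Python model of that reading (my assumption about the semantics, for illustration; not PDL itself):

```python
def lags_model(a, n):
    """n overlapping windows of length len(a)-n+1, in reverse order."""
    size = len(a) - n + 1
    return [a[i:i + size] for i in range(n)][::-1]

print(lags_model([1, 2, 3, 4], 3))  # -> [[3, 4], [2, 3], [1, 2]]
```

Summing each window then gives exactly the W - w + 1 sliding sums of width w, reversed.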
###########################################################
###########################################################
###                                                     ###
###                 What about results?                  ###
###                                                     ###
###########################################################
###########################################################
[ASCII plot omitted: Time (s) vs. N (NxN submatrix, PDL: Double D [1500,1500] matrix); 0 .. 2 s, N = 0 .. 1400; legend: A = sms_WxH_PDL_naive, B = sms_WxH_pdlpp_4loops, C = sms_WxH_pdlpp_3loops, D = sms_WxH_PDL_lags. Data in the table below.]
+------+-------+-------+-------+-------+
|  N   |   A   |   B   |   C   |   D   |
+------+-------+-------+-------+-------+
|    2 | 0.038 | 0.009 | 0.009 | 0.047 |
|    4 | 0.294 | 0.031 | 0.019 | 0.041 |
|    6 | 0.572 | 0.072 | 0.019 | 0.044 |
|   10 | 1.531 | 0.250 | 0.031 | 0.031 |
|   20 |       | 1.150 | 0.072 | 0.022 |
|  100 |       |       | 0.494 | 0.394 |
|  200 |       |       | 0.931 | 0.797 |
|  300 |       |       | 1.263 | 1.153 |
|  400 |       |       | 1.544 | 1.431 |
|  500 |       |       | 1.756 | 1.594 |
|  600 |       |       | 1.869 | 1.653 |
|  700 |       |       | 1.872 | 1.641 |
|  800 |       |       | 1.788 | 1.594 |
|  900 |       |       | 1.597 | 1.481 |
| 1000 |       |       | 1.362 | 1.288 |
| 1100 |       |       | 1.144 | 1.081 |
| 1200 |       |       | 0.841 | 0.809 |
| 1300 |       |       | 0.575 | 0.572 |
| 1400 |       |       | 0.291 | 0.272 |
+------+-------+-------+-------+-------+
I have no inclination to wait many, many seconds (hours?) for "A" and "B" to draw their bell-like plots. The naive code ("A") is perhaps useless beyond 2x2 for large matrices, but there my original 2x2 version can be used instead. "B" was probably useless to begin with; it's only there for completeness' sake. What I'm really (pleasantly) surprised about: my lags solution ("D") beats the PP/XS code, i.e. "C", which, to my understanding, should do the same things verbatim at a low level.
###########################################################
###########################################################
###                                                     ###
###  But where's "sliding sum" solution as advertised?  ###
###                                                     ###
###########################################################
###########################################################
Ahh, at last! Consider:
sub _do_dimension ( $m, $w ) {
    $m->slice( [ 0, $w - 1 ], [] )
      ->sumover
      ->slice( '*' )
      ->append(
          $w >= ( $m->dims )[ 0 ]
          ? empty
          : $m->slice( [ $w, -1 ], [] ) -
            $m->slice( [ 0, -1 - $w ], [] ))
      ->cumusumover
      ->xchg( 0, 1 )
}

sub sms_WxH_PDL_sliding ( $m, $w, $h ) {
    _do_dimension
        _do_dimension( $m, $w ), $h
}
$m = 1 + sequence 4, 3;

print $m,
    sms_WxH_PDL_sliding( $m, 2, 2 ),
    sms_WxH_PDL_sliding( $m, 3, 3 ),
    sms_WxH_PDL_sliding( $m, 4, 3 ),
    sms_WxH_PDL_sliding( $m, 1, 1 );
__END__
# same (correct) output as above, skipped
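For the record, the per-axis trick in _do_dimension, restated in plain Python (an illustration of the idea, not the PDL code): take the first window's sum, append the elementwise differences a[w:] - a[:-w], and run a cumulative sum; doing that along rows and then columns gives the WxH sums.

```python
def window_sums_1d(a, w):
    # first window sum, then cumulative sum of (entering - leaving) differences
    out = [sum(a[:w])]
    for enter, leave in zip(a[w:], a[:-w]):
        out.append(out[-1] + enter - leave)
    return out

def sms_sliding(m, w, h):
    rows = [window_sums_1d(row, w) for row in m]             # along rows
    cols = [window_sums_1d(list(c), h) for c in zip(*rows)]  # along columns
    return [list(r) for r in zip(*cols)]                     # transpose back

a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(sms_sliding(a, 2, 2))  # -> [[14, 18, 22], [30, 34, 38]]
```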
[ASCII plot omitted: Time (s) vs. N (NxN submatrix, PDL: Double D [1500,1500] matrix); 0 .. 1.5+ s, N = 0 .. 1400; legend: A = sms_WxH_PDL_lags, B = sms_WxH_PDL_sliding. Data in the table below.]
+------+-------+-------+
|  N   |   A   |   B   |
+------+-------+-------+
|    2 | 0.025 | 0.100 |
|    4 | 0.034 | 0.075 |
|    6 | 0.016 | 0.084 |
|   10 | 0.034 | 0.091 |
|   20 | 0.028 | 0.084 |
|  100 | 0.369 | 0.081 |
|  200 | 0.800 | 0.078 |
|  300 | 1.122 | 0.116 |
|  400 | 1.416 | 0.050 |
|  500 | 1.572 | 0.053 |
|  600 | 1.644 | 0.022 |
|  700 | 1.675 | 0.025 |
|  800 | 1.587 | 0.019 |
|  900 | 1.456 | 0.019 |
| 1000 | 1.297 | 0.009 |
| 1100 | 1.047 | 0.009 |
| 1200 | 0.813 | 0.019 |
| 1300 | 0.566 | 0.006 |
| 1400 | 0.275 | 0.003 |
+------+-------+-------+
###########################################################
###########################################################
###                                                     ###
###  It was trivial. CPAN surely has something already. ###
###              You have wasted your time.             ###
###                                                     ###
###########################################################
###########################################################
Eh? I've searched CPAN for "PDL sliding". There's PDL::Apply. I have to wrap one of the functions it provides to get a "sliding submatrix sum", as:
use PDL::Apply 'apply_rolling';

sub sms_WxH_PDL_ar ( $m, $w, $h ) {
    $m->apply_rolling( $w, 'sum' )
      ->slice( [ $w - 1, -1 ], [] )
      ->xchg( 0, 1 )
      ->apply_rolling( $h, 'sum' )
      ->slice( [ $h - 1, -1 ], [] )
      ->xchg( 0, 1 )
}
my $m = random 150, 150;

$t = time;
sms_WxH_PDL_ar( $m, 2, 2 );
say time - $t;

$t = time;
sms_WxH_PDL_ar( $m, 10, 10 );
say time - $t;

$t = time;
sms_WxH_PDL_ar( $m, 75, 75 );
say time - $t;
__END__
3.36784410476685
3.20836806297302
1.28170394897461
No plots/tables here. Note, it's back to a 150x150 matrix, i.e. the "pure Perl" league, and hugely less performant than the "pure Perl" non-sliding function at the top of the node. Did I use it wrong? Not sure what to make of it.
###########################################################
###########################################################
###                                                     ###
###                  OK. Conclusions?                   ###
###                                                     ###
###########################################################
###########################################################
- PDL is as great as Perl itself, and as fun to use/explore as ever. :)
- Maybe I'd use my sms_2x2_PDL for 2x2 sums, then sms_WxH_PDL_lags for relatively small submatrices, then switch to sms_WxH_PDL_sliding.
- If you (i.e. me) are not qualified to write performant C code, don't expect your PP/XS/C to perform better than your high-level language code only by virtue of "it's C!" and "the modern compiler will optimize it for me" (see the "D" vs. "C" case above).
- Before you (i.e. me) megalomaniacally "crank up" exercises, try looking for better algorithms to solve them. :)
Re: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)
by choroba (Cardinal) on Dec 26, 2023 at 12:42 UTC
Great post! ++ What a shame it's an anonymous one.
Have you compared your solution with wlmb's solution, which uses PDL, too?
I started with PDL as well, but I soon got frustrated and submitted a nested-loops solution.
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,1],[{}],[sub{}^*ARGV,3]
Yes, I have. His solution and my sms_2x2_PDL look almost the same, but there's a difference: he supplies a list of slices as arguments to the pdl() constructor. Such a list, regardless of its length, uses almost zero resources, because slice, like many other functions (e.g. in PDL::Slices), creates a virtual ndarray, i.e. a header pointing to the original data. The constructor, however, builds a fresh new ndarray, and that even before any summation has begun. Probably the temporary spike in memory usage is negligible (*) in the case of 2x2 submatrices. Usage for the WxH case is perhaps impractical and only speculative but, supposing 1500x1500 data and a 10x10 frame, that means 100 slices of 1490x1490 each, i.e. a ~1.7 GB temporary monster.
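The "~1.7 GB" figure is just back-of-the-envelope arithmetic: 100 stacked copies of a 1490x1490 matrix of 8-byte doubles (sizes taken from the example above):

```python
slices, side, bytes_per_double = 100, 1490, 8
total = slices * side * side * bytes_per_double  # bytes materialized by pdl()
print(total)        # 1_776_080_000 bytes
print(total / 2**30)  # roughly 1.65 GiB, i.e. "~1.7 GB"
```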
*: but why not measure its impact:
sub sms_2x2_PDL_wlmb ( $m ) {
    pdl(
        $m->slice( '0:-2,0:-2' ),
        $m->slice( '1:-1,0:-2' ),
        $m->slice( '0:-2,1:-1' ),
        $m->slice( '1:-1,1:-1' ),
    )->mv( -1, 0 )->sumover
}
__END__
[ASCII plot omitted: Time (s) vs. N (2x2 submatrix, NxN matrix); 0 .. 1.8 s, N = 0 .. 5000; legend: A = sms_2x2_PDL, B = sms_2x2_PDL_wlmb. Data in the table below.]
+------+-------+-------+
|  N   |   A   |   B   |
+------+-------+-------+
|  400 | 0.002 | 0.011 |
|  800 | 0.006 | 0.041 |
| 1200 | 0.023 | 0.098 |
| 1600 | 0.061 | 0.180 |
| 2000 | 0.109 | 0.273 |
| 3000 | 0.238 | 0.613 |
| 4000 | 0.433 | 1.148 |
| 5000 | 0.738 | 1.758 |
+------+-------+-------+
Re: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)
by tybalt89 (Monsignor) on Dec 26, 2023 at 21:19 UTC
q(May all your Christmases be white.) =~ s/Christmase/loop/r =~ s/white/implicit/r
Inspired by mention of "sliding".
#!/usr/bin/perl
use strict; # https://theweeklychallenge.org/blog/perl-weekly-challenge-248/#TASK-2
use warnings;

use List::AllUtils qw( zip_by reduce );
use Data::Dump qw( pp );

sub slide
{
  my @new;
  reduce { push @new, $a + $b; $b } @_;
  @new;
}

for ( [ [1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12] ],
      [ [1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1] ],
    )
{
  print 'Input: $a = ', pp($_), "\n";
  my @new = zip_by { [ @_ ] } map [ slide @$_ ],
            zip_by { [ @_ ] } map [ slide @$_ ], @$_;
  print 'Output: $b = ', pp(\@new), "\n\n";
}
Outputs
Input: $a = [[1 .. 4], [5 .. 8], [9 .. 12]]
Output: $b = [[14, 18, 22], [30, 34, 38]]
Input: $a = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
Output: $b = [[2, 1, 0], [1, 2, 1], [0, 1, 2]]
#!/usr/bin/perl
use strict; # https://theweeklychallenge.org/blog/perl-weekly-challenge-248/#TASK-2
use warnings;

use List::AllUtils qw( sum zip_by reductions );
use Data::Dump qw( pp );

sub nslide # size, elements
{
  my @q = splice @_, 0, shift;
  return reductions { push @q, $b; $a + $b - shift @q } sum(@q), @_;
}

my ($width, $height) = (2, 2);

for ( [ [1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12] ],
      [ [1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1] ],
    )
{
  print 'Input: $a = ', pp($_), "\n";
  my @new = zip_by { [ @_ ] } map [ nslide $height, @$_ ],
            zip_by { [ @_ ] } map [ nslide $width, @$_ ], @$_;
  print 'Output: $b = ', pp(\@new), "\n\n";
}
sub sms_WxH_Perl_sliding_tybalt89 ( $m, $width, $height ) {
    my @new = zip_by { [ @_ ] } map [ nslide $height, @$_ ],
              zip_by { [ @_ ] } map [ nslide $width, @$_ ], @$m;
    return \@new
}
__END__
[ASCII plot omitted: Time (s) vs. N (NxN submatrix, 1500x1500 matrix); 0 .. 1.8 s, N = 0 .. 1400; legend: A = sms_WxH_Perl_sliding_tybalt89. Data in the table below.]
+------+-------+
|  N   |   A   |
+------+-------+
|    2 | 1.700 |
|   10 | 1.725 |
|  100 | 1.538 |
|  200 | 1.387 |
|  300 | 1.259 |
|  400 | 1.131 |
|  500 | 1.016 |
|  600 | 0.894 |
|  700 | 0.784 |
|  800 | 0.678 |
|  900 | 0.578 |
| 1000 | 0.484 |
| 1100 | 0.394 |
| 1200 | 0.309 |
| 1300 | 0.231 |
| 1400 | 0.153 |
+------+-------+
Re: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)
by wlmb (Novice) on Dec 27, 2023 at 20:19 UTC
use v5.36;
use PDL;
use PDL::Image2D;
my $w=4;
my $h=5;
my $m=sequence(10,10); # or any other matrix
say $m->conv2d(ones($w,$h))->slice([floor(($w-1)/2),floor(-($w+1)/2)],
    [floor(($h-1)/2),floor(-($h+1)/2)]);
I expected box2d to work also, but either it doesn't work or I haven't understood it yet.
Thanks, nice find. By the way, perhaps I've chosen an imprecise word ("sliding"). Maybe in math/science any function applied to overlapping infixes is said to be applied in a sliding/moving/rolling manner. What I meant instead, with that word, was "applied using a computationally efficient/cheap algorithm".
Looks like conv2d does a very honest (therefore not efficient/cheap) 4-nested-loops style calculation, in PP/XS/C: OK for very small kernels. Here's an overcrowded plot, but the B and E cases alone would suffice to show they are the same breed.
sub sms_WxH_PDL_conv2d ( $m, $w, $h ) {
    $m->conv2d( ones( $w, $h ))
      ->slice(
          [ floor( ( $w - 1 ) / 2 ), floor( -( $w + 1 ) / 2 ) ],
          [ floor( ( $h - 1 ) / 2 ), floor( -( $h + 1 ) / 2 ) ] )
}
__END__
[ASCII plot omitted: Time (s) vs. N (NxN submatrix, PDL: Double D [1500,1500] matrix); 0 .. 1 s, N = 2 .. 16; legend: A = sms_WxH_PDL_naive, B = sms_WxH_pdlpp_4loops, C = sms_WxH_PDL_lags, D = sms_WxH_PDL_sliding, E = sms_WxH_PDL_conv2d. Data in the table below.]
+----+-------+-------+-------+-------+-------+
| N  |   A   |   B   |   C   |   D   |   E   |
+----+-------+-------+-------+-------+-------+
|  2 | 0.061 | 0.009 | 0.030 | 0.083 | 0.020 |
|  3 | 0.120 | 0.019 | 0.013 | 0.073 | 0.042 |
|  4 | 0.252 | 0.030 | 0.039 | 0.098 | 0.066 |
|  6 | 0.566 | 0.078 | 0.028 | 0.078 | 0.141 |
|  8 | 0.963 | 0.139 | 0.033 | 0.081 | 0.237 |
| 10 |       | 0.248 | 0.031 | 0.078 | 0.366 |
| 12 |       | 0.388 | 0.033 | 0.078 | 0.531 |
| 16 |       | 0.728 | 0.041 | 0.069 | 0.928 |
+----+-------+-------+-------+-------+-------+
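Rough operation counts support the "OK for very small kernels" reading: a direct WxH convolution does w*h multiply-adds per output pixel, while a running-sum scheme does a constant handful per output, independent of w and h. Illustrative arithmetic (Python, assuming a 1500x1500 input and a 16x16 kernel; the "4 ops" for sliding is a ballpark, not a measured figure):

```python
N, w, h = 1500, 16, 16
outputs = (N - w + 1) * (N - h + 1)
direct_ops  = outputs * w * h  # one multiply-add per kernel cell per output
sliding_ops = outputs * 4      # roughly: add entering / drop leaving, per axis
print(direct_ops // sliding_ops)  # -> 64, i.e. w*h/4
```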
use v5.36;
use PDL;
use PDL::FFT;
my $matrix=pdl(shift); # Format "[[m11,m21...],[m21...]...]"
my $width=my $w=shift;
my $height=my $h=shift;
my $small=ones($w%2?$w:$w+1, $h%2?$h:$h+1);
$small->slice(-1).=0, ++$w unless $w%2; # zero row and/or column for even kernels
$small->slice([],-1).=0, ++$h unless $h%2;
my $kernel=kernctr($matrix, $small); # full kernel
my $result=$matrix->copy;
$result->fftconvolve($kernel);
say "$matrix $width $height -> ",
    $result->slice([floor(($width-1)/2),floor(-($width+1)/2)],
    [floor(($height-1)/2),floor(-($height+1)/2)]);

It would be interesting to also compare it.
Three "my" and one ")" missing. Otherwise ++👍
Re: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)
by jo37 (Deacon) on Dec 30, 2023 at 16:56 UTC
use strict;
use warnings;
use PDL;
use PDL::NiceSlice;

sub submatrix_sum {
    my $m = pdl @_;

    $m->range(ndcoords($m(1:,1:)), 2)->reorder(2, 3, 0, 1)
        ->clump(2)->sumover;
}
It can easily be extended to WxH submatrices:
sub submatrix_sum {
    my $w = shift;
    my $h = shift;
    my $m = pdl @_;

    $m->range(ndcoords($m($w - 1:, $h - 1:)), [$w, $h])->reorder(2, 3, 0, 1)
        ->clump(2)->sumover;
}
Greetings, jo
$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
Thanks for a great challenge, first of all (you were our taskmaster, right?). I didn't mention your solution because it was kind of slow for comparisons. But now I see where it, and the challenge itself, originated. PDL is very TIMTOWTDI-ish, and range is an ultimately important tool to extract/address rectangular areas from ndarrays of any dimensions. But, ehhm, don't you see that the POD you linked sings praises to the wonders of broadcasting, which (i.e. broadcasting) you simply discarded? Broadcasting really only happens in this fragment:
...->sumover->sumover
which you replaced with
...->clump(2)->sumover
(Frankly, I think it's obvious that the huge speed difference between the Game-of-Life implementations in the linked tutorial is due to the way looping was generally performed, rather than to this "broadcasting" alone; but perhaps that's how tutorials work.)
Consider:
sub sms_WxH_PDL_range ( $m, $w, $h ) {
    my ( $W, $H ) = $m->dims;

    $m->range( ndcoords( $W - $w + 1, $H - $h + 1 ), [ $w, $h ])
      ->reorder( 2, 3, 0, 1 )
      ->clump( 2 )
      ->sumover
}

sub sms_WxH_PDL_range_b ( $m, $w, $h ) {
    my ( $W, $H ) = $m->dims;

    $m->range( ndcoords( $W - $w + 1, $H - $h + 1 ), [ $w, $h ])
      ->reorder( 2, 3, 0, 1 )
      ->sumover
      ->sumover
}
__END__
[ASCII plot omitted: Time (s) vs. N (NxN submatrix, PDL: Double D [300,300] matrix); 0 .. 1.6 s, N = 0 .. 25; legend: A = sms_WxH_PDL_range, B = sms_WxH_PDL_range_b, C = sms_WxH_PDL_lags, D = sms_WxH_PDL_naive. Data in the table below.]
+----+-------+-------+-------+-------+
| N  |   A   |   B   |   C   |   D   |
+----+-------+-------+-------+-------+
|  2 | 0.015 | 0.008 | 0.000 | 0.000 |
|  3 | 0.021 | 0.018 | 0.000 | 0.000 |
|  4 | 0.044 | 0.021 | 0.000 | 0.000 |
|  5 | 0.073 | 0.047 | 0.000 | 0.003 |
|  6 | 0.101 | 0.060 | 0.000 | 0.000 |
|  8 | 0.193 | 0.104 | 0.000 | 0.005 |
| 10 | 0.294 | 0.138 | 0.000 | 0.005 |
| 12 | 0.435 | 0.232 | 0.000 | 0.010 |
| 16 | 0.711 | 0.344 | 0.000 | 0.015 |
| 20 | 1.115 | 0.549 | 0.000 | 0.026 |
| 25 | 1.573 | 0.828 | 0.000 | 0.047 |
+----+-------+-------+-------+-------+
(I took the liberty of passing a couple of numbers as args to ndcoords instead of a matrix/slice, which only serves as the source of these 2 numbers.) Note, the matrix is now smaller than in previous tests. Both the A and B versions are very much slower than the hitherto slowest "naive" variant. Though ndcoords builds a relatively large ndarray to feed to range, I think range is simply not written with speed/performance as its goal.
It's actually tempting to try to improve Game of Life PDL implementation from the tutorial:
use strict;
use warnings;
use experimental qw/ say postderef signatures /;

use Time::HiRes 'time';
use PDL;
use PDL::NiceSlice;
use Test::PDL 'eq_pdl';

use constant STEPS => 100;

my $x = zeroes( 200, 200 );

# Put in a simple glider.
$x(1:3,1:3) .= pdl ( [1,1,1],
                     [0,0,1],
                     [0,1,0] );

my $backup = $x->copy;

printf "Game of Life!\nMatrix: %s, %d generations\n",
    $x->info, STEPS;

# Tutorial
my $t = time;
my $ct = 0;
for ( 1 .. STEPS ) {
    my $t_ = time;
    # Calculate the number of neighbours per cell.
    my $n = $x->range(ndcoords($x)-1,3,"periodic")->reorder(2,3,0,1);
    $n = $n->sumover->sumover - $x;
    $ct += time - $t_;
    # Calculate the next generation.
    $x = ((($n == 2) + ($n == 3)) * $x) + (($n == 3) * !$x);
}
printf "Tutorial: %0.3f s (core time: %0.3f)\n",
    time - $t, $ct;

# "Lags"
my $m = $backup->copy;
$t = time;
$ct = 0;
for ( 1 .. STEPS ) {
    my $t_ = time;
    # Calculate the number of neighbours per cell.
    my $n = sms_GoL_lags( $m ) - $m;
    $ct += time - $t_;
    # Calculate the next generation.
    $m = ((($n == 2) + ($n == 3)) * $m) + (($n == 3) * !$m);
}
printf "\"lags\": %0.3f s (core time: %0.3f)\n",
    time - $t, $ct;

die unless eq_pdl( $x, $m );

sub _do_dimension_GoL ( $m ) {
    $m->slice( -1 )->glue( 0, $m, $m->slice( 0 ))
      ->lags( 0, 1, ( $m->dims )[0] )
      ->sumover
      ->slice( '', '-1:0' )
      ->xchg( 0, 1 )
}

sub sms_GoL_lags ( $m ) {
    _do_dimension_GoL
        _do_dimension_GoL $m
}
__END__
Game of Life!
Matrix: PDL: Double D [200,200], 100 generations
Tutorial: 1.016 s (core time: 0.835)
"lags": 0.283 s (core time: 0.108)
Sorry about the crude profiling/tests; and the improvement is somewhat short of what I expected. Even with "core time" singled out (the next-generation calculation is not very efficient, e.g. the $n == 3 array is built twice, but that's another story), it is "only" 8x better. Maybe all this glueing/appending to maintain a constant matrix size and "wrap around" at the edges takes its toll.
Hi Steve (assuming it's you),
I haven't spent any effort on optimizing the solution.
Thank you for running the tests!
It is interesting to see that ->clump(2)->sumover doesn't perform as well as ->sumover->sumover.
Actually, I didn't look up the example in PDL::Broadcasting but wrote my solution from memory.
Looks like I'm getting older :)
Greetings, jo
$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
Here is the last (hopefully) instalment in this thread; it only concerns a faster, compared to the PDL CPAN tutorial, "Game of Life" implementation. But maybe some features used are too advanced for a tutorial. The improvement comes from:
- To achieve wrap-around, the ugly glueing in the parent node is replaced with dicing along each axis. Very convenient function.
- The horrible expression to compute the next generation (it requires 8 operations on whole arrays) is replaced with a shorter version (just 4).
use strict;
use warnings;
use feature 'say';

use Time::HiRes 'time';
use PDL;
use PDL::NiceSlice;
use Test::PDL 'eq_pdl';

use constant {
    WIDTH  => 20,
    HEIGHT => 20,
    STEPS  => 1000,
};

my $x = zeroes long, WIDTH, HEIGHT;

# Put in a simple glider.
$x(1:3,1:3) .= pdl ( [1,1,1],
                     [0,0,1],
                     [0,1,0] );

my $backup = $x->copy;

printf "Game of Life!\nMatrix: %s, %d generations\n",
    $x->info, STEPS;

# Tutorial
my $t = time;
for ( 1 .. STEPS ) {
    my $t_ = time;
    # Calculate the number of neighbours per cell.
    my $n = $x->range(ndcoords($x)-1,3,"periodic")->reorder(2,3,0,1);
    $n = $n->sumover->sumover - $x;
    # Calculate the next generation.
    $x = ((($n == 2) + ($n == 3)) * $x) + (($n == 3) * !$x);
}
printf "Tutorial: %0.3f s\n", time - $t;

# Faster
my $m = $backup->copy;
$t = time;
my $wrap_w = pdl [ reverse WIDTH - 1, ( 0 .. WIDTH - 1 ), 0 ];
my $wrap_h = pdl [ reverse HEIGHT - 1, ( 0 .. HEIGHT - 1 ), 0 ];
for ( 1 .. STEPS ) {
    my $n = $m
        ->dice_axis( 0, $wrap_w )
        ->lags( 0, 1, WIDTH )
        ->sumover
        ->dice_axis( 1, $wrap_h )
        ->lags( 1, 1, HEIGHT )
        ->xchg( 0, 1 )
        ->sumover;
    $n -= $m;
    $m = ( $n == 3 ) | $m & ( $n == 2 )
}
printf "Faster: %0.3f s\n", time - $t;

die unless eq_pdl( $x, $m );
__END__
Game of Life!
Matrix: PDL: Long D [20,20], 1000 generations
Tutorial: 0.341 s
Faster: 0.111 s
Matrix: PDL: Long D [200,200], 100 generations
Tutorial: 0.845 s
Faster: 0.086 s
Matrix: PDL: Long D [1000,1000], 20 generations
Tutorial: 4.422 s
Faster: 0.443 s
Matrix: PDL: Long D [2000,2000], 10 generations
Tutorial: 8.878 s
Faster: 0.872 s
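The claim that the shorter next-generation expression (4 whole-array operations) matches the tutorial's arithmetic one (8 operations) can be checked exhaustively, since a cell is 0/1 and the neighbour count is 0..8. A Python sketch of that truth-table check (illustration only; the real code operates on whole ndarrays):

```python
# tutorial: ((n==2)+(n==3))*x + (n==3)*(not x)
# faster:   (n==3) | (x & (n==2))
for x in (0, 1):
    for n in range(9):  # 0..8 live neighbours
        tutorial = ((n == 2) + (n == 3)) * x + (n == 3) * (not x)
        faster = (n == 3) | (x & (n == 2))
        assert bool(tutorial) == bool(faster), (x, n)
print("equivalent")
```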
Re: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)
by etj (Deacon) on Feb 15, 2024 at 20:45 UTC
pp_def('sms_WxH_pdlpp_4loops',
    Pars => 'a(m,n); [o]b(p,q);',
    OtherPars => 'PDL_Indx par_sm_w; PDL_Indx par_sm_h;',
    RedoDimsCode => '$SIZE(p) = $SIZE(m) - $COMP(par_sm_w) + 1;
                     $SIZE(q) = $SIZE(n) - $COMP(par_sm_h) + 1;',
    Code => '
        PDL_Indx i, j, k, l, w, h, W, H;
        w = $COMP(par_sm_w);
        h = $COMP(par_sm_h);
        W = $SIZE(m);
        H = $SIZE(n);
        for (i = 0; i < H - h + 1; i++) {
            for (j = 0; j < W - w + 1; j++) {
                for (k = 0; k < w; k++) {
                    for (l = 0; l < h; l++) {
                        $b(p=>j, q=>i) += $a(m=>j+k, n=>i+l);
                    }
                }
            }
        }
    ',
);
Can use PP features much better (untested, illustrative):
pp_def('sms_WxH_pdlpp_4loops',
    Pars => 'a(m,n); [o]b(p,q);',
    OtherPars => 'PDL_Indx par_sm_w => w; PDL_Indx par_sm_h => h;',
    RedoDimsCode => '$SIZE(p) = $SIZE(m) - $SIZE(w) + 1;
                     $SIZE(q) = $SIZE(n) - $SIZE(h) + 1;',
    Code => 'loop(q,p,w,h) %{ $b() += $a(m=>p+w, n=>q+h); %}',
);
Features used above: OtherPars can set the $SIZE of dims not otherwise used. When something is a $SIZE, you can loop() over it. If an index (named dim) is indexed by a variable of the same name, it can be omitted (as for $b() above). Obviously I used the fact that you can give several loops at once, which is very neat. Finally, your loop conditions were recalculating the expressions used to set $SIZE(p) etc.
Looking at the loop(), I suspect you want to switch the w and h loops to do everything as much as possible on the fastest-moving dimension. This algorithm would probably benefit a lot from tiling for maximum cache-hitting, which should probably become a proper PP feature. If someone wants to open a PDL issue with suggestions as to what the syntax would look like, it will probably get implemented and used on matmult.