There's more than one way to do things PerlMonks

### Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)

 on Dec 26, 2023 at 00:51 UTC Need Help??

```Task 2: Submatrix Sum
Submitted by: Jorg Sommrey

You are given a NxM matrix A of integers.

Write a script to construct a (N-1)x(M-1) matrix B
having elements that are the sum over the 2x2 submatrices of A,

b[i,k] = a[i,k] + a[i,k+1] + a[i+1,k] + a[i+1,k+1]

Example 1

Input: \$a = [
[1,  2,  3,  4],
[5,  6,  7,  8],
[9, 10, 11, 12]
]

Output: \$b = [
[14, 18, 22],
[30, 34, 38]
]

I tried to crank it up for entertainment -- to solve not "2x2", but arbitrary "WxH" sub-matrices of (relatively) large matrices. Sadly, it was too late into many stupid plots and tables, when I realized, that excessive summation isn't required at all. Sliding (moving) sum (or average, min/max, etc.) is a well known concept -- duh! not to un-educated me, alas. And so maybe "2x2", not "WxH", in the task, was selected not only because PWC tries to accommodate new learners. Anyway, I implemented "sliding submatrix sums" algorithm only at a later stage, in PDL, and didn't "backport" it to pure Perl (nor Pdlpp) because had already been somewhat fed-up with this 248-2, but decided to present sanitized results as a meditation.

(This is "meditation", there's no clean runnable complete script as such. Everything ended up in a kind of messy draft-like state. Benchmarking subroutine from my previous mediation was tweaked and re-used. No testing for correct input e.g. if requested sub-matrix size is larger than matrix. "You are given array of integers" in PWC task is as unavoidable as "Once upon a time" in a fairy tale, and was ignored -- all data below are doubles. That said, I'll gladly provide any details if requested.)

Almost all Perl solutions of "PWC 248-2" utilize 2 nested loops and then sum the 4 terms, see the task as stated. Just a few explicitly expose 4 nested loops, where 2 innermost of them loop just over (0, 1) each. It penalizes performance for these "few" solutions by about a factor of 2, in my tests, but, looking at their code, it is tempting to use it as template for WxH submatrix sums. However, it quickly became clear that depth of 3, not 4, of nested loops is enough if sums of partial rows are re-used i.e. stored in temporary array. Consider:

```sub sms_WxH_Perl_4loops( \$m, \$w, \$h ) {
my ( \$W, \$H ) = ( scalar \$m-> [0]-> @*, scalar @\$m );
my @ret;
for my \$i ( 0 .. \$H - \$h ) {
for my \$j ( 0 .. \$W - \$w ) {
for my \$k ( 0 .. \$w - 1 ) {
for my \$l ( 0 .. \$h - 1 ) {
\$ret[ \$i ][ \$j ] += \$m-> [ \$i + \$l ][ \$j + \$k ];
}
}
}
}
return \@ret
}

sub sms_WxH_Perl_3loops( \$m, \$w, \$h ) {
my ( \$W, \$H ) = ( scalar \$m-> [0]-> @*, scalar @\$m );
my ( @tmp, @ret );
for my \$i ( 0 .. \$H - 1 ) {
for my \$j ( 0 .. \$W - \$w ) {
for my \$k ( 0 .. \$w - 1 ) {
\$tmp[ \$i ][ \$j ] += \$m-> [ \$i ][ \$j + \$k ];
}
next if \$i < \$h - 1;
for my \$k ( 0 .. \$h - 1 ) {
\$ret[ \$i - \$h + 1 ][ \$j ] += \$tmp[ \$i - \$k ][ \$j ];
}
}
}
return \@ret
}

\$m = [
map [
map rand, 0 .. 149
], 0 .. 149
];

__END__

+
Time (s) vs. N (NxN submatrix, 150x150 matrix)
+----------------------------------------------------------+
|+       +      +       +      +       +       +      +    |
10 |-+                                                      +-|
|+                                                        +|
|+                                                        +|
|+                      A   A  A   A                      +|
|+                  A                  A                  +|
|+              A                          A              +|
|           A                                  A           |
|                                                          |
1 |-+                                               A      +-|
|+       A                                                +|
|+                                                        +|
|+                                                        +|
|+                                                    A   +|
|    A                                                     |
|+                                                        +|
|               B   B   B   B  B   B                       |
0.1 |-+      B  B                          B                 +-|
|+                                         B              +|
|+   B                                         B          +|
|+                                                B       +|
|+                                                        +|
|                                                          |
|+                                                    B   +|
|                                                          |
0.01 |-+                                                      +-|
|+       +      +       +      +       +       +      +   +|
+----------------------------------------------------------+
0       20     40      60     80     100     120    140
sms_WxH_Perl_4loops    A
sms_WxH_Perl_3loops    B
+-----+-------+-------+
| N   | A     | B     |
+-----+-------+-------+
| 10  | 0.250 | 0.053 |
| 20  | 0.809 | 0.091 |
| 30  | 1.528 | 0.116 |
| 40  | 2.288 | 0.134 |
| 50  | 2.988 | 0.144 |
| 60  | 3.512 | 0.147 |
| 70  | 3.828 | 0.144 |
| 80  | 3.862 | 0.134 |
| 90  | 3.628 | 0.122 |
| 100 | 3.122 | 0.103 |
| 110 | 2.453 | 0.088 |
| 120 | 1.675 | 0.066 |
| 130 | 0.900 | 0.044 |
| 140 | 0.284 | 0.022 |
+-----+-------+-------+

Switching to caching and depth of "3" pays off quite handsomely.

```###########################################################
###########################################################
###                                                     ###
###   Enough of "pure Perl". Show us PDL as promised!   ###
###                                                     ###
###########################################################
###########################################################
```

My initial (i.e. 2x2) PDL solution was:

```sub sms_2x2_PDL ( \$m ) {
\$m-> slice( '0:-2', '0:-2' ) +
\$m-> slice( '1:-1', '0:-2' ) +
\$m-> slice( '0:-2', '1:-1' ) +
\$m-> slice( '1:-1', '1:-1' )
}

```sub sms_WxH_PDL_naive ( \$m, \$w, \$h ) {
my \$ret = 0;

for my \$r ( 0 .. \$h - 1 ) {
my \$r_sel = "\$r:" . ( \$r - \$h );
for my \$c ( 0 .. \$w - 1 ) {
\$ret += \$m-> slice( "\$c:" . ( \$c - \$w ), \$r_sel )
}
}
return \$ret
}

It doesn't perform too great (see further), and looks like "grass snake and hedgehog cross-breed" -- neither Perl nor "pure" PDL, which is supposed to be loop-less. Then I (sneakily) browsed PWC repository for 248-2 PDL solutions, if any -- to "borrow" inspiration -- and found Pdlpp one, with which I had no previous experience. All the more interesting challenge to "translate" "4" and "3 loops" Perl code to Pdlpp:

```use Inline Pdlpp => <<'EOPP';

pp_def('sms_WxH_pdlpp_4loops',
Pars      => 'a(m,n); [o]b(p,q);',
OtherPars => 'PDL_Indx par_sm_w; PDL_Indx par_sm_h;',

RedoDimsCode => '\$SIZE(p) = \$SIZE(m) - \$COMP(par_sm_w) + 1;
\$SIZE(q) = \$SIZE(n) - \$COMP(par_sm_h) + 1;',
Code => '
PDL_Indx i, j, k, l, w, h, W, H;
w = \$COMP(par_sm_w);
h = \$COMP(par_sm_h);
W = \$SIZE(m);
H = \$SIZE(n);
for (i = 0; i < H - h + 1; i++) {
for (j = 0; j < W - w + 1; j++) {
for (k = 0; k < w; k++) {
for (l = 0; l < h; l++) {
\$b(p=>j, q=>i) += \$a(m=>j+k, n=>i+l);
}
}
}
}
',
);

pp_def('sms_WxH_pdlpp_3loops',
Pars      => 'a(m,n); [o]b(p,q); [t]t(p,n)',
OtherPars => 'PDL_Indx par_sm_w; PDL_Indx par_sm_h;',

RedoDimsCode => '\$SIZE(p) = \$SIZE(m) - \$COMP(par_sm_w) + 1;
\$SIZE(q) = \$SIZE(n) - \$COMP(par_sm_h) + 1;',
Code => '
PDL_Indx i, j, k, w, h, W, H;
w = \$COMP(par_sm_w);
h = \$COMP(par_sm_h);
W = \$SIZE(m);
H = \$SIZE(n);

for (i = 0; i < H; i++) {
for (j = 0; j < W - w + 1; j++) {
for (k = 0; k < w; k++) {
\$t(p=>j, n=>i) += \$a(m=>j+k, n=>i);
}
if (i >= h - 1) {
for (k = 0; k < h; k++) {
\$b(p=>j, q=>i-h+1) += \$t(p=>j, n=>i-k);
}
}
}
}
',
);

EOPP

So, yeah, it was a challenge, but was it fun? It's supposed to translate to XS and C and compile to native binary code, to be the fastest, right? Am I even qualified to see if I blundered somehow about it, above?

Not fun for sure. Wanted to use high-level array language, finished with low-level C-like thing. I then looked further through PDL assortment of standard functions -- and found "lags", which can slice an array into overlapping infixes. It is somewhat odd, because expects, as argument, not sliding window size, but something else to compute it (see further), and returns its result in reverse order -- hence I have to flip it back. Consider:

```sub sms_WxH_PDL_lags ( \$m, \$w, \$h ) {
my ( \$W, \$H ) = \$m-> dims;

return \$m
-> lags( 0, 1, 1 + \$W - \$w )
-> sumover
-> lags( 1, 1, 1 + \$H - \$h )
-> xchg( 0, 1 )
-> sumover
-> slice( '-1:0,-1:0' )
}

\$m = 1 + sequence 4, 3;
print \$m,
sms_WxH_PDL_lags( \$m, 2, 2 ),
sms_WxH_PDL_lags( \$m, 3, 3 ),
sms_WxH_PDL_lags( \$m, 4, 3 ),
sms_WxH_PDL_lags( \$m, 1, 1 );

__END__

[
[ 1  2  3  4]
[ 5  6  7  8]
[ 9 10 11 12]
]

[
[14 18 22]
[30 34 38]
]

[
[54 63]
]

[
[78]
]

[
[ 1  2  3  4]
[ 5  6  7  8]
[ 9 10 11 12]
]

Looks like my function behaves as expected.

```###########################################################
###########################################################
###                                                     ###
###                                                     ###
###########################################################
###########################################################
```
```+
Time (s) vs. N (NxN submatrix, PDL: Double D [1500,1500] matrix)
+-----------------------------------------------------------+
2 |-+  +      +      +      +       +      +      +      +  +-|
|                         C   C                             |
|                                 C                         |
|                      C                                    |
|                      D  D   D   D  C                      |
1.5 |-+  A             C                                      +-|
|                  D                 D                      |
|                                        C                  |
|               C                        D                  |
|                                                           |
|     B         D                           D               |
1 |-+                                                       +-|
|           C                                               |
|                                               C           |
|           D                                   D           |
|                                                           |
|    A                                             D        |
0.5 |-+      C                                                +-|
|        D                                                  |
|                                                           |
|    B                                                 D    |
|                                                           |
|    DC                                                     |
0 |-+  DD                                                   +-|
|    +      +      +      +       +      +      +      +    |
+-----------------------------------------------------------+
0     200    400    600     800    1000   1200   1400
sms_WxH_PDL_naive    A
sms_WxH_pdlpp_4loops    B
sms_WxH_pdlpp_3loops    C
sms_WxH_PDL_lags    D
+------+-------+-------+-------+-------+
| N    | A     | B     | C     | D     |
+------+-------+-------+-------+-------+
| 2    | 0.038 | 0.009 | 0.009 | 0.047 |
| 4    | 0.294 | 0.031 | 0.019 | 0.041 |
| 6    | 0.572 | 0.072 | 0.019 | 0.044 |
| 10   | 1.531 | 0.250 | 0.031 | 0.031 |
| 20   |       | 1.150 | 0.072 | 0.022 |
| 100  |       |       | 0.494 | 0.394 |
| 200  |       |       | 0.931 | 0.797 |
| 300  |       |       | 1.263 | 1.153 |
| 400  |       |       | 1.544 | 1.431 |
| 500  |       |       | 1.756 | 1.594 |
| 600  |       |       | 1.869 | 1.653 |
| 700  |       |       | 1.872 | 1.641 |
| 800  |       |       | 1.788 | 1.594 |
| 900  |       |       | 1.597 | 1.481 |
| 1000 |       |       | 1.362 | 1.288 |
| 1100 |       |       | 1.144 | 1.081 |
| 1200 |       |       | 0.841 | 0.809 |
| 1300 |       |       | 0.575 | 0.572 |
| 1400 |       |       | 0.291 | 0.272 |
+------+-------+-------+-------+-------+

I have no inclination to wait for many-many seconds (hours?) for "A" and "B" to draw their bell-like plots. Naive code ("A") is perhaps useless beyond 2x2 for large matrices, but for that my original 2x2 version can be used instead. "B" was probably useless to begin with; it's only there for completeness sake. What I'm really (pleasantly) surprised about -- my "lags" "D" solution beats PP/XS code i.e. "C", which, to my understanding, should do the same things verbatim at low level.

```###########################################################
###########################################################
###                                                     ###
###  But where's "sliding sum" solution as advertised?  ###
###                                                     ###
###########################################################
###########################################################
```

Ah-h, at last! Consider:

```sub _do_dimension ( \$m, \$w ) {
\$m -> slice([ 0, \$w - 1 ], [])
-> sumover
-> slice( '*' )
-> append(
\$w >= ( \$m-> dims )[ 0 ]
? empty
: \$m-> slice([ \$w, -1 ], []) -
\$m-> slice([ 0, -1 - \$w ], []))
-> cumusumover
-> xchg( 0, 1 )
}

sub sms_WxH_PDL_sliding ( \$m, \$w, \$h ) {
_do_dimension
_do_dimension( \$m, \$w ), \$h
}

\$m = 1 + sequence 4, 3;
print \$m,
sms_WxH_PDL_sliding( \$m, 2, 2 ),
sms_WxH_PDL_sliding( \$m, 3, 3 ),
sms_WxH_PDL_sliding( \$m, 4, 3 ),
sms_WxH_PDL_sliding( \$m, 1, 1 );

__END__

# same (correct) output as above, skipped

+
Time (s) vs. N (NxN submatrix, PDL: Double D [1500,1500] matrix)
+-----------------------------------------------------------+
|    +      +      +      +       +      +      +      +    |
|                             A                             |
|                         A                                 |
|                      A          A                         |
1.5 |-+                                  A                    +-|
|                  A                                        |
|                                                           |
|                                        A                  |
|                                                           |
|               A                                           |
|                                           A               |
1 |-+                                                       +-|
|                                                           |
|                                               A           |
|           A                                               |
|                                                           |
|                                                           |
|                                                  A        |
0.5 |-+                                                       +-|
|                                                           |
|        A                                                  |
|                                                      A    |
|                                                           |
|    B          B                                           |
|    BB  B  B      B   B  B   B   B  B          B           |
0 |-+                                      B  B      B   B  +-|
|    +      +      +      +       +      +      +      +    |
+-----------------------------------------------------------+
0     200    400    600     800    1000   1200   1400
sms_WxH_PDL_lags    A
sms_WxH_PDL_sliding    B
+------+-------+-------+
| N    | A     | B     |
+------+-------+-------+
| 2    | 0.025 | 0.100 |
| 4    | 0.034 | 0.075 |
| 6    | 0.016 | 0.084 |
| 10   | 0.034 | 0.091 |
| 20   | 0.028 | 0.084 |
| 100  | 0.369 | 0.081 |
| 200  | 0.800 | 0.078 |
| 300  | 1.122 | 0.116 |
| 400  | 1.416 | 0.050 |
| 500  | 1.572 | 0.053 |
| 600  | 1.644 | 0.022 |
| 700  | 1.675 | 0.025 |
| 800  | 1.587 | 0.019 |
| 900  | 1.456 | 0.019 |
| 1000 | 1.297 | 0.009 |
| 1100 | 1.047 | 0.009 |
| 1200 | 0.813 | 0.019 |
| 1300 | 0.566 | 0.006 |
| 1400 | 0.275 | 0.003 |
+------+-------+-------+
```###########################################################
###########################################################
###                                                     ###
###  It was trivial. CPAN surely has something already. ###
###            You have wasted your time.               ###
###                                                     ###
###########################################################
###########################################################
```

Eh? I've searched CPAN for "PDL sliding". There's the PDL::Apply. I have to wrap one of the functions it provides, to get "sliding submatrix sum", as

```use PDL::Apply 'apply_rolling';

sub sms_WxH_PDL_ar ( \$m, \$w, \$h ) {
\$m-> apply_rolling( \$w, 'sum' )
-> slice( [ \$w - 1, -1 ], [] )
-> xchg( 0, 1 )
-> apply_rolling( \$h, 'sum' )
-> slice( [ \$h - 1, -1 ], [] )
-> xchg( 0, 1 )
}

my \$m = random 150, 150;
\$t = time;
sms_WxH_PDL_ar( \$m, 2, 2 );
say time - \$t;
\$t = time;
sms_WxH_PDL_ar( \$m, 10, 10 );
say time - \$t;
\$t = time;
sms_WxH_PDL_ar( \$m, 75, 75 );
say time - \$t;

__END__

3.36784410476685
3.20836806297302
1.28170394897461

No plots/tables here. Note, it's back to 150x150 matrix i.e. "pure Perl" league, and hugely less performant than "pure Perl" non-sliding function at the top of the node. Did I use it wrong? Not sure what to make of it.

```###########################################################
###########################################################
###                                                     ###
###                  OK. Conclusions?                   ###
###                                                     ###
###########################################################
###########################################################
```
1. PDL is as great as Perl itself, and as fun to use/explore as ever. :-)
2. Maybe I'd use my sms_2x2_PDL for 2x2 sums, then sms_WxH_PDL_lags for relatively small submatrices, then switch to sms_WxH_PDL_sliding.
3. If you (i.e. me) are not qualified to write performant C code, don't expect to write PP/XS/C to perform better than your high-level language code, only by virtue of "it's C!" and "modern compiler will optimize it for me" (see "D" vs. "C" case above).
4. Before you (i.e. me) megalomaniacally "crank up" exercises, try looking for better algorithms to solve them. :-)

Replies are listed 'Best First'.
Re: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)
by choroba (Cardinal) on Dec 26, 2023 at 12:42 UTC
Great post! ++ What a shame it's an anonymous one.

Have you compared your solution with wlmb's solution which uses PDL, too?

I started with PDL, as well, but I got frustrated soon and submitted a nested loops solution.

map{substr\$_->[0],\$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

Yes, I have. His solution and my sms_2x2_PDL look almost the same, but there's a difference: he supplies a list of slices as argument to pdl() constructor. Such list, regardless of its length, uses almost zero resources, because slice, as many other e.g. PDL::Slice functions, creates virtual ndarray i.e. header pointing to original data. The constructor, however, builds fresh new ndarray -- and that even before any summation has begun. Probably, temporary pike in memory usage is negligible (*) in case of 2x2 submatrices. Usage for WxH case is perhaps impractical and only speculative, but, suppose 1500x1500 data, 10x10 frame -- which means 100 slices 1490*1490 each i.e. ~1,7 Gb temporary monster.

*: but why not measure its impact:

```sub sms_2x2_PDL_wlmb ( \$m ) {
pdl(
\$m-> slice( '0:-2,0:-2' ),
\$m-> slice( '1:-1,0:-2' ),
\$m-> slice( '0:-2,1:-1' ),
\$m-> slice( '1:-1,1:-1' ),
)-> mv( -1, 0 )-> sumover
}

__END__

Time (s) vs. N (2x2 submatrix, NxN matrix)

+-----------------------------------------------------------+
|+          +         +          +          +          +    |
|                                                      B    |
|                                                           |
|                                                           |
|                                                           |
1.5 |-+                                                       +-|
|                                                           |
|                                                           |
|                                                           |
|                                           B               |
|                                                           |
1 |-+                                                       +-|
|                                                           |
|                                                           |
|                                                           |
|                                                      A    |
|                                                           |
|                                B                          |
0.5 |-+                                                       +-|
|                                           A               |
|                                                           |
|                     B                                     |
|                 B              A                          |
|             B       A                                     |
|        B    A   A                                         |
0 |-+  B   A                                                +-|
|+          +         +          +          +          +    |
+-----------------------------------------------------------+
0         1000      2000       3000       4000       5000
sms_2x2_PDL    A
sms_2x2_PDL_wlmb    B

+------+-------+-------+
| N    | A     | B     |
+------+-------+-------+
| 400  | 0.002 | 0.011 |
| 800  | 0.006 | 0.041 |
| 1200 | 0.023 | 0.098 |
| 1600 | 0.061 | 0.180 |
| 2000 | 0.109 | 0.273 |
| 3000 | 0.238 | 0.613 |
| 4000 | 0.433 | 1.148 |
| 5000 | 0.738 | 1.758 |
+------+-------+-------+
Re: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)
by tybalt89 (Monsignor) on Dec 26, 2023 at 21:19 UTC

Interesting problem...
Since it's that time of year...

```q(May all your Christmases be white.) =~ s/Christmase/loop/r =~ s/whit
+e/implicit/r

Inspired by mention of "sliding".

```#!/usr/bin/perl

use strict; # https://theweeklychallenge.org/blog/perl-weekly-challeng
use warnings;
use List::AllUtils qw( zip_by reduce );
use Data::Dump qw( pp );

sub slide
{
my @new;
reduce { push @new, \$a + \$b; \$b } @_;
@new;
}

for ( [ [1,  2,  3,  4],
[5,  6,  7,  8],
[9, 10, 11, 12] ],
[ [1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1] ],
)
{
print 'Input: \$a = ', pp(\$_), "\n";
my @new = zip_by { [ @_ ] } map [ slide @\$_ ],
zip_by { [ @_ ] } map [ slide @\$_ ], @\$_;
print 'Output: \$b = ', pp(\@new), "\n\n";
}

Outputs

```Input: \$a = [[1 .. 4], [5 .. 8], [9 .. 12]]
Output: \$b = [[14, 18, 22], [30, 34, 38]]

Input: \$a = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
Output: \$b = [[2, 1, 0], [1, 2, 1], [0, 1, 2]]

And here's a WxH version. It's interesting what can be found in List::AllUtils...

```#!/usr/bin/perl

use strict; # https://theweeklychallenge.org/blog/perl-weekly-challeng
use warnings;
use List::AllUtils qw( sum zip_by reductions );
use Data::Dump qw( pp );

sub nslide # size, elements
{
my @q = splice @_, 0, shift;
return reductions { push @q, \$b; \$a + \$b - shift @q } sum(@q), @_;
}

my (\$width, \$height) = (2, 2);
for ( [ [1,  2,  3,  4],
[5,  6,  7,  8],
[9, 10, 11, 12] ],
[ [1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1] ],
)
{
print 'Input: \$a = ', pp(\$_), "\n";
my @new = zip_by { [ @_ ] } map [ nslide \$height, @\$_ ],
zip_by { [ @_ ] } map [ nslide \$width, @\$_ ], @\$_;
print 'Output: \$b = ', pp(\@new), "\n\n";
}
And here's a WxH version

Excellent. And here a slope for you to slide, sort of in splendid isolation, from a plot fanatic:

```sub sms_WxH_Perl_sliding_tybalt89 ( \$m, \$width, \$height ) {
my @new = zip_by { [ @_ ] } map [ nslide \$height, @\$_ ],
zip_by { [ @_ ] } map [ nslide \$width, @\$_ ], @\$m;
return \@new
}

__END__

Time (s) vs. N (NxN submatrix, 1500x1500 matrix)
+-----------------------------------------------------------+
1.8 |-+  +      +      +      +       +      +      +      +  +-|
|    A                                                      |
|    A                                                      |
1.6 |-+                                                       +-|
|        A                                                  |
|                                                           |
1.4 |-+                                                       +-|
|           A                                               |
|                                                           |
1.2 |-+             A                                         +-|
|                                                           |
|                  A                                        |
|                      A                                    |
1 |-+                                                       +-|
|                         A                                 |
|                                                           |
0.8 |-+                           A                           +-|
|                                                           |
|                                 A                         |
0.6 |-+                                  A                    +-|
|                                        A                  |
|                                                           |
0.4 |-+                                         A             +-|
|                                               A           |
|                                                  A        |
0.2 |-+                                                       +-|
|                                                      A    |
|    +      +      +      +       +      +      +      +    |
0 +-----------------------------------------------------------+
0     200    400    600     800    1000   1200   1400
sms_WxH_Perl_sliding_tybalt89    A
+------+-------+
| N    | A     |
+------+-------+
| 2    | 1.700 |
| 10   | 1.725 |
| 100  | 1.538 |
| 200  | 1.387 |
| 300  | 1.259 |
| 400  | 1.131 |
| 500  | 1.016 |
| 600  | 0.894 |
| 700  | 0.784 |
| 800  | 0.678 |
| 900  | 0.578 |
| 1000 | 0.484 |
| 1100 | 0.394 |
| 1200 | 0.309 |
| 1300 | 0.231 |
| 1400 | 0.153 |
+------+-------+
Re: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)
by wlmb (Novice) on Dec 27, 2023 at 20:19 UTC

Very nice post and tests. I found something similar to the sliding average.

```use v5.36;
use PDL;
use PDL::Image2D;
my \$w=4;
my \$h=5;
my \$m=sequence(10,10); # or any other matrix
say \$m->conv2d(ones(\$w,\$h))->slice([floor((\$w-1)/2),floor(-(\$w+1)/2)],
[floor((\$h-1)/2),floor(-(\$h+1)/2)]);

I expected box2d to work also, but either it doesn't work or I haven't understood it yet.

Thanks, nice find. By the way, perhaps I've chosen imprecise word ("sliding"). Maybe in math/science, any function, applied to overlapping infixes, is said to be applied in sliding/moving/rolling manner. What I meant instead, with the word, -- "applied using computationally efficient/cheap algorithm".

Looks like conv2d does very honest (therefore not efficient/cheap) 4-nested-loops style calculation, in PP/XS/C -- OK for very small kernels. Here's overcrowded plot, but B and E cases would suffice to show they are the same breed.

```sub sms_WxH_PDL_conv2d ( \$m, \$w, \$h ) {
\$m -> conv2d( ones( \$w, \$h ))
-> slice(
[floor((\$w-1)/2),floor(-(\$w+1)/2)],
[floor((\$h-1)/2),floor(-(\$h+1)/2)] )
}

__END__

Time (s) vs. N (NxN submatrix, PDL: Double D [1500,1500] matrix)
+-----------------------------------------------------------+
1 |-+  +      +      +      +       +      +      +      +  +-|
|                         A                                 |
|                                                      E    |
|                                                           |
|                                                           |
|                                                           |
0.8 |-+                                                       +-|
|                                                      B    |
|                                                           |
|                                                           |
|                                                           |
0.6 |-+                A                                      +-|
|                                        E                  |
|                                                           |
|                                                           |
|                                                           |
0.4 |-+                                      B                +-|
|                                 E                         |
|                                                           |
|           A                                               |
|                         E       B                         |
0.2 |-+                                                       +-|
|                  E      B                                 |
|        A  D                                               |
|    D   D  E      D      D       D      D             D    |
|    E   E  C      C      C       C      C             C    |
0 |-+  B   C                                                +-|
|    +      +      +      +       +      +      +      +    |
+-----------------------------------------------------------+
2      4      6      8       10     12     14     16
sms_WxH_PDL_naive    A
sms_WxH_pdlpp_4loops    B
sms_WxH_PDL_lags    C
sms_WxH_PDL_sliding    D
sms_WxH_PDL_conv2d    E
+----+-------+-------+-------+-------+-------+
| N  | A     | B     | C     | D     | E     |
+----+-------+-------+-------+-------+-------+
| 2  | 0.061 | 0.009 | 0.030 | 0.083 | 0.020 |
| 3  | 0.120 | 0.019 | 0.013 | 0.073 | 0.042 |
| 4  | 0.252 | 0.030 | 0.039 | 0.098 | 0.066 |
| 6  | 0.566 | 0.078 | 0.028 | 0.078 | 0.141 |
| 8  | 0.963 | 0.139 | 0.033 | 0.081 | 0.237 |
| 10 |       | 0.248 | 0.031 | 0.078 | 0.366 |
| 12 |       | 0.388 | 0.033 | 0.078 | 0.531 |
| 16 |       | 0.728 | 0.041 | 0.069 | 0.928 |
+----+-------+-------+-------+-------+-------+

For large submatrices one can use a Fourier transforms to perform the convolution, as in

```use v5.36;
use PDL;
use PDL::FFT;
my \$matrix=pdl(shift); # Format "[[m11,m21...],[m21...]...]"
my \$width=my \$w=shift;
my \$height=my \$h=shift;
my \$small=ones(\$w%2?\$w:\$w+1, \$h%2?\$h:\$h+1);
\$small->slice(-1).=0, ++\$w unless \$w%2; # zero row and/or column for e
+ven kernels
\$small->slice([],-1).=0, ++\$h unless \$h%2;
my \$kernel=kernctr(\$matrix, \$small); #full kernel
my \$result=\$matrix->copy;
\$result->fftconvolve(\$kernel);
say "\$matrix \$width \$height -> ",
\$result->slice([floor((\$width-1)/2),floor(-(\$width+1)/2)],
[floor((\$height-1)/2),floor(-(\$height+1)/2)]);

It would be interesting to also compare it.

Three "my" and one ")" missing.
Otherwise ++👍

Thanks. I corrected my previous post.

Re: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)
by jo37 (Deacon) on Dec 30, 2023 at 16:56 UTC

The referenced challenge was inspired by the example given in PDL::Broadcasting, especially by the usage of range. Accordingly my own solution looks similar to the example:

```use strict;
use warinings;
use PDL;
use PDL::NiceSlice;

sub submatrix_sum {
my \$m = pdl @_;

\$m->range(ndcoords(\$m(1:,1:)), 2)->reorder(2, 3, 0, 1)
->clump(2)->sumover;
}

It can easily be extended to WxH submatrices:

```sub submatrix_sum {
my \$w = shift;
my \$h = shift;
my \$m = pdl @_;

\$m->range(ndcoords(\$m(\$w - 1:, \$h - 1:), [\$w, \$h])->reorder(2, 3,
+0, 1)
->clump(2)->sumover;
}

Greetings,
-jo

\$gryYup\$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p\$

Thanks for great challenge, first of all (you were our taskmaster, right?) I didn't mention your solution, because it was kind of slow for comparisons. But now I see where it and challenge itself have originated. PDL is very TIMTOWTDI-ish, and range is ultimately important tool, to extract/address rectangular areas from ndarrays of any dimensions. But, eh-hm, don't you see that the POD you linked sings praises to wonders of broadcasting, which (i.e. broadcasting) you simply discarded? Broadcasting really only happens in this fragment:

```...-> sumover-> sumover

which you replaced with

```...-> clump(2)-> sumover

(Frankly, it's obvious I think that huge speed difference of Game-of-Life implementations in the linked tutorial is due to the way looping was generally performed rather than this "broadcasting" only, -- but perhaps that's how tutorials work.)

Consider:

```sub sms_WxH_PDL_range ( \$m, \$w, \$h ) {
my ( \$W, \$H ) = \$m-> dims;

\$m-> range( ndcoords( \$W - \$w + 1, \$H - \$h + 1 ), [ \$w, \$h ])
-> reorder( 2, 3, 0, 1 )
-> clump( 2 )
-> sumover
}

sub sms_WxH_PDL_range_b ( \$m, \$w, \$h ) {
my ( \$W, \$H ) = \$m-> dims;

\$m-> range( ndcoords( \$W - \$w + 1, \$H - \$h + 1 ), [ \$w, \$h ])
-> reorder( 2, 3, 0, 1 )
-> sumover
-> sumover
}

__END__

Time (s) vs. N (NxN submatrix, PDL: Double D [300,300] matrix)
+-----------------------------------------------------------+
|+          +         +          +          +          +    |
1.6 |-+                                                    A  +-|
|                                                           |
|                                                           |
1.4 |-+                                                       +-|
|                                                           |
1.2 |-+                                                       +-|
|                                           A               |
|                                                           |
1 |-+                                                       +-|
|                                                           |
|                                                      B    |
0.8 |-+                                                       +-|
|                                  A                        |
|                                                           |
0.6 |-+                                         B             +-|
|                                                           |
0.4 |-+                        A                              +-|
|                                  B                        |
|                     A                                     |
0.2 |-+               A        B                              +-|
|             A   B   B                                     |
|        A  B B                             D          D    |
0 |-+  D D D  D D   D   D    D       D        C          C  +-|
|+          +         +          +          +          +    |
+-----------------------------------------------------------+
0          5         10         15         20         25
sms_WxH_PDL_range    A
sms_WxH_PDL_range_b    B
sms_WxH_PDL_lags    C
sms_WxH_PDL_naive    D
+----+-------+-------+-------+-------+
| N  | A     | B     | C     | D     |
+----+-------+-------+-------+-------+
| 2  | 0.015 | 0.008 | 0.000 | 0.000 |
| 3  | 0.021 | 0.018 | 0.000 | 0.000 |
| 4  | 0.044 | 0.021 | 0.000 | 0.000 |
| 5  | 0.073 | 0.047 | 0.000 | 0.003 |
| 6  | 0.101 | 0.060 | 0.000 | 0.000 |
| 8  | 0.193 | 0.104 | 0.000 | 0.005 |
| 10 | 0.294 | 0.138 | 0.000 | 0.005 |
| 12 | 0.435 | 0.232 | 0.000 | 0.010 |
| 16 | 0.711 | 0.344 | 0.000 | 0.015 |
| 20 | 1.115 | 0.549 | 0.000 | 0.026 |
| 25 | 1.573 | 0.828 | 0.000 | 0.047 |
+----+-------+-------+-------+-------+

(I took liberty to use couple of numbers as args to ndcoords instead of matrix/slice, which only serves as source of these 2 numbers). Note, the matrix is now smaller than in previous tests. Both A and B versions are very much slower than the so far slowest "naive" variant. Though ndcoords builds a relatively large ndarray to feed to range, I think range is simply not written with speed/performance as its goal.

It's actually tempting to try to improve Game of Life PDL implementation from the tutorial:

```use strict;
use warnings;
use experimental qw/ say postderef signatures /;
use Time::HiRes 'time';

use PDL;
use PDL::NiceSlice;
use Test::PDL 'eq_pdl';

use constant STEPS => 100;
my \$x = zeroes( 200, 200 );

# Put in a simple glider.
\$x(1:3,1:3) .= pdl ( [1,1,1],
[0,0,1],
[0,1,0] );
my \$backup = \$x-> copy;

printf "Game of Life!\nMatrix: %s, %d generations\n",
\$x-> info, STEPS;

# Tutorial
my \$t = time;
my \$ct = 0;
for ( 1 .. STEPS ) {
my \$t_ = time;
# Calculate the number of neighbours per cell.
my \$n = \$x->range(ndcoords(\$x)-1,3,"periodic")->reorder(2,3,0,1);
\$n = \$n->sumover->sumover - \$x;
\$ct += time - \$t_;

# Calculate the next generation.
\$x = (((\$n == 2) + (\$n == 3))* \$x) + ((\$n==3) * !\$x);
}
printf "Tutorial: %0.3f s (core time: %0.3f)\n",
time - \$t, \$ct;

# "Lags"
my \$m = \$backup-> copy;
\$t = time;
\$ct = 0;
for ( 1 .. STEPS ) {
my \$t_ = time;
# Calculate the number of neighbours per cell.
my \$n = sms_GoL_lags( \$m ) - \$m;
\$ct += time - \$t_;

# Calculate the next generation.
\$m = (((\$n == 2) + (\$n == 3))* \$m) + ((\$n == 3) * !\$m);
}
printf "\"lags\":   %0.3f s (core time: %0.3f)\n",
time - \$t, \$ct;

die unless eq_pdl( \$x, \$m );

sub _do_dimension_GoL ( \$m ) {
\$m-> slice( -1 )-> glue( 0, \$m, \$m-> slice( 0 ))
-> lags( 0, 1, ( \$m-> dims )[0] )
-> sumover
-> slice( '', '-1:0' )
-> xchg( 0, 1 )
}

sub sms_GoL_lags ( \$m ) {
_do_dimension_GoL
_do_dimension_GoL \$m
}

__END__

Game of Life!
Matrix: PDL: Double D [200,200], 100 generations
Tutorial: 1.016 s (core time: 0.835)
"lags":   0.283 s (core time: 0.108)

Sorry about crude profiling/tests; and improvement is somewhat far from what I expected. Even with "core time" singled out -- because next gen calculation is not very efficient (e.g. \$n == 3 array is built twice), but that's another story -- which is "only" 8x better. Maybe all this glueing/appending to maintain constant matrix size and "wrap around" at edges takes its toll.

Hi Steve (assuming it's you),

I haven't spent any effort on optimizing the solution. Thank you for running the tests! It is interesting to see that ->clump(2)->sumover doesn't perform as good as ->sumover->sumover. Actually I didn't look up the example in PDL::Broadcasting but wrote my solution from memory. Looks like I'm getting older :-)

Greetings,
-jo

\$gryYup\$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p\$

Here is last (hopefully) instalment in this thread; it only concerns faster, compared to the PDL CPAN tutorial, "Game of Life" implementation. But maybe some features used are too advanced for a tutorial. Improvement comes from:

• To achieve wrap around, the ugly glueing in parent node is replaced with dicing along each axis. Very convenient function.
• Horrible expression to compute next generation (requires 8 operations on whole arrays) is replaced with shorter version (just 4).
```use strict;
use warnings;
use feature 'say';
use Time::HiRes 'time';

use PDL;
use PDL::NiceSlice;
use Test::PDL 'eq_pdl';

use constant {
WIDTH  => 20,
HEIGHT => 20,
STEPS  => 1000,
};
my \$x = zeroes long, WIDTH, HEIGHT;

# Put in a simple glider.
\$x(1:3,1:3) .= pdl ( [1,1,1],
[0,0,1],
[0,1,0] );
my \$backup = \$x-> copy;

printf "Game of Life!\nMatrix: %s, %d generations\n",
\$x-> info, STEPS;

# Tutorial
my \$t = time;
for ( 1 .. STEPS ) {
my \$t_ = time;
# Calculate the number of neighbours per cell.
my \$n = \$x->range(ndcoords(\$x)-1,3,"periodic")->reorder(2,3,0,1);
\$n = \$n->sumover->sumover - \$x;

# Calculate the next generation.
\$x = (((\$n == 2) + (\$n == 3))* \$x) + ((\$n == 3) * !\$x);
}
printf "Tutorial: %0.3f s\n", time - \$t;

# Faster
my \$m = \$backup-> copy;
\$t = time;
my \$wrap_w = pdl [ reverse WIDTH - 1, ( 0 .. WIDTH - 1 ), 0 ];
my \$wrap_h = pdl [ reverse HEIGHT - 1, ( 0 .. HEIGHT - 1 ), 0 ];
for ( 1 .. STEPS ) {
my \$n = \$m
-> dice_axis( 0, \$wrap_w )
-> lags( 0, 1, WIDTH )
-> sumover
-> dice_axis( 1, \$wrap_h )
-> lags( 1, 1, HEIGHT )
-> xchg( 0, 1 )
-> sumover;
\$n -= \$m;
\$m = ( \$n == 3 ) | \$m & ( \$n == 2 )
}
printf "Faster:   %0.3f s\n", time - \$t;

die unless eq_pdl( \$x, \$m );

__END__

Game of Life!
Matrix: PDL: Long D [20,20], 1000 generations
Tutorial: 0.341 s
Faster:   0.111 s

Matrix: PDL: Long D [200,200], 100 generations
Tutorial: 0.845 s
Faster:   0.086 s

Matrix: PDL: Long D [1000,1000], 20 generations
Tutorial: 4.422 s
Faster:   0.443 s

Matrix: PDL: Long D [2000,2000], 10 generations
Tutorial: 8.878 s
Faster:   0.872 s
Re: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)
by etj (Deacon) on Feb 15, 2024 at 20:45 UTC
A belated note. This:
```pp_def('sms_WxH_pdlpp_4loops',
Pars      => 'a(m,n); [o]b(p,q);',
OtherPars => 'PDL_Indx par_sm_w; PDL_Indx par_sm_h;',

RedoDimsCode => '\$SIZE(p) = \$SIZE(m) - \$COMP(par_sm_w) + 1;
\$SIZE(q) = \$SIZE(n) - \$COMP(par_sm_h) + 1;',
Code => '
PDL_Indx i, j, k, l, w, h, W, H;
w = \$COMP(par_sm_w);
h = \$COMP(par_sm_h);
W = \$SIZE(m);
H = \$SIZE(n);
for (i = 0; i < H - h + 1; i++) {
for (j = 0; j < W - w + 1; j++) {
for (k = 0; k < w; k++) {
for (l = 0; l < h; l++) {
\$b(p=>j, q=>i) += \$a(m=>j+k, n=>i+l);
}
}
}
}
',
);
Can use PP features much better (untested, illustrative):
```pp_def('sms_WxH_pdlpp_4loops',
Pars      => 'a(m,n); [o]b(p,q);',
OtherPars => 'PDL_Indx par_sm_w => w; PDL_Indx par_sm_h => h;',
RedoDimsCode => '\$SIZE(p) = \$SIZE(m) - \$SIZE(w) + 1;
\$SIZE(q) = \$SIZE(n) - \$SIZE(h) + 1;',
Code => 'loop(q,p,w,h) %{ \$b() += \$a(m=>p+w, n=>q+h); %}',
);
Features used above: OtherPars can set \$SIZE of dims not otherwise used. When something is a \$SIZE, you can loop() over it. If an index (named dim) is indexed by a variable of the same name, it can be omitted (as for \$b() above). Obviously I used the fact that you can give several loops at once, which is very neat. Finally, your loop conditions were recalculating the expressions used to set \$SIZE(p) etc.

Looking at the loop(), I suspect you want to switch the w and h loops to do everything as much as possible on the fastest-moving dimension. This algorithm would probably benefit a lot from tiling for maximum cache-hitting, which should probably become a proper PP feature. If someone wants to open a PDL issue with suggestions as to what the syntax would look like, it will probably get implemented and used on matmult.

Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://11156532]
Approved by Athanasius
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (5)
As of 2024-04-25 06:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?

No recent polls found