http://www.perlmonks.org?node_id=11156532

```Task 2: Submatrix Sum
Submitted by: Jorg Sommrey

You are given a NxM matrix A of integers.

Write a script to construct a (N-1)x(M-1) matrix B
having elements that are the sum over the 2x2 submatrices of A,

b[i,k] = a[i,k] + a[i,k+1] + a[i+1,k] + a[i+1,k+1]

Example 1

Input: \$a = [
[1,  2,  3,  4],
[5,  6,  7,  8],
[9, 10, 11, 12]
]

Output: \$b = [
[14, 18, 22],
[30, 34, 38]
]

I tried to crank it up for entertainment -- to solve not "2x2", but arbitrary "WxH" sub-matrices of (relatively) large matrices. Sadly, it was too late into many stupid plots and tables, when I realized, that excessive summation isn't required at all. Sliding (moving) sum (or average, min/max, etc.) is a well known concept -- duh! not to un-educated me, alas. And so maybe "2x2", not "WxH", in the task, was selected not only because PWC tries to accommodate new learners. Anyway, I implemented "sliding submatrix sums" algorithm only at a later stage, in PDL, and didn't "backport" it to pure Perl (nor Pdlpp) because had already been somewhat fed-up with this 248-2, but decided to present sanitized results as a meditation.

(This is "meditation", there's no clean runnable complete script as such. Everything ended up in a kind of messy draft-like state. Benchmarking subroutine from my previous mediation was tweaked and re-used. No testing for correct input e.g. if requested sub-matrix size is larger than matrix. "You are given array of integers" in PWC task is as unavoidable as "Once upon a time" in a fairy tale, and was ignored -- all data below are doubles. That said, I'll gladly provide any details if requested.)

Almost all Perl solutions of "PWC 248-2" utilize 2 nested loops and then sum the 4 terms, see the task as stated. Just a few explicitly expose 4 nested loops, where 2 innermost of them loop just over (0, 1) each. It penalizes performance for these "few" solutions by about a factor of 2, in my tests, but, looking at their code, it is tempting to use it as template for WxH submatrix sums. However, it quickly became clear that depth of 3, not 4, of nested loops is enough if sums of partial rows are re-used i.e. stored in temporary array. Consider:

```sub sms_WxH_Perl_4loops( \$m, \$w, \$h ) {
my ( \$W, \$H ) = ( scalar \$m-> [0]-> @*, scalar @\$m );
my @ret;
for my \$i ( 0 .. \$H - \$h ) {
for my \$j ( 0 .. \$W - \$w ) {
for my \$k ( 0 .. \$w - 1 ) {
for my \$l ( 0 .. \$h - 1 ) {
\$ret[ \$i ][ \$j ] += \$m-> [ \$i + \$l ][ \$j + \$k ];
}
}
}
}
return \@ret
}

sub sms_WxH_Perl_3loops( \$m, \$w, \$h ) {
my ( \$W, \$H ) = ( scalar \$m-> [0]-> @*, scalar @\$m );
my ( @tmp, @ret );
for my \$i ( 0 .. \$H - 1 ) {
for my \$j ( 0 .. \$W - \$w ) {
for my \$k ( 0 .. \$w - 1 ) {
\$tmp[ \$i ][ \$j ] += \$m-> [ \$i ][ \$j + \$k ];
}
next if \$i < \$h - 1;
for my \$k ( 0 .. \$h - 1 ) {
\$ret[ \$i - \$h + 1 ][ \$j ] += \$tmp[ \$i - \$k ][ \$j ];
}
}
}
return \@ret
}

\$m = [
map [
map rand, 0 .. 149
], 0 .. 149
];

__END__

+
Time (s) vs. N (NxN submatrix, 150x150 matrix)
+----------------------------------------------------------+
|+       +      +       +      +       +       +      +    |
10 |-+                                                      +-|
|+                                                        +|
|+                                                        +|
|+                      A   A  A   A                      +|
|+                  A                  A                  +|
|+              A                          A              +|
|           A                                  A           |
|                                                          |
1 |-+                                               A      +-|
|+       A                                                +|
|+                                                        +|
|+                                                        +|
|+                                                    A   +|
|    A                                                     |
|+                                                        +|
|               B   B   B   B  B   B                       |
0.1 |-+      B  B                          B                 +-|
|+                                         B              +|
|+   B                                         B          +|
|+                                                B       +|
|+                                                        +|
|                                                          |
|+                                                    B   +|
|                                                          |
0.01 |-+                                                      +-|
|+       +      +       +      +       +       +      +   +|
+----------------------------------------------------------+
0       20     40      60     80     100     120    140
sms_WxH_Perl_4loops    A
sms_WxH_Perl_3loops    B
+-----+-------+-------+
| N   | A     | B     |
+-----+-------+-------+
| 10  | 0.250 | 0.053 |
| 20  | 0.809 | 0.091 |
| 30  | 1.528 | 0.116 |
| 40  | 2.288 | 0.134 |
| 50  | 2.988 | 0.144 |
| 60  | 3.512 | 0.147 |
| 70  | 3.828 | 0.144 |
| 80  | 3.862 | 0.134 |
| 90  | 3.628 | 0.122 |
| 100 | 3.122 | 0.103 |
| 110 | 2.453 | 0.088 |
| 120 | 1.675 | 0.066 |
| 130 | 0.900 | 0.044 |
| 140 | 0.284 | 0.022 |
+-----+-------+-------+

Switching to caching and depth of "3" pays off quite handsomely.

```###########################################################
###########################################################
###                                                     ###
###   Enough of "pure Perl". Show us PDL as promised!   ###
###                                                     ###
###########################################################
###########################################################
```

My initial (i.e. 2x2) PDL solution was:

```sub sms_2x2_PDL ( \$m ) {
\$m-> slice( '0:-2', '0:-2' ) +
\$m-> slice( '1:-1', '0:-2' ) +
\$m-> slice( '0:-2', '1:-1' ) +
\$m-> slice( '1:-1', '1:-1' )
}

```sub sms_WxH_PDL_naive ( \$m, \$w, \$h ) {
my \$ret = 0;

for my \$r ( 0 .. \$h - 1 ) {
my \$r_sel = "\$r:" . ( \$r - \$h );
for my \$c ( 0 .. \$w - 1 ) {
\$ret += \$m-> slice( "\$c:" . ( \$c - \$w ), \$r_sel )
}
}
return \$ret
}

It doesn't perform too great (see further), and looks like "grass snake and hedgehog cross-breed" -- neither Perl nor "pure" PDL, which is supposed to be loop-less. Then I (sneakily) browsed PWC repository for 248-2 PDL solutions, if any -- to "borrow" inspiration -- and found Pdlpp one, with which I had no previous experience. All the more interesting challenge to "translate" "4" and "3 loops" Perl code to Pdlpp:

```use Inline Pdlpp => <<'EOPP';

pp_def('sms_WxH_pdlpp_4loops',
Pars      => 'a(m,n); [o]b(p,q);',
OtherPars => 'PDL_Indx par_sm_w; PDL_Indx par_sm_h;',

RedoDimsCode => '\$SIZE(p) = \$SIZE(m) - \$COMP(par_sm_w) + 1;
\$SIZE(q) = \$SIZE(n) - \$COMP(par_sm_h) + 1;',
Code => '
PDL_Indx i, j, k, l, w, h, W, H;
w = \$COMP(par_sm_w);
h = \$COMP(par_sm_h);
W = \$SIZE(m);
H = \$SIZE(n);
for (i = 0; i < H - h + 1; i++) {
for (j = 0; j < W - w + 1; j++) {
for (k = 0; k < w; k++) {
for (l = 0; l < h; l++) {
\$b(p=>j, q=>i) += \$a(m=>j+k, n=>i+l);
}
}
}
}
',
);

pp_def('sms_WxH_pdlpp_3loops',
Pars      => 'a(m,n); [o]b(p,q); [t]t(p,n)',
OtherPars => 'PDL_Indx par_sm_w; PDL_Indx par_sm_h;',

RedoDimsCode => '\$SIZE(p) = \$SIZE(m) - \$COMP(par_sm_w) + 1;
\$SIZE(q) = \$SIZE(n) - \$COMP(par_sm_h) + 1;',
Code => '
PDL_Indx i, j, k, w, h, W, H;
w = \$COMP(par_sm_w);
h = \$COMP(par_sm_h);
W = \$SIZE(m);
H = \$SIZE(n);

for (i = 0; i < H; i++) {
for (j = 0; j < W - w + 1; j++) {
for (k = 0; k < w; k++) {
\$t(p=>j, n=>i) += \$a(m=>j+k, n=>i);
}
if (i >= h - 1) {
for (k = 0; k < h; k++) {
\$b(p=>j, q=>i-h+1) += \$t(p=>j, n=>i-k);
}
}
}
}
',
);

EOPP

So, yeah, it was a challenge, but was it fun? It's supposed to translate to XS and C and compile to native binary code, to be the fastest, right? Am I even qualified to see if I blundered somehow about it, above?

Not fun for sure. Wanted to use high-level array language, finished with low-level C-like thing. I then looked further through PDL assortment of standard functions -- and found "lags", which can slice an array into overlapping infixes. It is somewhat odd, because expects, as argument, not sliding window size, but something else to compute it (see further), and returns its result in reverse order -- hence I have to flip it back. Consider:

```sub sms_WxH_PDL_lags ( \$m, \$w, \$h ) {
my ( \$W, \$H ) = \$m-> dims;

return \$m
-> lags( 0, 1, 1 + \$W - \$w )
-> sumover
-> lags( 1, 1, 1 + \$H - \$h )
-> xchg( 0, 1 )
-> sumover
-> slice( '-1:0,-1:0' )
}

\$m = 1 + sequence 4, 3;
print \$m,
sms_WxH_PDL_lags( \$m, 2, 2 ),
sms_WxH_PDL_lags( \$m, 3, 3 ),
sms_WxH_PDL_lags( \$m, 4, 3 ),
sms_WxH_PDL_lags( \$m, 1, 1 );

__END__

[
[ 1  2  3  4]
[ 5  6  7  8]
[ 9 10 11 12]
]

[
[14 18 22]
[30 34 38]
]

[
[54 63]
]

[
[78]
]

[
[ 1  2  3  4]
[ 5  6  7  8]
[ 9 10 11 12]
]

Looks like my function behaves as expected.

```###########################################################
###########################################################
###                                                     ###
###                                                     ###
###########################################################
###########################################################
```
```+
Time (s) vs. N (NxN submatrix, PDL: Double D [1500,1500] matrix)
+-----------------------------------------------------------+
2 |-+  +      +      +      +       +      +      +      +  +-|
|                         C   C                             |
|                                 C                         |
|                      C                                    |
|                      D  D   D   D  C                      |
1.5 |-+  A             C                                      +-|
|                  D                 D                      |
|                                        C                  |
|               C                        D                  |
|                                                           |
|     B         D                           D               |
1 |-+                                                       +-|
|           C                                               |
|                                               C           |
|           D                                   D           |
|                                                           |
|    A                                             D        |
0.5 |-+      C                                                +-|
|        D                                                  |
|                                                           |
|    B                                                 D    |
|                                                           |
|    DC                                                     |
0 |-+  DD                                                   +-|
|    +      +      +      +       +      +      +      +    |
+-----------------------------------------------------------+
0     200    400    600     800    1000   1200   1400
sms_WxH_PDL_naive    A
sms_WxH_pdlpp_4loops    B
sms_WxH_pdlpp_3loops    C
sms_WxH_PDL_lags    D
+------+-------+-------+-------+-------+
| N    | A     | B     | C     | D     |
+------+-------+-------+-------+-------+
| 2    | 0.038 | 0.009 | 0.009 | 0.047 |
| 4    | 0.294 | 0.031 | 0.019 | 0.041 |
| 6    | 0.572 | 0.072 | 0.019 | 0.044 |
| 10   | 1.531 | 0.250 | 0.031 | 0.031 |
| 20   |       | 1.150 | 0.072 | 0.022 |
| 100  |       |       | 0.494 | 0.394 |
| 200  |       |       | 0.931 | 0.797 |
| 300  |       |       | 1.263 | 1.153 |
| 400  |       |       | 1.544 | 1.431 |
| 500  |       |       | 1.756 | 1.594 |
| 600  |       |       | 1.869 | 1.653 |
| 700  |       |       | 1.872 | 1.641 |
| 800  |       |       | 1.788 | 1.594 |
| 900  |       |       | 1.597 | 1.481 |
| 1000 |       |       | 1.362 | 1.288 |
| 1100 |       |       | 1.144 | 1.081 |
| 1200 |       |       | 0.841 | 0.809 |
| 1300 |       |       | 0.575 | 0.572 |
| 1400 |       |       | 0.291 | 0.272 |
+------+-------+-------+-------+-------+

I have no inclination to wait for many-many seconds (hours?) for "A" and "B" to draw their bell-like plots. Naive code ("A") is perhaps useless beyond 2x2 for large matrices, but for that my original 2x2 version can be used instead. "B" was probably useless to begin with; it's only there for completeness sake. What I'm really (pleasantly) surprised about -- my "lags" "D" solution beats PP/XS code i.e. "C", which, to my understanding, should do the same things verbatim at low level.

```###########################################################
###########################################################
###                                                     ###
###  But where's "sliding sum" solution as advertised?  ###
###                                                     ###
###########################################################
###########################################################
```

Ah-h, at last! Consider:

```sub _do_dimension ( \$m, \$w ) {
\$m -> slice([ 0, \$w - 1 ], [])
-> sumover
-> slice( '*' )
-> append(
\$w >= ( \$m-> dims )[ 0 ]
? empty
: \$m-> slice([ \$w, -1 ], []) -
\$m-> slice([ 0, -1 - \$w ], []))
-> cumusumover
-> xchg( 0, 1 )
}

sub sms_WxH_PDL_sliding ( \$m, \$w, \$h ) {
_do_dimension
_do_dimension( \$m, \$w ), \$h
}

\$m = 1 + sequence 4, 3;
print \$m,
sms_WxH_PDL_sliding( \$m, 2, 2 ),
sms_WxH_PDL_sliding( \$m, 3, 3 ),
sms_WxH_PDL_sliding( \$m, 4, 3 ),
sms_WxH_PDL_sliding( \$m, 1, 1 );

__END__

# same (correct) output as above, skipped

+
Time (s) vs. N (NxN submatrix, PDL: Double D [1500,1500] matrix)
+-----------------------------------------------------------+
|    +      +      +      +       +      +      +      +    |
|                             A                             |
|                         A                                 |
|                      A          A                         |
1.5 |-+                                  A                    +-|
|                  A                                        |
|                                                           |
|                                        A                  |
|                                                           |
|               A                                           |
|                                           A               |
1 |-+                                                       +-|
|                                                           |
|                                               A           |
|           A                                               |
|                                                           |
|                                                           |
|                                                  A        |
0.5 |-+                                                       +-|
|                                                           |
|        A                                                  |
|                                                      A    |
|                                                           |
|    B          B                                           |
|    BB  B  B      B   B  B   B   B  B          B           |
0 |-+                                      B  B      B   B  +-|
|    +      +      +      +       +      +      +      +    |
+-----------------------------------------------------------+
0     200    400    600     800    1000   1200   1400
sms_WxH_PDL_lags    A
sms_WxH_PDL_sliding    B
+------+-------+-------+
| N    | A     | B     |
+------+-------+-------+
| 2    | 0.025 | 0.100 |
| 4    | 0.034 | 0.075 |
| 6    | 0.016 | 0.084 |
| 10   | 0.034 | 0.091 |
| 20   | 0.028 | 0.084 |
| 100  | 0.369 | 0.081 |
| 200  | 0.800 | 0.078 |
| 300  | 1.122 | 0.116 |
| 400  | 1.416 | 0.050 |
| 500  | 1.572 | 0.053 |
| 600  | 1.644 | 0.022 |
| 700  | 1.675 | 0.025 |
| 800  | 1.587 | 0.019 |
| 900  | 1.456 | 0.019 |
| 1000 | 1.297 | 0.009 |
| 1100 | 1.047 | 0.009 |
| 1200 | 0.813 | 0.019 |
| 1300 | 0.566 | 0.006 |
| 1400 | 0.275 | 0.003 |
+------+-------+-------+
```###########################################################
###########################################################
###                                                     ###
###  It was trivial. CPAN surely has something already. ###
###            You have wasted your time.               ###
###                                                     ###
###########################################################
###########################################################
```

Eh? I've searched CPAN for "PDL sliding". There's the PDL::Apply. I have to wrap one of the functions it provides, to get "sliding submatrix sum", as

```use PDL::Apply 'apply_rolling';

sub sms_WxH_PDL_ar ( \$m, \$w, \$h ) {
\$m-> apply_rolling( \$w, 'sum' )
-> slice( [ \$w - 1, -1 ], [] )
-> xchg( 0, 1 )
-> apply_rolling( \$h, 'sum' )
-> slice( [ \$h - 1, -1 ], [] )
-> xchg( 0, 1 )
}

my \$m = random 150, 150;
\$t = time;
sms_WxH_PDL_ar( \$m, 2, 2 );
say time - \$t;
\$t = time;
sms_WxH_PDL_ar( \$m, 10, 10 );
say time - \$t;
\$t = time;
sms_WxH_PDL_ar( \$m, 75, 75 );
say time - \$t;

__END__

3.36784410476685
3.20836806297302
1.28170394897461

No plots/tables here. Note, it's back to 150x150 matrix i.e. "pure Perl" league, and hugely less performant than "pure Perl" non-sliding function at the top of the node. Did I use it wrong? Not sure what to make of it.

```###########################################################
###########################################################
###                                                     ###
###                  OK. Conclusions?                   ###
###                                                     ###
###########################################################
###########################################################
```
1. PDL is as great as Perl itself, and as fun to use/explore as ever. :-)
2. Maybe I'd use my sms_2x2_PDL for 2x2 sums, then sms_WxH_PDL_lags for relatively small submatrices, then switch to sms_WxH_PDL_sliding.
3. If you (i.e. me) are not qualified to write performant C code, don't expect to write PP/XS/C to perform better than your high-level language code, only by virtue of "it's C!" and "modern compiler will optimize it for me" (see "D" vs. "C" case above).
4. Before you (i.e. me) megalomaniacally "crank up" exercises, try looking for better algorithms to solve them. :-)