Thanks for the great challenge, first of all (you were our taskmaster, right?). I didn't mention your solution because it was rather slow for the comparisons. But now I see where it, and the challenge itself, originated. PDL is very TIMTOWTDI-ish, and range is an important tool for extracting/addressing rectangular areas from ndarrays of any dimensions. But, eh-hm, don't you see that the POD you linked sings praises to the wonders of broadcasting, which you simply discarded? Broadcasting really only happens in this fragment:

...-> sumover-> sumover

which you replaced with

...-> clump(2)-> sumover

(Frankly, I think it's obvious that the huge speed difference between the Game-of-Life implementations in the linked tutorial is due to the way looping was generally performed rather than to this "broadcasting" alone, but perhaps that's how tutorials work.)

Consider:

sub sms_WxH_PDL_range ( $m, $w, $h ) {
    my ( $W, $H ) = $m-> dims;
    $m-> range( ndcoords( $W - $w + 1, $H - $h + 1 ), [ $w, $h ])
      -> reorder( 2, 3, 0, 1 )
      -> clump( 2 )
      -> sumover
}

sub sms_WxH_PDL_range_b ( $m, $w, $h ) {
    my ( $W, $H ) = $m-> dims;
    $m-> range( ndcoords( $W - $w + 1, $H - $h + 1 ), [ $w, $h ])
      -> reorder( 2, 3, 0, 1 )
      -> sumover
      -> sumover
}

__END__

[Plot: Time (s) vs. N (NxN submatrix, PDL: Double D [300,300] matrix);
 x-axis N from 0 to 25, y-axis time from 0 to 1.6 s]

sms_WxH_PDL_range   A
sms_WxH_PDL_range_b B
sms_WxH_PDL_lags    C
sms_WxH_PDL_naive   D

+----+-------+-------+-------+-------+
| N  |   A   |   B   |   C   |   D   |
+----+-------+-------+-------+-------+
|  2 | 0.015 | 0.008 | 0.000 | 0.000 |
|  3 | 0.021 | 0.018 | 0.000 | 0.000 |
|  4 | 0.044 | 0.021 | 0.000 | 0.000 |
|  5 | 0.073 | 0.047 | 0.000 | 0.003 |
|  6 | 0.101 | 0.060 | 0.000 | 0.000 |
|  8 | 0.193 | 0.104 | 0.000 | 0.005 |
| 10 | 0.294 | 0.138 | 0.000 | 0.005 |
| 12 | 0.435 | 0.232 | 0.000 | 0.010 |
| 16 | 0.711 | 0.344 | 0.000 | 0.015 |
| 20 | 1.115 | 0.549 | 0.000 | 0.026 |
| 25 | 1.573 | 0.828 | 0.000 | 0.047 |
+----+-------+-------+-------+-------+

(I took the liberty of passing a couple of numbers as args to ndcoords instead of a matrix/slice, which only serves as the source of these 2 numbers.) Note that the matrix is now smaller than in the previous tests. Both the A and B versions are very much slower than the so far slowest "naive" variant. Though ndcoords builds a relatively large ndarray to feed to range, I think range is simply not written with speed/performance as its goal.
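To make the shapes involved more tangible, here is a minimal sketch (assuming a tiny 5x4 test matrix and a 2x3 window, both chosen only for illustration) that builds the window index from two plain numbers and checks that the clump(2)->sumover and sumover->sumover tails agree. The variable names are mine, not from the benchmark.

```perl
use strict;
use warnings;
use PDL;

my $m = sequence( 5, 4 );           # 5x4 test matrix, values 0 .. 19
my ( $w, $h ) = ( 2, 3 );           # window width and height (illustrative)
my ( $W, $H ) = $m-> dims;

# ndcoords only needs the two extents, not an actual matrix.
my $idx = ndcoords( $W - $w + 1, $H - $h + 1 );    # dims [2,4,2]

my $win = $m-> range( $idx, [ $w, $h ])             # dims [4,2,2,3]
             -> reorder( 2, 3, 0, 1 );               # dims [2,3,4,2]

my $sum_a = $win-> clump( 2 )-> sumover;   # variant A: flatten window, sum once
my $sum_b = $win-> sumover-> sumover;      # variant B: sum each window dim in turn

print "A: $sum_a\nB: $sum_b\n";
print all( $sum_a == $sum_b ) ? "same\n" : "different\n";
```

Both results have dims [4,2], one sum per window position; only the way the window dimensions are collapsed differs.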

It's actually tempting to try to improve the Game of Life PDL implementation from the tutorial:

use strict;
use warnings;
use experimental qw/ say postderef signatures /;
use Time::HiRes 'time';
use PDL;
use PDL::NiceSlice;
use Test::PDL 'eq_pdl';

use constant STEPS => 100;

my $x = zeroes( 200, 200 );

# Put in a simple glider.
$x(1:3,1:3) .= pdl ( [1,1,1],
                     [0,0,1],
                     [0,1,0] );

my $backup = $x-> copy;

printf "Game of Life!\nMatrix: %s, %d generations\n",
    $x-> info, STEPS;

# Tutorial
my $t = time;
my $ct = 0;
for ( 1 .. STEPS ) {
    my $t_ = time;

    # Calculate the number of neighbours per cell.
    my $n = $x->range(ndcoords($x)-1,3,"periodic")->reorder(2,3,0,1);
    $n = $n->sumover->sumover - $x;

    $ct += time - $t_;

    # Calculate the next generation.
    $x = ((($n == 2) + ($n == 3))* $x) + (($n==3) * !$x);
}

printf "Tutorial: %0.3f s (core time: %0.3f)\n", time - $t, $ct;

# "Lags"
my $m = $backup-> copy;
$t = time;
$ct = 0;
for ( 1 .. STEPS ) {
    my $t_ = time;

    # Calculate the number of neighbours per cell.
    my $n = sms_GoL_lags( $m ) - $m;

    $ct += time - $t_;

    # Calculate the next generation.
    $m = ((($n == 2) + ($n == 3))* $m) + (($n == 3) * !$m);
}

printf "\"lags\": %0.3f s (core time: %0.3f)\n", time - $t, $ct;

die unless eq_pdl( $x, $m );

sub _do_dimension_GoL ( $m ) {
    $m-> slice( -1 )-> glue( 0, $m, $m-> slice( 0 ))
      -> lags( 0, 1, ( $m-> dims )[0] )
      -> sumover
      -> slice( '', '-1:0' )
      -> xchg( 0, 1 )
}

sub sms_GoL_lags ( $m ) {
    _do_dimension_GoL
    _do_dimension_GoL $m
}

__END__

Game of Life!
Matrix: PDL: Double D [200,200], 100 generations
Tutorial: 1.016 s (core time: 0.835)
"lags": 0.283 s (core time: 0.108)

Sorry about the crude profiling/tests; the improvement is somewhat short of what I expected. Even with "core time" singled out (because the next-generation calculation is not very efficient, e.g. the $n == 3 array is built twice, but that's another story), it is "only" 8x better. Maybe all this glueing/appending to maintain a constant matrix size and "wrap around" at the edges takes its toll.
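As for the next-generation step, a minimal sketch of building the $n == 3 mask only once could look like the following. The helper name next_gen_once is hypothetical, and the 3x3 input is just a vertical blinker with neighbour counts worked out by hand for a non-wrapping grid; I have not timed it, so treat it as an illustration of the idea rather than a measured improvement.

```perl
use strict;
use warnings;
use PDL;

# Hypothetical helper (not part of the benchmark): same rule as in the
# loops above, but the ($n == 3) mask is computed once and reused.
sub next_gen_once {
    my ( $x, $n ) = @_;
    my $three = $n == 3;
    return ( ( ( $n == 2 ) + $three ) * $x ) + ( $three * !$x );
}

# Vertical blinker on a 3x3 grid and its neighbour counts (no wrap-around).
my $x = pdl( [0,1,0], [0,1,0], [0,1,0] );
my $n = pdl( [2,1,2], [3,2,3], [2,1,2] );

# The expression as written in the loops, for comparison.
my $orig = ((($n == 2) + ($n == 3)) * $x) + (($n == 3) * !$x);
my $once = next_gen_once( $x, $n );

print $once;
print all( $orig == $once ) ? "same\n" : "different\n";
```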

