
Re: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)

by jo37 (Deacon)
on Dec 30, 2023 at 16:56 UTC


in reply to Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)

The referenced challenge was inspired by the example given in PDL::Broadcasting, especially by the usage of range. Accordingly, my own solution looks similar to the example:

use strict;
use warnings;
use PDL;
use PDL::NiceSlice;

sub submatrix_sum {
    my $m = pdl @_;

    $m->range(ndcoords($m(1:,1:)), 2)->reorder(2, 3, 0, 1)
      ->clump(2)->sumover;
}
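For illustration (not part of the original post; the output values are worked out by hand, so treat the exact layout as approximate), a minimal invocation on a 3x3 matrix:

print submatrix_sum([1, 2, 3], [4, 5, 6], [7, 8, 9]);

# All four 2x2 sliding sums; should print something like
# [
#  [12 16]
#  [24 28]
# ]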

It can easily be extended to WxH submatrices:

sub submatrix_sum {
    my $w = shift;
    my $h = shift;
    my $m = pdl @_;

    $m->range(ndcoords($m($w - 1:, $h - 1:)), [$w, $h])->reorder(2, 3, 0, 1)
      ->clump(2)->sumover;
}

Greetings,
-jo

$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

Replies are listed 'Best First'.
Re^2: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)
by Anonymous Monk on Dec 31, 2023 at 16:07 UTC

    Thanks for a great challenge, first of all (you were our taskmaster, right?). I didn't mention your solution because it was kind of slow for the comparisons. But now I see where it, and the challenge itself, originated. PDL is very TIMTOWTDI-ish, and range is an extremely important tool for extracting/addressing rectangular areas from ndarrays of any dimensions. But, eh-hm, don't you see that the POD you linked sings the praises of the wonders of broadcasting, which (i.e. broadcasting) you simply discarded? Broadcasting really only happens in this fragment:

    ...-> sumover-> sumover

    which you replaced with

    ...-> clump(2)-> sumover
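    (Just to make the swap explicit -- a minimal sketch, not from either post, using a small 4-D ndarray to show the two reductions agree:)

    use strict;
    use warnings;
    use PDL;
    use Test::PDL 'eq_pdl';

    # The two leading dims play the role of the submatrix dims to be reduced away.
    my $t = sequence( 3, 4, 5, 6 );

    my $via_clump    = $t-> clump( 2 )-> sumover;   # fuse dims 0 and 1, reduce once
    my $via_sumovers = $t-> sumover-> sumover;      # reduce dim 0, then the new dim 0

    print eq_pdl( $via_clump, $via_sumovers ) ? "same\n" : "differ\n";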

    (Frankly, I think it's obvious that the huge speed difference between the Game-of-Life implementations in the linked tutorial comes from the way the looping is performed in general, rather than from "broadcasting" alone -- but perhaps that's how tutorials work.)

    Consider:

    sub sms_WxH_PDL_range ( $m, $w, $h ) {
        my ( $W, $H ) = $m-> dims;
        $m-> range( ndcoords( $W - $w + 1, $H - $h + 1 ), [ $w, $h ])
          -> reorder( 2, 3, 0, 1 )
          -> clump( 2 )
          -> sumover
    }

    sub sms_WxH_PDL_range_b ( $m, $w, $h ) {
        my ( $W, $H ) = $m-> dims;
        $m-> range( ndcoords( $W - $w + 1, $H - $h + 1 ), [ $w, $h ])
          -> reorder( 2, 3, 0, 1 )
          -> sumover
          -> sumover
    }

    __END__
    [ASCII plot omitted: Time (s) vs. N (NxN submatrix, PDL: Double D [300,300] matrix);
     curves: sms_WxH_PDL_range (A), sms_WxH_PDL_range_b (B), sms_WxH_PDL_lags (C), sms_WxH_PDL_naive (D)]

    +----+-------+-------+-------+-------+
    | N  |   A   |   B   |   C   |   D   |
    +----+-------+-------+-------+-------+
    |  2 | 0.015 | 0.008 | 0.000 | 0.000 |
    |  3 | 0.021 | 0.018 | 0.000 | 0.000 |
    |  4 | 0.044 | 0.021 | 0.000 | 0.000 |
    |  5 | 0.073 | 0.047 | 0.000 | 0.003 |
    |  6 | 0.101 | 0.060 | 0.000 | 0.000 |
    |  8 | 0.193 | 0.104 | 0.000 | 0.005 |
    | 10 | 0.294 | 0.138 | 0.000 | 0.005 |
    | 12 | 0.435 | 0.232 | 0.000 | 0.010 |
    | 16 | 0.711 | 0.344 | 0.000 | 0.015 |
    | 20 | 1.115 | 0.549 | 0.000 | 0.026 |
    | 25 | 1.573 | 0.828 | 0.000 | 0.047 |
    +----+-------+-------+-------+-------+

    (I took the liberty of passing a couple of plain numbers as args to ndcoords instead of the matrix/slice, which only serves as a source of those two numbers.) Note that the matrix is now smaller than in the previous tests. Both the A and B versions are very much slower than the so-far-slowest "naive" variant. Though ndcoords builds a relatively large ndarray to feed to range, I think range simply wasn't written with speed/performance as its goal.
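    (A tiny sketch of that substitution -- an illustration added here, not from the post -- showing that ndcoords only cares about the dims:)

    use strict;
    use warnings;
    use PDL;

    # ndcoords accepts either an ndarray (only its dims matter) or the dims
    # themselves as plain numbers; both build the same coordinate ndarray.
    my $from_pdl  = ndcoords( zeroes( 3, 2 ) );
    my $from_dims = ndcoords( 3, 2 );

    print all( $from_pdl == $from_dims ) ? "same\n" : "differ\n";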

    It's actually tempting to try to improve the Game of Life PDL implementation from the tutorial:

    use strict;
    use warnings;
    use experimental qw/ say postderef signatures /;
    use Time::HiRes 'time';
    use PDL;
    use PDL::NiceSlice;
    use Test::PDL 'eq_pdl';

    use constant STEPS => 100;

    my $x = zeroes( 200, 200 );

    # Put in a simple glider.
    $x(1:3,1:3) .= pdl ( [1,1,1], [0,0,1], [0,1,0] );

    my $backup = $x-> copy;

    printf "Game of Life!\nMatrix: %s, %d generations\n", $x-> info, STEPS;

    # Tutorial
    my $t = time;
    my $ct = 0;
    for ( 1 .. STEPS ) {
        my $t_ = time;
        # Calculate the number of neighbours per cell.
        my $n = $x->range(ndcoords($x)-1,3,"periodic")->reorder(2,3,0,1);
        $n = $n->sumover->sumover - $x;
        $ct += time - $t_;
        # Calculate the next generation.
        $x = ((($n == 2) + ($n == 3))* $x) + (($n==3) * !$x);
    }
    printf "Tutorial: %0.3f s (core time: %0.3f)\n", time - $t, $ct;

    # "Lags"
    my $m = $backup-> copy;
    $t = time;
    $ct = 0;
    for ( 1 .. STEPS ) {
        my $t_ = time;
        # Calculate the number of neighbours per cell.
        my $n = sms_GoL_lags( $m ) - $m;
        $ct += time - $t_;
        # Calculate the next generation.
        $m = ((($n == 2) + ($n == 3))* $m) + (($n == 3) * !$m);
    }
    printf "\"lags\": %0.3f s (core time: %0.3f)\n", time - $t, $ct;

    die unless eq_pdl( $x, $m );

    sub _do_dimension_GoL ( $m ) {
        $m-> slice( -1 )-> glue( 0, $m, $m-> slice( 0 ))
          -> lags( 0, 1, ( $m-> dims )[0] )
          -> sumover
          -> slice( '', '-1:0' )
          -> xchg( 0, 1 )
    }

    sub sms_GoL_lags ( $m ) {
        _do_dimension_GoL _do_dimension_GoL $m
    }

    __END__
    Game of Life!
    Matrix: PDL: Double D [200,200], 100 generations
    Tutorial: 1.016 s (core time: 0.835)
    "lags":   0.283 s (core time: 0.108)

    Sorry about the crude profiling/tests; and the improvement falls somewhat short of what I expected. Even with the "core time" singled out -- because the next-generation calculation is not very efficient (e.g. the $n == 3 array is built twice), but that's another story -- it is "only" 8x better. Maybe all this glueing/appending to maintain a constant matrix size and "wrap around" at the edges takes its toll.

      Hi Steve (assuming it's you),

      I haven't spent any effort on optimizing the solution. Thank you for running the tests! It is interesting to see that ->clump(2)->sumover doesn't perform as well as ->sumover->sumover. Actually, I didn't look up the example in PDL::Broadcasting but wrote my solution from memory. Looks like I'm getting older :-)

      Greetings,
      -jo

      $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

      Here is the last (hopefully) instalment in this thread; it only concerns a faster "Game of Life" implementation compared to the one in the PDL CPAN tutorial. But maybe some of the features used are too advanced for a tutorial. The improvement comes from:

      • To achieve wrap-around, the ugly glueing in the parent node is replaced with dicing along each axis (dice_axis is a very convenient function; see the short sketch after the code below).
      • The horrible expression to compute the next generation (8 operations on whole arrays) is replaced with a shorter version (just 4).
      use strict;
      use warnings;
      use feature 'say';
      use Time::HiRes 'time';
      use PDL;
      use PDL::NiceSlice;
      use Test::PDL 'eq_pdl';

      use constant {
          WIDTH  => 20,
          HEIGHT => 20,
          STEPS  => 1000,
      };

      my $x = zeroes long, WIDTH, HEIGHT;

      # Put in a simple glider.
      $x(1:3,1:3) .= pdl ( [1,1,1], [0,0,1], [0,1,0] );

      my $backup = $x-> copy;

      printf "Game of Life!\nMatrix: %s, %d generations\n", $x-> info, STEPS;

      # Tutorial
      my $t = time;
      for ( 1 .. STEPS ) {
          # Calculate the number of neighbours per cell.
          my $n = $x->range(ndcoords($x)-1,3,"periodic")->reorder(2,3,0,1);
          $n = $n->sumover->sumover - $x;
          # Calculate the next generation.
          $x = ((($n == 2) + ($n == 3))* $x) + (($n == 3) * !$x);
      }
      printf "Tutorial: %0.3f s\n", time - $t;

      # Faster
      my $m = $backup-> copy;
      $t = time;

      my $wrap_w = pdl [ reverse WIDTH - 1,  ( 0 .. WIDTH - 1 ),  0 ];
      my $wrap_h = pdl [ reverse HEIGHT - 1, ( 0 .. HEIGHT - 1 ), 0 ];

      for ( 1 .. STEPS ) {
          my $n = $m
            -> dice_axis( 0, $wrap_w )
            -> lags( 0, 1, WIDTH )
            -> sumover
            -> dice_axis( 1, $wrap_h )
            -> lags( 1, 1, HEIGHT )
            -> xchg( 0, 1 )
            -> sumover;
          $n -= $m;
          $m = ( $n == 3 ) | $m & ( $n == 2 )
      }
      printf "Faster: %0.3f s\n", time - $t;

      die unless eq_pdl( $x, $m );

      __END__
      Game of Life!
      Matrix: PDL: Long D [20,20], 1000 generations
      Tutorial: 0.341 s
      Faster:   0.111 s

      Matrix: PDL: Long D [200,200], 100 generations
      Tutorial: 0.845 s
      Faster:   0.086 s

      Matrix: PDL: Long D [1000,1000], 20 generations
      Tutorial: 4.422 s
      Faster:   0.443 s

      Matrix: PDL: Long D [2000,2000], 10 generations
      Tutorial: 8.878 s
      Faster:   0.872 s
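      (As a side note on the wrap-around trick -- a minimal dice_axis sketch added here for illustration, with simplified indices rather than the exact reversed $wrap_w used above:)

      use strict;
      use warnings;
      use PDL;

      # Periodic padding along dim 0: prepend the last column and append the
      # first one, so every original column has both of its wrapped neighbours.
      my $m      = sequence( 4, 3 );             # 4 columns, 3 rows
      my $idx    = pdl( 3, 0, 1, 2, 3, 0 );      # last, 0 .. 3, first
      my $padded = $m-> dice_axis( 0, $idx );    # a 6x3 view of $m

      print $padded;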

        Oops, sorry, one more: the subtraction is not required, of course, and, more importantly, the next-generation expression can be split into 2 lines -- the same 4 ops, but presumably done in-place, i.e. faster:

        # With the subtraction dropped, $n includes the cell itself:
        # a live cell survives when $n is 3 or 4; a dead cell is born when $n is 3.
        $m &= $n == 4;    # keep live cells that see a total of 4 (3 neighbours)
        $m |= $n == 3;    # any cell with a total of 3 is alive next generation

        __END__
        Game of Life!
        Matrix: PDL: Long D [2000,2000], 10 generations
        Faster+: 0.677 s
