Re^9: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster)

by marioroy (Prior)
on May 17, 2022 at 20:21 UTC


in reply to Re^8: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster)
in thread Optimizing with Caching vs. Parallelizing (MCE::Map)

Confirming on not seeing recursion limit warnings :)

I tried various examples on PerlMonks. I get WARNINGs with PDL 2.079 when running demonstration 11115875; that was not the case when the example was created.

WARNING: PDL::Primitive::vsearch_insert_leftmost does not handle bad values.
WARNING: PDL::Primitive::vsearch_insert_leftmost does not handle bad values.
...

Unfortunately, vr's serial demonstration 11116069 runs poorly unless auto-parallelization is disabled: 42s by default versus 19s with targ(1).

PDL::set_autopthread_targ(1);

Using the 2nd example 11116094 (the one for Windows, which also runs on UNIX), there is no improvement beyond 8 workers on the Windows platform. On Linux there is no such limit: adding workers, up to the number of logical cores, keeps reducing the run time. So I updated the code to cap at 8 workers max on Windows. Strawberry Perl 5.32.1.1 PDL edition (with the included PDL 2.021) is noticeably faster than the same Perl with PDL 2.079. Is that the case for you?
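The Windows cap described above could be sketched as follows. This is a minimal illustration, not the code in demo_win.pl: the helper name pick_workers is hypothetical, and the core count is passed in by hand so the snippet needs nothing beyond core Perl.

```perl
use strict;
use warnings;
use List::Util 'min';

# Hypothetical helper (not taken from demo_win.pl): choose a worker count
# from the OS name and the number of logical cores.
sub pick_workers {
    my ( $os, $ncpu ) = @_;

    # 8 is the ceiling observed on Windows in this thread; other
    # platforms may use every logical core.
    my $cap = $os eq 'MSWin32' ? 8 : $ncpu;
    return min( $ncpu, $cap );
}

printf "%-8s %2d cores -> %2d workers\n", $_->[0], $_->[1], pick_workers(@$_)
    for [ 'MSWin32', 64 ], [ 'MSWin32', 4 ], [ 'linux', 64 ];
```

In a real script one would presumably pass $^O and a detected core count, for example from MCE::Util::get_ncpu(), instead of the literal values used here.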

PDL 2.021 using Strawberry Perl 5.32.1.1 PDL edition (max 8 workers on Windows)
perl demo_win.pl 1e7 :  3.778 seconds
perl demo_win.pl 1e8 : 37.008 seconds

PDL 2.079 using Strawberry Perl 5.32.1.1 PDL edition (PDL updated to 2.079)
perl demo_win.pl 1e7 :  4.143 seconds
perl demo_win.pl 1e8 : 42.302 seconds

I ran the same demonstration on Ubuntu Linux 20.04 using Perl 5.30.0 and PDL 2.079.

perl demo_win.pl 1e7 :  3.288 seconds  ( 8 workers)
perl demo_win.pl 1e8 : 41.899 seconds

perl demo_win.pl 1e7 :  2.331 seconds  (16 workers)
perl demo_win.pl 1e8 : 23.329 seconds

perl demo_win.pl 1e7 :  1.642 seconds  (24 workers)
perl demo_win.pl 1e8 : 16.941 seconds

perl demo_win.pl 1e7 :  1.310 seconds  (32 workers)
perl demo_win.pl 1e8 : 13.198 seconds

perl demo_win.pl 1e7 :  1.139 seconds  (40 workers)
perl demo_win.pl 1e8 : 10.925 seconds

perl demo_win.pl 1e7 :  1.004 seconds  (48 workers)
perl demo_win.pl 1e8 :  9.791 seconds

perl demo_win.pl 1e7 :  0.946 seconds  (56 workers)
perl demo_win.pl 1e8 :  8.913 seconds

perl demo_win.pl 1e7 :  0.877 seconds  (64 workers)
perl demo_win.pl 1e8 :  8.305 seconds
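To put the Linux numbers in perspective, the sketch below computes speedup and parallel efficiency relative to the 8-worker run for the 1e8 case. The timings are copied from the table above; the function name scaling and the choice of 8 workers as the baseline are my own assumptions for illustration.

```perl
use strict;
use warnings;

# Timings in seconds for "perl demo_win.pl 1e8", copied from the Linux runs.
# Each entry is [ workers, seconds ].
my @runs = (
    [  8, 41.899 ], [ 16, 23.329 ], [ 24, 16.941 ], [ 32, 13.198 ],
    [ 40, 10.925 ], [ 48,  9.791 ], [ 56,  8.913 ], [ 64,  8.305 ],
);

# Returns [ workers, speedup, efficiency ] for each run, where the first
# run is the baseline and efficiency = speedup / (workers / base workers).
sub scaling {
    my ( $base, @rest ) = @_;
    return map {
        my $speedup = $base->[1] / $_->[1];
        [ $_->[0], $speedup, $speedup / ( $_->[0] / $base->[0] ) ];
    } $base, @rest;
}

printf "%2d workers: %.2fx speedup, %3.0f%% efficiency\n",
    $_->[0], $_->[1], 100 * $_->[2] for scaling(@runs);
```

Going from 8 to 64 workers is about a 5x speedup for 8x the workers, so the gains continue on Linux but with diminishing returns.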

Replies are listed 'Best First'.
Re^10: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster)
by etj (Deacon) on May 18, 2022 at 23:13 UTC
    Glad to hear the recursion problem is solved.

    I haven't tried this, I'm afraid, as I'm currently fixing up PDL's macro mechanism, which has required a bit of a rejig of the whole code-generation stuff.

    It would be incredibly helpful if someone could run the two versions of PDL with Devel::NYTProf and reply here with roughly where the slowdown is. I appreciate it might be within the C code, but more information seems like it would be better.

      First, I disabled SMT (hyperthreading) to ensure two threads do not run on the same physical core. Next, I increased chunk size from 40,000 to 200,000 to better understand the time gap between PDL 2.021 and PDL 2.079.

      PDL 2.021: perl -d:NYTProf demo_win.pl 1e7  # 5.238 secs.
      PDL 2.079: perl -d:NYTProf demo_win.pl 1e7  # 9.511 secs.

      With PDL 2.079 there are many subroutine calls that are not present with PDL 2.021. They are not the reason for the slowness; I am noting them only because of the high number of calls. To be sure, I reverted the File::Which change locally and the slowdown remained.

      Calls  P  F  ExTime  InTime  Subroutine
      61776  1  1  186ms   186ms   File::Which::CORE::regcomp (opcode)
         18  2  1  119ms   394ms   File::Which::which
       5616  1  1  66.6ms  66.6ms  File::Which::CORE:ftdir (opcode)
      61776  1  1  15.9ms  15.9ms  File::Which::CORE:match (opcode)
       5616  1  1  800µs   800µs   File::Which::CORE:fteexec (opcode)
       5148  1  1  668µs   668µs   File::Which::CORE:ftis (opcode)

      Testing was done using Strawberry Perl v5.32.1.1 - PDL edition. I extracted the bundle twice, to C:\perl-5.32.0.1-PDL and C:\perl-5.32.0.1-recent. In the second copy I updated PDL from 2.021 to 2.079 by obtaining PDL 2.079 and running perl Makefile.PL followed by gmake install.

      I'm hoping that someone on the PDL team can take similar steps to determine the issue. I ran with 8 workers in demo_win.pl. The slowness is also present on Linux. PDL 2.021 (2.190 secs) vs PDL 2.079 (2.913 secs).

      Modules

      File::Map 0.67
      MCE 1.878

      Update: The following is a test script factoring out MCE, File::Map, and PDL::IO::FastRaw.

      use strict;
      use warnings;
      use feature 'say';

      use PDL;
      use Time::HiRes 'time';

      {
          no warnings 'once';
          $PDL::BIGPDL = 1;
          eval q{ PDL::set_autopthread_targ(1) };
      }

      use constant MAX    => shift || 500000;
      use constant MAXLEN => MAX * 1;

      my $t = time;

      my $lengths = ones( short, 3 + MAXLEN );
      $lengths-> inplace-> setvaltobad( 1 );
      $lengths-> set( 1, 1 );
      $lengths-> set( 2, 2 );
      $lengths-> set( 4, 3 );

      my ($from, $to) = (0, MAX);

      my $seqs_c = $from + sequence( longlong, $to - $from + 1 );
      $seqs_c-> setbadat( 0 );
      $seqs_c-> setbadat( 1 );
      $seqs_c-> badvalue( 2 );

      my $lengths_c = $lengths-> slice([ $from, $to ]);
      my $current   = zeroes( short, nelem( $seqs_c ));

      while ( any $seqs_c-> isgood ) {
          my ( $seqs_c_odd, $current_odd_masked )
              = where( $seqs_c, $current, $seqs_c & 1 );
          $current_odd_masked ++;
          $current ++;
          ( $seqs_c_odd *= 3 ) ++;
          $seqs_c >>= 1;

          my ( $seqs_cap, $lengths_cap, $current_cap )
              = where( $seqs_c, $lengths_c, $current, $seqs_c <= MAXLEN );

          my $lut = $lengths-> index( $seqs_cap );

          # "_f" is for "finished"
          my ( $seqs_f, $lengths_f, $lut_f, $current_f )
              = where( $seqs_cap, $lengths_cap, $lut, $current_cap, $lut-> isgood );

          $lengths_f .= $lut_f + $current_f;
          $seqs_f .= 2;    # i.e. BAD
      }

      say {*STDERR} time - $t;

      PDL 2.021 is noticeably faster than PDL 2.079.

      $PDL::BIGPDL = 1;
      PDL 2.021: perl test.pl 2e6  # 4.590 secs.
      PDL 2.079: perl test.pl 2e6  # 7.252 secs.

      # $PDL::BIGPDL = 1;  # line commented out
      PDL 2.021: perl test.pl 2e6  # 4.490 secs.
      PDL 2.079: perl test.pl 2e6  # 5.252 secs.

        Thanks, this is amazing work! How much work would it be to try each of the operations within the script some suitable number of times, on each PDL version (or at least isgood), in order to see if any of them stand out as the one that's got slower? (or conceivably it's across the board).

        Immediate surmise: the badvalue detection functionality was updated in 2.040 to allow NaN to be used as a badvalue, with fixes to that in 2.064. Ergo, if you do have capacity to performance-test at least isgood and see that's what's slowed down, that will at least be possible to fix. One approach there would be to move the "badvalue is NaN" branch out of the current broadcast loop into its own loop, so that nearly every operation stops re-checking a value that doesn't change (the badvalue) to see whether it is, or is not, a NaN.
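The two suggestions above (time individual operations over a number of repetitions, and hoist the "badvalue is NaN" test out of the per-element loop) can be sketched in plain Perl. This is only an illustration of the loop structure, not PDL's generated C code: count_good_checked, count_good_hoisted and time_op are hypothetical names, and in an interpreted loop the measured difference may well be lost in interpreter overhead.

```perl
use strict;
use warnings;
use Time::HiRes 'time';

# Variant A: asks "is the badvalue NaN?" again for every element.
sub count_good_checked {
    my ( $bad, $data ) = @_;
    my $n = 0;
    for my $x (@$data) {
        if ( $bad != $bad ) { $n++ if $x == $x }      # NaN badvalue: good means not NaN
        else                { $n++ if $x != $bad }    # ordinary badvalue
    }
    return $n;
}

# Variant B: asks once, then runs a loop specialised for that case.
sub count_good_hoisted {
    my ( $bad, $data ) = @_;
    my $n = 0;
    if ( $bad != $bad ) {
        for my $x (@$data) { $n++ if $x == $x }
    }
    else {
        for my $x (@$data) { $n++ if $x != $bad }
    }
    return $n;
}

# Average wall-clock seconds per call over $reps repetitions.
sub time_op {
    my ( $reps, $code ) = @_;
    my $t0 = time();
    $code->() for 1 .. $reps;
    return ( time() - $t0 ) / $reps;
}

my $nan  = 9**9**9 - 9**9**9;            # inf - inf yields NaN
my @data = ( 1 .. 200_000, $nan, $nan );

for my $bad ( 2, $nan ) {
    my %op = (
        checked => sub { count_good_checked( $bad, \@data ) },
        hoisted => sub { count_good_hoisted( $bad, \@data ) },
    );
    printf "badvalue=%-4s %-8s %.6f s/call\n", $bad, $_, time_op( 20, $op{$_} )
        for sort keys %op;
}
```

To compare the two PDL installations, the same time_op harness could presumably wrap closures such as sub { $seqs_c->isgood } or the where and index calls from the test script, run once under each Perl.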
