Optimizing with Caching vs. Parallelizing (MCE::Map)

Replies are listed 'Best First'.
Re: Optimizing with Caching vs. Parallelizing (MCE::Map) by marioroy (Prior) on Apr 06, 2020 at 05:55 UTC
Hi 1nickt, One may simulate extra CPU time to better understand the benefit of caching. I converted Laurent_R's example to consume multiple CPU cores. Laurent's example incurs extra overhead due to calling a subroutine inside the loop and returning a full list. I changed the number of iterations from 1 million to 100 thousand now that simulating extra CPU cycles. Nick's example use strict; use warnings; use feature 'say'; use MCE::Util; use MCE::Map max_workers => MCE::Util::get_ncpu(); use Time::HiRes 'usleep'; my @output = mce_map_s { my $input = $_; my $n = $input; my @result = $input; while ( $n != 1 ) { $n = $n % 2 ? 3 * $n + 1 : $n / 2; usleep 20; # simulate extra computation push @result, $n; } return [ $input, scalar @result ]; } 1, 100000; MCE::Map->finish; @output = sort { $b->[1] <=> $a->[1] } @output; say sprintf('%s : length %s', $_->[0], $_->[1]) for @output[0..19]; [download] Laurent's example - no caching use strict; use warnings; use feature qw/say/; use MCE::Util; use MCE::Map max_workers => MCE::Util::get_ncpu(); use Time::HiRes qw/usleep/; sub collatz_seq { my $n = shift; my @result; while ($n != 1) { my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2; usleep 20; # simulate extra computation push @result, $new_n; $n = $new_n; } return @result; } my @long_seqs = mce_map_s { my $num = $_; my @seq = ($num, collatz_seq $num); return [ $num, scalar @seq ]; } 1, 100000; MCE::Map->finish; @long_seqs = sort { $b->[1] <=> $a->[1] } @long_seqs; say "$_->[0] : length $_->[1]" for @long_seqs[0..19]; [download] Laurent's example - with caching use strict; use warnings; use feature qw/say/; use constant MAX => 300000; use MCE::Util; use MCE::Map max_workers => MCE::Util::get_ncpu(); use Time::HiRes qw/usleep/; my %cache; sub collatz_seq { my $input = shift; my $n = $input; my @result; while ($n != 1) { if (exists $cache{$n}) { push @result, @{ $cache{$n} }; last; } my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2; usleep 20; # simulate extra computation push @result, $new_n; $cache{$n} = [ $new_n, @{ $cache{$new_n} } ] if $n < MAX && exists $cache{$new_n}; $n = $new_n; } $cache{$input} = \@result if $n < MAX; return @result; } my @long_seqs = mce_map_s { my $num = $_; my @seq = ($num, collatz_seq $num); return [ $num, scalar @seq ]; } 1, 100000; MCE::Map->finish; @long_seqs = sort { $b->[1] <=> $a->[1] } @long_seqs; say "$_->[0] : length $_->[1]" for @long_seqs[0..19]; [download] Program output `77031 : length 351 52527 : length 340 78791 : length 338 60975 : length 335 87087 : length 333 88059 : length 333 91463 : length 333 63387 : length 330 95081 : length 328 99067 : length 328 99721 : length 328 71310 : length 325 71311 : length 325 74791 : length 325 74793 : length 325 35655 : length 324 53483 : length 322 56095 : length 322 80225 : length 320 81159 : length 320` [download] Time to run with 12 cores using a Linux VM Notice the extra 4 seconds user time (likely function overhead inside loop), but not felt in real time (wall clock time). `Nick's example: 1m32.958s real, 0m56.561s user - no caching Laurent's example: 1m33.454s real, 1m00.491s user - no caching Laurent's example: 0m18.761s real, 0m20.990s user - caching` [download] Caching is helpful here (5x faster) because I added sleep to simulate more CPU cycles. However, caching is not helpful when the overhead involved is greater than the computation itself as demonstrated by 1nickt's comparison above. Regards, Mario	[reply] [d/l] [select]
Re: Optimizing with Caching vs. Parallelizing (MCE::Map) by davido (Cardinal) on Apr 05, 2020 at 21:47 UTC
Hi Nick. I wonder if you could use an algorithm that could actually benefit both from parallel processing and from caching. If you take your data set and interleaf it /8 or /12 (for your 12 cores), and then do the caching within each interleaf, you will have to regenerate the cache for each worker, but each worker maintains its own cache so that there's no per-iteration write back. iteration 1 iteration 2 iteration 3 iteration 4 worker 1 1 9 17 25 worker 2 2 10 18 26 worker 3 3 11 19 27 worker 4 4 12 20 28 worker 5 5 13 21 29 worker 6 6 14 22 30 worker 7 7 15 23 31 worker 8 8 16 24 32 So worker 1 gets 1, 9, 17, 25, ... to work on. Worker 2 gets 2, 10, 18, 26. Worker 3 gets 3, 11, 19, 27. And so on. Within each worker implement caching. If I were using Parallel::ForkManager I would have them each return a structure identifying the worker starting offset, and the sequences. That way you don't have to compose them back into a master structure at the end for output, you can sequence them round-robin based on the starting offset of each. I've sat down a few times today with the intent to give it a try, but keep getting interrupted before I get very far into it. So it's probably not going to happen. Interleafing ought to allow each worker to build a cache pretty quickly. Dave	[reply]
Re^2: Optimizing with Caching vs. Parallelizing (MCE::Map) by 1nickt (Canon) on Apr 06, 2020 at 01:00 UTC
Hi davido, I did indeed try that approach, and while I did not interleave the sequence, as I mentioned MCE chunks it up. (I don't think interleaving would really much affect the cache hits, as the caching tested stores the sequence for each number found, and the algorithm produces sequences that themselves include numbers greater than `n`, so such a boundary would be difficult to enforce.) In testing, with each worker building and using its own cache for the chuck processed, the program ran almost three times slower than just letting the workers hammer the CPU. Again, I believe it's because in this case the overhead of caching outweighs its benefits. The way forward always starts with a minimal test.	[reply] [d/l]
Re: Optimizing with Caching vs. Parallelizing (MCE::Map) (Better caching?) by vr (Curate) on Apr 07, 2020 at 12:43 UTC
FWIW, here's a better single-threaded caching algorithm, which at my machine with 4 puny little cores outperforms others I checked: use strict; use warnings; use feature 'say'; use Data::Dump 'dd'; use Time::HiRes 'time'; my $t = time; my %cache = ( 1 => [ 1, 1, undef ]); # # [ seq_length, start_num (same as key), next_num ] # sub _cache_collatz { my $n = shift; return if exists $cache{ $n }; my ( @tmp , $new_n ); while ( !exists $cache{ $n }) { $new_n = $n % 2 ? 3 * $n + 1 : $n >> 1; push @tmp, $cache{ $n } = [ -$#tmp, $n, $new_n ]; $n = $new_n } my $d = $#tmp + $cache{ $n }[ 0 ]; $_-> [ 0 ] += $d for @tmp } sub collatz { my $n = shift; _cache_collatz( $n ); my @a = $n; push @a, $n while $n = $cache{ $n }[ 2 ]; return \@a; } ## Demo # # dd collatz( 23 ); # dd \%cache; # exit( 0 ); # ## _cache_collatz( $_ ) for 1 .. 1e6; say "@$_[ 1, 0 ]" for ( sort { $b-> [ 0 ] <=> $a-> [ 0 ]} @cache{ 1 .. 1e6 })[ 0 .. 19 ]; say time - $t; __END__ 837799 525 626331 509 939497 507 704623 504 910107 476 927003 476 511935 470 767903 468 796095 468 970599 458 546681 452 818943 450 820022 450 820023 450 410011 449 615017 447 886953 445 906175 445 922524 445 922525 445 6.44625306129456 [download] For comparison: laurent.pl: 30.7539658546448 choroba.pl: 58.0188150405884 1nickt_orig.pl: 11.0160529613495 1nickt.pl: 7.18484592437744 Apparently, the last one will beat my "better caching" if run on 6 or more cores, but, like I said, I have only 4 :)	[reply] [d/l]
Re^2: Optimizing with Caching vs. Parallelizing (MCE::Map) (Better caching?) by marioroy (Prior) on Apr 08, 2020 at 02:40 UTC
Hi vr, Wow! Curiosity moved me to try with MCE. But first, I optimized Nick's code. Also, workers now send only the relevant data and not the entire list. This decreases the time needed for the manager process to output results. Update 1: Slight optimization to Choroba's example (i.e. >> 1). Added collatz_compress. Update 2: Added collatz_count. Update 3: Improved collatz_count. Update 4: Added Inline C times. Demo choroba, 1nickt, compress and count all in one. use strict; use warnings; use feature 'say'; use MCE::Flow; sub collatz_choroba { my ($n) = @_; my @seq = $n; push @seq, ( $seq[-1] >> 1, 3 * $seq[-1] + 1 )[ $seq[-1] % 2 ] while $seq[-1] != 1; return scalar @seq; } sub collatz_1nickt { my ($n) = @_; my @seq = $n; push @seq, $n = $n % 2 ? 3 * $n + 1 : $n >> 1 while $n != 1; return scalar @seq; } sub collatz_compress { my ($n) = @_; my @seq = $n; # Notation and compressed dynamics (2:15 into the video) # https://www.youtube.com/watch?v=t1I9uHF9X5Y # consider the map # C(x) = $n % 2 ? 3 * $n + 1 : $n / 2 # compress the map to the dynamically equivalent # T(x) = $n % 2 ? (3 * $n + 1) / 2 : $n / 2 # compress loop iteration when odd sequence push @seq, $n % 2 ? ( $n = 3 * $n + 1, $n = $n >> 1 ) : ( $n = $n >> 1 ) while $n != 1; return scalar @seq; } sub collatz_count { my ($n) = @_; my $steps = 1; # count using the T(x) notation $n % 2 ? ( $steps += 2, $n = (3 * $n + 1) >> 1 ) : ( $steps += 1, $n = $n >> 1 ) while $n != 1; return $steps; } #collatz = \&collatz_choroba; # choose collatz here #collatz = \&collatz_1nickt; #collatz = \&collatz_compress; collatz = \&collatz_count; my @sizes; MCE::Flow->init( max_workers => MCE::Util::get_ncpu(), gather => \@sizes, bounds_only => 1, ); mce_flow_s sub { my ($start, $end) = @{ $_[1] }; my @ret = map { [ $_, collatz($_) ] } $start .. $end; MCE->gather( ( sort { $b->[1] <=> $a->[1] } @ret )[ 0..19 ] ); }, 1, 1e6; MCE::Flow->finish; say "@$_" for ( sort { $b->[1] <=> $a->[1] } @sizes )[ 0..19 ]; [download] Demo caching by vr use strict; use warnings; use feature 'say'; use MCE::Flow; my %cache = ( 1 => [ 1, 1, undef ]); # # [ seq_length, start_num (same as key), next_num ] # sub collatz_vr { my $n = shift; return if exists $cache{ $n }; my ( @tmp, $new_n ); while ( !exists $cache{ $n } ) { $new_n = $n % 2 ? 3 * $n + 1 : $n >> 1; push @tmp, $cache{ $n } = [ -$#tmp, $n, $new_n ]; $n = $new_n } my $d = $#tmp + $cache{ $n }[ 0 ]; $_->[ 0 ] += $d for @tmp; } my @sizes; mce_flow_s { max_workers => MCE::Util::get_ncpu(), gather => \@sizes, bounds_only => 1, }, sub { my ($start, $end) = @{ $_[1] }; collatz_vr($_) for $start .. $end; MCE->gather( ( sort { $b->[0] <=> $a->[0] } @cache{ $start .. $end } )[ 0 . +. 19 ] ); }, 1, 1e6; MCE::Flow->finish; say "@$_[ 1, 0 ]" for ( sort { $b->[0] <=> $a->[0] } @sizes )[ 0 .. 19 ]; [download] Run time on a 32 core machine - SMT disabled The times include launching Perl, loading modules, spawning workers, reaping workers, and output (~ 0.100 seconds). See the next post for the Inline C demonstration. max_workers => 1 collatz_vr 3.617s 1.0x collatz_choroba 14.896s 1.0x collatz_1nickt 12.264s 1.0x collatz_compress 8.932s 1.0x collatz_count 7.021s 1.0x collatz_count_c1 0.741s 1.0x Inline C collatz_count_c2 0.625s 1.0x Inline C max_workers => 2 collatz_vr 2.707s 1.3x collatz_choroba 7.538s 2.0x collatz_1nickt 6.141s 2.0x collatz_compress 4.490s 2.0x collatz_count 3.537s 2.0x collatz_count_c1 0.398s 1.9x Inline C collatz_count_c2 0.339s 1.8x Inline C max_workers => 4 collatz_vr 1.906s 1.9x collatz_choroba 3.948s 3.8x collatz_1nickt 3.274s 3.7x collatz_compress 2.365s 3.8x collatz_count 1.873s 3.7x collatz_count_c1 0.235s 3.2x Inline C collatz_count_c2 0.203s 3.1x Inline C max_workers => 8 collatz_vr 1.458s 2.5x collatz_choroba 1.995s 7.5x collatz_1nickt 1.655s 7.4x collatz_compress 1.209s 7.4x collatz_count 0.961s 7.3x collatz_count_c1 0.149s 5.0x Inline C collatz_count_c2 0.132s 4.7x Inline C max_workers => 16 collatz_vr 0.976s 3.7x collatz_choroba 1.034s 14.4x collatz_1nickt 0.858s 14.3x collatz_compress 0.637s 14.0x collatz_count 0.509s 13.8x collatz_count_c1 0.114s 6.5x Inline C collatz_count_c2 0.105s 6.0x Inline C max_workers => 32 collatz_vr 0.837s 4.3x collatz_choroba 0.593s 25.1x collatz_1nickt 0.497s 24.7x collatz_compress 0.379s 23.6x collatz_count 0.310s 22.6x collatz_count_c1 0.112s 6.6x Inline C collatz_count_c2 0.108s 5.8x Inline C [download] Interesting. For vr's demo, every worker starts with an empty cache. Meaning that workers do not have cached results from prior chunks. This is the reason not scaling versus the non-cache demonstrations. collatz_count (2 cores) reaches collatz_vr (1 core). collatz_count (4 cores) reaches collatz_vr (4 cores). Kind regards, Mario	[reply] [d/l] [select]
Re^3: Optimizing with Caching vs. Parallelizing (MCE::Map) (Inline C) by marioroy (Prior) on Apr 11, 2020 at 04:37 UTC
Hi, There was one thing left to try and that is using Inline::C. I captured the top 20 for 1e9 and 1e10. Update 1: Set chunk_size to reduce IPC. Tried staying within CPU cache by keeping @ret small. 1e10 now completes in 4 minutes (20% improvement). Update 2: Added collatz_count_c2 using GCC compiler intrinsics. use strict; use warnings; use feature 'say'; use MCE::Flow; use Inline C => Config => CCFLAGSEX => '-O2 -fomit-frame-pointer', clean_after_build => 0; use Inline C => <<'END_OF_C_CODE'; #include <stdlib.h> #include <stdint.h> #if defined(_WIN32) #define strtoull _strtoui64 #endif void collatz_count_c1( SV* _n ) { uint64_t n, steps = 1; /* include 1 / Inline_Stack_Vars; #ifdef __LP64__ n = SvUV( _n ); #else n = strtoull( SvPV_nolen( _n ), NULL, 10 ); #endif / count using the T(x) notation / do { n % 2 ? ( steps += 2, n = (3 n + 1) >> 1 ) : ( steps += 1, n = n >> 1 ); } while ( n != 1 ); Inline_Stack_Reset; Inline_Stack_Push( sv_2mortal( newSVuv( steps ) ) ); Inline_Stack_Done; } void collatz_count_c2( SV* _n ) { uint64_t n, steps = 1; /* include 1 / Inline_Stack_Vars; #ifdef __LP64__ n = SvUV( _n ); #else n = strtoull( SvPV_nolen( _n ), NULL, 10 ); #endif / based on GCC compiler intrinsics demonstration by Alex Shirley +/ / https://stackoverflow.com/questions/38114205/c-collatz-conjectu +re-optimization / / https://www.go4expert.com/articles/builtin-gcc-functions-builti +nclz-t29238 / if ( n % 2 == 0 ) { steps += __builtin_ctz(n); / account for all evens / n >>= __builtin_ctz(n); / always returns an odd / } / when we enter we're always working on an odd number / do { n = 3 n + 1; steps += __builtin_ctz(n) + 1; /* account for odd and even / n >>= __builtin_ctz(n); / always returns an odd / } while ( n != 1 ); Inline_Stack_Reset; Inline_Stack_Push( sv_2mortal( newSVuv( steps ) ) ); Inline_Stack_Done; } END_OF_C_CODE sub collatz_count { my ($n) = @_; my $steps = 1; # count 1 # count using the T(x) notation $n % 2 ? ( $steps += 2, $n = (3 $n + 1) >> 1 ) : ( $steps += 1, $n = $n >> 1 ) while $n != 1; return $steps; } no warnings 'once'; #collatz = \&collatz_count; # choose collatz here #collatz = \&collatz_count_c1; # using the T(x) notation collatz = \&collatz_count_c2; # using compiler intrinsics my $m = shift \|\| 1e6; my ( @sizes, $chunk_size ); $chunk_size = int( $m / MCE::Util::get_ncpu() / 40 + 1 ); $chunk_size += 1 if $chunk_size % 2; MCE::Flow->init( max_workers => MCE::Util::get_ncpu(), chunk_size => $chunk_size, gather => \@sizes, bounds_only => 1, ); mce_flow_s sub { my ( $start, $end ) = @{ $_[1] }; my @ret; for ( $start .. $end ) { push @ret, [ $_, collatz($_) ]; @ret = ( sort { $b->[1] <=> $a->[1] } @ret )[ 0..19 ] if @ret > 100; } ( @ret > 20 ) ? MCE->gather( ( sort { $b->[1] <=> $a->[1] } @ret )[ 0..19 ] +) : MCE->gather( @ret ); }, 1, $m; MCE::Flow->finish; say "@$_" for ( sort { $b->[1] <=> $a->[1] } @sizes )[ 0..19 ]; [download] Output* 1..1e9 collatz_count 6m10.877s 32 cores Perl collatz_count_c1 11m20.134s 1 core Inline C collatz_count_c2 8m56.683s 1 core Inline C collatz_count_c1 0m23.443s 32 cores 29.0x collatz_count_c2 0m18.507s 32 cores 29.0x 670617279 987 848749995 977 954843745 972 954843751 972 716132809 969 537099606 966 537099607 966 268549803 965 805649409 964 805649410 964 805649411 964 805649415 964 402824705 963 604237057 961 604237058 961 604237059 961 302118529 960 906355586 959 906355587 959 906355588 959 1..1e10, 32 cores, Inline C collatz_count_c1 4m00.323s collatz_count_c2 3m09.774s 9780657630 1133 9780657631 1133 4890328815 1132 7335493223 1130 6189322407 1122 9283983611 1120 8140184737 1094 6105138553 1091 9157707830 1089 9157707831 1089 4578853915 1088 6868280873 1086 5151210655 1083 7726815983 1081 9282648843 1058 6961986633 1055 5221489974 1052 5221489975 1052 2610744987 1051 7832234962 1050 1..1e11, 32 cores, Inline C collatz_count_c1 40m50.177s collatz_count_c2 32m17.319s 75128138247 1229 84519155529 1224 85833820801 1224 63389366646 1221 63389366647 1221 64375365601 1221 31694683323 1220 95084049969 1219 95084049970 1219 95084049971 1219 96563048402 1219 96563048403 1219 47542024985 1218 48281524201 1218 71313037476 1216 71313037477 1216 71313037478 1216 71313037479 1216 72422286302 1216 72422286303 1216 [download] Look at C go! Regards, Mario	[reply] [d/l] [select]
Re^3: Optimizing with Caching vs. Parallelizing (MCE::Map) (Better caching?) by vr (Curate) on Apr 08, 2020 at 21:40 UTC
Hi marioroy, I think I'm having an issue with MCE on Windows here, timing your code (collatz_vr): `use Time::HiRes 'time'; my $t = time; # the whole script here say time - $t; MCE::Flow->finish; say time - $t; __END__ 1 worker: 6.01482510566711 7.76495599746704 2 workers: 4.12953305244446 7.07751798629761 4 workers: 3.33010196685791 8.4802930355072` [download] 1st measurement approximately matches your output, but 1.5 - 2 seconds per worker to shutdown doesn't look OK to me. For vr's demo, every worker starts with an empty cache. Meaning that workers do not have cached results from prior chunks. This is the reason not scaling as well versus the non-cache demonstrations In other words, lots and lots of work is needlessly duplicated. `_cache_collatz( $_ ) for 1 .. 1e6; say scalar %cache; %cache = (); _cache_collatz( $_ ) for 1 + 4e5 .. 1e6; say scalar %cache; __END__ 2168611 2168611` [download] So, effectively, in situation with e.g. 10 workers, 4 junior workers, filling cache for ranges up to 400_000, are free to slack off and not gather their results at all, and there's large amount of overlap in work of 6 senior workers, too. For now, I have no solid idea how to parallelize this algorithm efficiently.	[reply] [d/l] [select]
Re^4: Optimizing with Caching vs. Parallelizing (MCE::Map) (Better caching?) by marioroy (Prior) on Apr 09, 2020 at 08:30 UTC
Re^2: Optimizing with Caching vs. Parallelizing (MCE::Map) (Better caching?) by 1nickt (Canon) on Apr 07, 2020 at 15:54 UTC
Impressive. ++ The way forward always starts with a minimal test.	[reply]
Re: Optimizing with Caching vs. Parallelizing (MCE::Map) by choroba (Cardinal) on Apr 06, 2020 at 09:32 UTC
> Obviously calculating the sequences for numbers up to 1,000,000 without some optimization would be painfully or maybe unworkably slow Interestingly, the difference is about 15%, see my blogpost about the same task. `map{substr$_->[0],$_->[1]\|\|0,1}[\\|\|{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`	[reply] [d/l]
Re^2: Optimizing with Caching vs. Parallelizing (MCE::Map) by marioroy (Prior) on Apr 06, 2020 at 15:52 UTC
Hi choroba, I tried using multiple cores for your demonstration. Perl has this capability. So little effort and practically transparent. :) `#!/usr/bin/perl use warnings; use strict; use feature qw{ say }; use MCE::Util; use MCE::Map max_workers => MCE::Util::get_ncpu(); sub collatz { my ($start) = @_; my @seq = $start; push @seq, ($seq[-1] / 2, 3 * $seq[-1] + 1)[$seq[-1] % 2] while $seq[-1] != 1; return @seq } my @sizes = mce_map_s { [$_, scalar collatz($_)] } 1, 1e6; say "@$_" for reverse +(sort { $b->[1] <=> $a->[1] } @sizes)[0 .. 19];` [download] Before and after on a 6 core machine with hyper-threading - 12 logical cores. `Serial 48.170 seconds Parallel 9.361 seconds` [download] Regards, Mario	[reply] [d/l] [select]
Re^3: Optimizing with Caching vs. Parallelizing (MCE::Map) by choroba (Cardinal) on Apr 06, 2020 at 16:23 UTC
Interesting! Does mce_map_s return the elements in the corresponding order? As you can see, it's not needed here, so if there's a faster method that doesn't care about the order, we can also switch to that. `map{substr$_->[0],$_->[1]\|\|0,1}[\\|\|{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`	[reply] [d/l]
Re^4: Optimizing with Caching vs. Parallelizing (MCE::Map) by marioroy (Prior) on Apr 06, 2020 at 18:19 UTC
Re^4: Optimizing with Caching vs. Parallelizing (MCE::Map) (Updated) by 1nickt (Canon) on Apr 06, 2020 at 18:49 UTC
Re^5: Optimizing with Caching vs. Parallelizing (MCE::Map) (Updated) by marioroy (Prior) on Apr 07, 2020 at 00:48 UTC
Re: Optimizing with Caching vs. Parallelizing (MCE::Map) by rjt (Curate) on Apr 19, 2020 at 23:31 UTC
Great discussions! I'm sorry I missed all this; I haven't had much time to visit here recently. I'm the one who contributed the task, so I have some interest in this beyond my own personal fascination with the Collatz sequence. My own solution uses memoization and a couple of other optimizations. The full code is at https://github.com/manwar/perlweeklychallenge-club/blob/master/challenge-054/ryan-thompson/perl/ch-2.pl, and my "review" of my solution is at https://perlweeklychallenge.org/blog/review-challenge-054/#ryan-thompson2. That page also contains links to, and reviews of, every solution submitted by the participants in the challenge that week. None of them used MCE::Map, which is a shame! Data structures and bookkeeping were key to good performance on this one, for me: `my @seqlen = (-1,1); # Memoize sequence length my $top = 20; # Report this many of the top sequences my @top = [ -1,-1 ]; # Top $top sequences my $upper = 1e6; # Upper limit starting term my $mintop = 0; # Lowest value in @top GetOptions('top=i' => \$top, 'upper=i' => \$upper);` [download] Here is the main loop: `for (my $start = 3; $start < $upper; $start += 2) { my ($n, $len) = ($start, 0); while (! defined $seqlen[$n]) { $len += 1 + $n % 2; $n = $n % 2 ? (3$n + 1)/2 : $n / 2; } $len += $seqlen[$n]; $seqlen[$start] = $len if $start < $upper 2; # Cache top($start => $len) if $len > $mintop and $start < += $upper; top($n * 2 => $seqlen[$n] + 1) if $n < $upper/2 and $seqlen[$n] +> $mintop; }` [download] I avoid function call overhead by making sure everything is done iteratively. As another optimization, instead of simply doing 3n+1 for odd numbers, I do 3n+1/2, and increment the sequence length by two instead of one. And finally, I'm able to skip even-numbered starts with a little creative arithmetic with my second call to `top()`. I decided to ask people to output the top 20, because that presents an interesting mini-challenge by itself. Maintaining it naively by calling `sort` on a million elements at the end takes longer than the above loop, and sorting a 20-item list repeatedly is even worse. Maintaining essentially a priority queue is much faster: `sub top { my ($n, $len) = @_; my $idx = first { $top[$_][1] < $len } 0..$#top; splice @top, $idx, 0, [ $n, $len ]; pop @top if @top > $top; $mintop = $top[-1][1]; }` [download] The above sub is O(n), so it's not as good as a heap implementation, but it's only called when there is definitely a new element to be inserted, thanks to a bit of bookkeeping in `$mintop`, so I opted to keep it simple. Performance: `real 0m0.848s user 0m0.835s sys 0m0.012s` [download] Purely for crude CPU comparison purposes, Laurent's solution in Re: Optimizing with Caching vs. Parallelizing (MCE::Map) runs in 1.57 sec on the same (virtual) machine. `use strict; use warnings;` omitted for brevity.	[reply] [d/l] [select]
Re^2: Optimizing with Caching vs. Parallelizing (MCE::Map) (traps for the unwary) by vr (Curate) on Apr 20, 2020 at 10:00 UTC
Many thanks for providing this wonderful challenge! If all goes at the pace it's rolling now, the fun may well continue into the month of May! Therefore, a warning: some people are at danger of gaining (or loosing) more than what they have bargained for!:) I decided to ask people to output the top 20, because that presents an interesting mini-challenge by itself Well said, dear rjt, well said. But it seems it were YOU, who fell into the trap that you so cunningly crafted for poor innocent learners! The top 20 out of million Collatz sequences has an insidious property: there are six "445" lengths in 1e6, but only four in top-20 (so, all six are in top-22). Which to extract? Aha! Well, the challenge did not clearly state how to order numbers with the same Collatz lengths (CL). But I think it is reasonable to assume, that, since numbers with longer CL are rated better/closer to top, then smaller numbers among producing same CL are to be valued more. Like: "Look! This brave little number commendably creates as long CL as that huge number! And this undeservedly huge number has only managed to generate so puny CL! Loser!" At least, there must be some consistency in arranging results, don't you agree? In other words, I think that if CL column descends, then numbers column must ascend (for equal CLs). See ordering for CL 450, too. Well, dear Perl users, I happened to notice this, because in another my (unpublished) solution I had to endure great pain in arranging top-20 properly. Just how many of you, looking at output of (almost all) scripts in this and related 2 threads, have noticed, that top 20 are neatly ordered? And yet, this "natural" result comes at no cost at all, with Perl! Just appreciate what you so ungratefully consume!:) To illustrate, rjt, here's, in parallel, output of your script and Laurent_R's with marioroy fixes: Collatz(837799) has sequence length of 525 steps \| 837799: 525 Collatz(626331) has sequence length of 509 steps \| 626331: 509 Collatz(939497) has sequence length of 507 steps \| 939497: 507 Collatz(704623) has sequence length of 504 steps \| 704623: 504 Collatz(910107) has sequence length of 476 steps \| 910107: 476 Collatz(927003) has sequence length of 476 steps \| 927003: 476 Collatz(511935) has sequence length of 470 steps \| 511935: 470 Collatz(767903) has sequence length of 468 steps \| 767903: 468 Collatz(796095) has sequence length of 468 steps \| 796095: 468 Collatz(970599) has sequence length of 458 steps \| 970599: 458 Collatz(546681) has sequence length of 452 steps \| 546681: 452 Collatz(820022) has sequence length of 450 steps \| 818943: 450 Collatz(818943) has sequence length of 450 steps \| 820022: 450 Collatz(820023) has sequence length of 450 steps \| 820023: 450 Collatz(410011) has sequence length of 449 steps \| 410011: 449 Collatz(615017) has sequence length of 447 steps \| 615017: 447 Collatz(922526) has sequence length of 445 steps \| 886953: 445 Collatz(922526) has sequence length of 445 steps \| 906175: 445 Collatz(886953) has sequence length of 445 steps \| 922524: 445 Collatz(906175) has sequence length of 445 steps \| 922525: 445 [download] But wait... What's that??? The 922526 number is listed twice, on the left?? Is this... ~~can't believe... is this because of infamous What Every Computer Scientist Should Know About Floating-Point Arithmetic? Because someone:) decided to cut corners and resort to FP?~~ Or is it for another reason? Didn't investigate yet. Just who would have thought that this would surface in so innocent task "calculate lengths of Collatz sequences". Wow! Great challenge! Edit. No, of course it's not floating point issue. Algo is broken. CLs for odd numbers are cached, but for even numbers they are not. Some even numbers are never passed to &top (e.g. 922524), others are pumped into this subroutine several times. The 922526 hadn't been phased out from @top by odd numbers with longer CLs. With $top large enough, there are many even dupes.	[reply] [d/l]
Re^3: Optimizing with Caching vs. Parallelizing (MCE::Map) (traps for the unwary) by rjt (Curate) on Apr 20, 2020 at 19:41 UTC
Ha! Good eye. I didn't even notice the doubled-up 922526. It's not a FP bug. Rather, it's a corner case thanks to how I combined the /2 optimization + memoization; certain even numbers get added to the p.queue twice. It can be fixed trivially by either removing the /2 optimization (simpler, ~5% penalty), or skipping seen numbers in the second call to `top()` (no measurable penalty, adds a variable). As to your interpretation of the "top 20 arrangement," I like your discussion! We try to keep the task descriptions only quasi-formal, to keep the challenge accessible to beginners, which is why you don't usually see these sorts of details specified like a requirements document. Meaning, many "minor" details are left to the discretion of the participants. The upshot of that is, if you submit a really weird interpretation, you'd probably net yourself a mildly amusing comment in my next review, at least. :-) `use strict; use warnings;` omitted for brevity.	[reply] [d/l] [select]
Re^2: Optimizing with Caching vs. Parallelizing (MCE::Map) by marioroy (Prior) on Apr 20, 2020 at 19:07 UTC
Hi rjt, Thank you for this challenge. This consumed so much of my time in a great way. The reason is partly due to, "What if possible for many CPU cores?" But first made attempts for fast using 1 core. Below are the 3 progressive solutions, each one running faster. Update: Added results from two machines. Laurent's demonstration plus updates: #!/usr/bin/env perl use strict; use warnings; my $size = shift \|\| 1e6; $size = 1e6 if $size < 1e6; # minimum $size = 1e9 if $size > 1e9; # maximum ## # Laurent's demonstration + updates # https://www.perlmonks.org/?node_id=11115520 # https://www.perlmonks.org/?node_id=11115540 # # Parallel solution # https://www.perlmonks.org/?node_id=11115544 ## my @cache = (0, 1, 2); my @seqs; sub collatz_seq { my $size = shift; my ($n, $steps); for my $input (2..$size) { $n = $input, $steps = 0; while ($n != 1) { $steps += $cache[$n], last if defined $cache[$n]; $n % 2 ? ( $steps += 2, $n = (3 * $n + 1) >> 1 ) : ( $steps += 1, $n = $n >> 1 ); } $cache[$input] = $steps if $input < $size; push @seqs, [ $input, $steps ] if $steps > 400; } } collatz_seq($size); @seqs = ( sort { $b->[1] <=> $a->[1]} @seqs )[ 0..19 ]; printf "Collatz(%5d) has sequence length of %3d steps\n", @$_ for @seqs; [download] iM71's C++ demonstration converted to Perl plus updates: #!/usr/bin/env perl use strict; use warnings; my $size = shift \|\| 1e6; $size = 1e6 if $size < 1e6; # minimum $size = 1e9 if $size > 1e9; # maximum ## # iM71's demonstration + applied T(x) notation and compression # https://stackoverflow.com/a/55361008 # https://www.youtube.com/watch?v=t1I9uHF9X5Y (1 min into video) # # Parallel solution # https://www.perlmonks.org/?node_id=11115780 ## my @cache = (0, 1, 2); my @seqs; sub collatz_seq { my $size = shift; my ($n, $steps); for my $input (2..$size) { $n = $input, $steps = 0; $n % 2 ? ( $steps += 2, $n = (3 * $n + 1) >> 1 ) : ( $steps += 1, $n = $n >> 1 ) while $n != 1 && $n >= $input; $cache[$input] = $steps += $cache[$n]; push @seqs, [ $input, $steps ] if $steps > 400; } } collatz_seq($size); @seqs = ( sort { $b->[1] <=> $a->[1]} @seqs )[ 0..19 ]; printf "Collatz(%5d) has sequence length of %3d steps\n", @$_ for @seqs; [download] Step counting using Inline C: #!/usr/bin/env perl use strict; use warnings; use Inline C => Config => CCFLAGSEX => '-O2 -fomit-frame-pointer'; use Inline C => <<'END_OF_C_CODE'; #include <stdint.h> void num_steps_c( SV* _n, SV* _s ) { uint64_t n, input; int steps = 0; n = input = SvUV(_n); while ( n != 1 && n >= input ) { n % 2 ? ( steps += 2, n = (3 * n + 1) >> 1 ) : ( steps += 1, n = n >> 1 ); } sv_setuv(_n, n); sv_setiv(_s, steps); return; } END_OF_C_CODE my $size = shift \|\| 1e6; $size = 1e6 if $size < 1e6; # minimum $size = 1e9 if $size > 1e9; # maximum ## # iM71's demonstration + applied T(x) notation and compression # https://stackoverflow.com/a/55361008 # https://www.youtube.com/watch?v=t1I9uHF9X5Y (1 min into video) # # Parallel solution # https://www.perlmonks.org/?node_id=11115780 ## my @cache = (0, 1, 2); my @seqs; sub collatz_seq { my $size = shift; my ($n, $steps); for my $input (2..$size) { num_steps_c($n = $input, $steps); $cache[$input] = $steps += $cache[$n]; push @seqs, [ $input, $steps ] if $steps > 400; } } collatz_seq($size); @seqs = ( sort { $b->[1] <=> $a->[1]} @seqs )[ 0..19 ]; printf "Collatz(%5d) has sequence length of %3d steps\n", @$_ for @seqs; [download] Results from two machines: 64-bit VM: rjt 0.903s Laurent + updates 0.696s iM71 + updates 0.602s Step counting in C 0.273s (1st time involves compiling) AMD 3970x: rjt 0.635s Laurent + updates 0.516s iM71 + updates 0.467s Step counting in C 0.191s (1st time involves compiling) Collatz(837799) has sequence length of 525 steps Collatz(626331) has sequence length of 509 steps Collatz(939497) has sequence length of 507 steps Collatz(704623) has sequence length of 504 steps Collatz(910107) has sequence length of 476 steps Collatz(927003) has sequence length of 476 steps Collatz(511935) has sequence length of 470 steps Collatz(767903) has sequence length of 468 steps Collatz(796095) has sequence length of 468 steps Collatz(970599) has sequence length of 458 steps Collatz(546681) has sequence length of 452 steps Collatz(818943) has sequence length of 450 steps Collatz(820022) has sequence length of 450 steps Collatz(820023) has sequence length of 450 steps Collatz(410011) has sequence length of 449 steps Collatz(615017) has sequence length of 447 steps Collatz(886953) has sequence length of 445 steps Collatz(906175) has sequence length of 445 steps Collatz(922524) has sequence length of 445 steps Collatz(922525) has sequence length of 445 steps [download] Regards, Mario	[reply] [d/l] [select]
Re^3: Optimizing with Caching vs. Parallelizing (MCE::Map) by rjt (Curate) on Apr 20, 2020 at 20:48 UTC
This is great, Mario (and everyone else in this thread, for that matter)! The multicore work is fantastic. I'm very impressed by the level of interest and dedication this "little" question generated. Hopefully we can come up with a few more like it. (And anyone can suggest challenges, by the way.) `use strict; use warnings;` omitted for brevity.	[reply] [d/l]
Re^4: Optimizing with Caching vs. Parallelizing (MCE::Map) by marioroy (Prior) on Apr 20, 2020 at 23:49 UTC
Re: Optimizing with Caching vs. Parallelizing (MCE::Map) by Laurent_R (Canon) on Apr 14, 2020 at 13:53 UTC
Hi dear fellow monks, as a follow-up to my previous post, I implemented my new caching strategy, consisting in storing in the cache the length of the sequences rather than the full sequences. On the computer where I'm running my tests right now, my original program has the following timing: `$ time perl collatz.pl 837799: 525 626331: 509 [lines omitted for brevity] 906175: 445 922524: 445 922525: 445 real 1m37,551s user 1m9,375s sys 0m21,031s` [download] My laptop is obviously much slower than Nick's computer (where this program took 22 seconds). This is now my first implementation with the new caching strategy: #!/usr/bin/perl use strict; use warnings; use feature qw/say/; use constant MAX => 1000000; my %cache = (2 => 2); sub collatz_seq { my $input = shift; my $n = $input; my $result = 0; while ($n != 1) { if (exists $cache{$n}) { $result += $cache{$n}; last; } else { my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2; $result++; $cache{$n} = $cache{$new_n} + 1 if defined $cache{$new_n} and $n < MAX; $n = $new_n; } } $cache{$input} = $result if $input < MAX; return $result; } my @long_seqs; for my $num (1..1000000) { my $seq_length = collatz_seq $num; push @long_seqs, [ $num, $seq_length ] if $seq_length > 400; } @long_seqs = sort { $b->[1] <=> $a->[1]} @long_seqs; say "$_->[0]: $_->[1]" for @long_seqs[0..19]; [download] This program produces the same outcome, but is nearly 3 times faster: `real 0m34,207s user 0m34,108s sys 0m0,124s` [download] But we now end up with a cache having essentially one entry per input number in the `1..1_000_000` range. So, I thought, perhaps it might be better to use an array, rather than a hash, for the cache (accessing an array item should be faster than a hash lookup). This is the code for this new implementation: #!/usr/bin/perl use strict; use warnings; use feature qw/say/; use constant MAX => 1000000; my @cache = (0, 1, 2); sub collatz_seq { my $input = shift; my $n = $input; my $result = 0; while ($n != 1) { if (defined $cache[$n]) { $result += $cache[$n]; last; } else { my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2; $result++; $cache[$n] = $cache[$new_n] + 1 if defined $cache[$new_n] and $n < MAX; $n = $new_n; } } $cache[$input] = $result if $input < MAX; return $result; } my @long_seqs; for my $num (1..1000000) { my $seq_length = collatz_seq $num; push @long_seqs, [ $num, $seq_length ] if $seq_length > 400; } @long_seqs = sort { $b->[1] <=> $a->[1]} @long_seqs; say "$_->[0]: $_->[1]" for @long_seqs[0..19]; [download] With this new implementation, we still obtain the same result, but the program is now more than 55 times faster than my original one (and almost 20 times faster than the solution using a hash for the cache): `$ time perl collatz3.pl 837799: 525 626331: 509 [Lines omitted for brevity] 922524: 445 922525: 445 real 0m1,755s user 0m1,687s sys 0m0,061s` [download] I strongly suspected that using an array would be faster, but I frankly did not expect such a huge gain until I tested it. So, it is true that throwing more CPU cores at the problem makes the solution faster (although to a limited extent with my computer that has only 4 cores). But using a better algorithm can often be a better solution.	[reply] [d/l] [select]
Re^2: Optimizing with Caching vs. Parallelizing (MCE::Map) by marioroy (Prior) on Apr 14, 2020 at 23:21 UTC
Hi Laurent_R and fellow monks, Nice speedup! I applied 3 optimizations in preparation for a follow-up post combining caching with parallelization. This was done because File::Map adds overhead (i.e. it will not be as fast as an array). So, I asked myself can anything be done to further improve the serial implementation. And if yes, is it helpful. I provide the run time for each stage at the end of this post. It turns out that improvements were possible and will be converting collatz3_d.pl to use File::Map for the parallel demonstration. Update: Further improvements, Step 4. 1. Replaced division by 2. `$n >> 1;` [download] 2. Removed one level of branching. `while ($n != 1) { $result += $cache[$n], last if defined $cache[$n]; my $new_n = $n % 2 ? 3 * $n + 1 : $n >> 1; $result++; $cache[$n] = $cache[$new_n] + 1 if defined $cache[$new_n] and $n < $max; $n = $new_n; }` [download] 3. Then reduced the number of loop iterations. Credit for reducing the # of loop iterations was from watching Notation and compressed dynamics, one minute into it (i.e. the T(x) notation). `while ($n != 1) { $result += $cache[$n], last if defined $cache[$n]; $n % 2 ? ( $result += 2, $new_n = (3 * $n + 1) >> 1 ) : ( $result += 1, $new_n = $n >> 1 ); $cache[$n] = $cache[$new_n] + ($n % 2 ? 2 : 1) if defined $cache[$new_n] and $n < $max; $n = $new_n; }` [download] 4. Finally, less caching. `while ($n != 1) { $result += $cache[$n], last if defined $cache[$n]; $n % 2 ? ( $result += 2, $new_n = (3 * $n + 1) >> 1 ) : ( $result += 1, $new_n = $n >> 1 ); $n = $new_n; }` [download] The final code optionally takes an argument. #!/usr/bin/perl use strict; use warnings; use feature qw/say/; my $max = shift \|\| 1e6; $max = 1e6 if $max < 1e6; my @cache = (0, 1, 2); sub collatz_seq { my $input = shift; my $n = $input; my $result = 0; my $new_n; while ($n != 1) { $result += $cache[$n], last if defined $cache[$n]; $n % 2 ? ( $result += 2, $new_n = (3 * $n + 1) >> 1 ) : ( $result += 1, $new_n = $n >> 1 ); $n = $new_n; } $cache[$input] = $result if $input < $max; return $result; } my @long_seqs; for my $num (1..$max) { my $seq_length = collatz_seq $num; push @long_seqs, [ $num, $seq_length ] if $seq_length > 400; } @long_seqs = sort { $b->[1] <=> $a->[1]} @long_seqs; say "$_->[0]: $_->[1]" for @long_seqs[0..19]; [download] Results: $ time perl collatz3_a.pl 1e7 # Intel i7 laptop, Docker Container, Ubuntu + Perlbrew Perl 5.30.1 collatz3_a.pl 1e7 32.291s (a) original code, accepts argument collatz3_b.pl 1e7 30.134s (b) a + replaced division with >> 1 collatz3_c.pl 1e7 28.503s (c) b + removed 1 level of branching collatz3_d.pl 1e7 21.464s (d) c + reduced loop iterations collatz3_e.pl 1e7 19.357s (e) d + less caching # AMD 3970x, Docker Container, Ubuntu + Perlbrew Perl 5.30.1 collatz3_a.pl 1e7 13.130s (a) original code, accepts argument collatz3_b.pl 1e7 12.394s (b) a + replaced division with >> 1 collatz3_c.pl 1e7 12.261s (c) b + removed 1 level of branching collatz3_d.pl 1e7 9.170s (d) c + reduced loop iterations collatz3_e.pl 1e7 7.681s (e) d + less caching 8400511: 686 8865705: 668 6649279: 665 9973919: 663 6674175: 621 7332399: 616 7532665: 616 5649499: 613 8474249: 611 6355687: 608 8847225: 606 9533531: 606 6635419: 603 9953129: 601 7464846: 598 7464847: 598 3732423: 597 5598635: 595 8397953: 593 6298465: 590 [download] Regards, Mario	[reply] [d/l] [select]
Re^3: Optimizing with Caching vs. Parallelizing (MCE::Map) by Laurent_R (Canon) on Apr 16, 2020 at 12:57 UTC
Hi Mario, thanks a lot for your comments and suggestions. I did not expect such micro-optimizations to bring such a significant performance improvement, that's interesting. Especially replacing the division by 2 by a bit shift is the type of thing that I had stopped doing decades ago, when I figured that the C compiler I was using at the time was doing this type of optimization at least as well and often better than I was able to do. Good to be reminded that the optimizations you suggested can also be quite useful. Thank you. Following this very interesting thread and your new thread on MCE::Flow + Caching via File::Map, I made a new blog post on the Collatz sequence, trying to summarize some of the findings: http://blogs.perl.org/users/laurent_r/2020/04/revisiting-the-collatz-sequence-pwc-54.html. I did not try to explain your demonstration using `File::Map` for caching with parallel processing (your other thread) because I'm not sure to fully understand everything and was afraid of mis-representing your work. People can follow the link and read your own words. Thank you very much for your challenging ideas.	[reply] [d/l]
Re^4: Optimizing with Caching vs. Parallelizing (MCE::Map) by marioroy (Prior) on Apr 17, 2020 at 00:00 UTC
Re: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL fun) by vr (Curate) on Apr 21, 2020 at 11:14 UTC
("long read":)) Here's an attempt to solve this Collatz challenge in purely vectorized fashion. As it (slowly) progressed, it soon became apparent the emerging result is somewhat impractical, merely a curiosity; now it's finished (i.e. I can't find how to improve it) and ready to amaze/amuse/frighten the public. The "PDL" was mentioned in this (or 2 related) threads, but it served just as binary container, "set/at" having same role as "substr + pack/unpack", so it wasn't really "PDL", was it. This node sticks to original task as stated: calculate the sequence length for all starting numbers up to 1000000 (1e6), and output the starting number and sequence length for the longest 20 sequences The naive straightforward implementation is simple, it should read without much comment (as "prose":)): use strict; use warnings; use feature 'say'; use PDL; use Time::HiRes 'time'; my $t = time; use constant MAX => 1e6; use constant TOP => MAX < 20 ? MAX : 20; my $seqs = 1 + sequence( longlong, MAX ); my $lengths = ones( short, MAX ); while ( any my $good_mask = $seqs-> inplace -> setvaltobad( 1 ) -> isgood ) { my $odd_mask = $seqs & 1; $lengths-> where( $odd_mask ) ++; $lengths-> where( $good_mask ) ++; ( $seqs-> where( $odd_mask ) = 3 ) ++; $seqs >>= 1; } my $top_i = $lengths-> qsorti -> slice([ MAX - 1, MAX - TOP ]); say $lengths-> index( $top_i ) -> longlong -> cat( $top_i + 1 ) -> transpose; say time - $t; __END__ [ [ 525 837799] [ 509 626331] ... [ 445 938143] [ 445 906175] [ 445 922525] [ 445 922526] ] 7.98023009300232 [download] Above, completed sequences are marked as "BAD", which is just agreed-upon value, treated specially. (1) There are built-in methods to check for good/bad, it saves us comparisons while creating masks. (2) More important: BADs kind of taint whatever they interact with (cf. NaN in FP math), which is quite useful. The timing is sloooooow, yet decent among non-caching solutions around here. Another issue stems from QuickSort being unstable (I recently ranted, in this thread, about neat ordering and "correct 20" extraction). To fix the latter, there's qsortvec (and qsortveci companion) method to sort 2D array (as opposed to "qsort" for 1D, used above), i.e. 1st on 1st axis, then on 2nd. But here's dilemma: (1) build full-height (2 x 1e6) array, qsortvec, extract top-20. Possible, but, for speed, I'd prefer (2): qsort lengths (as above), extract "many enough" but close to 20, build small "2 x N" array, qsortvec, extract correct (and correctly arranged) top-20. For that, find value at MAX - TOP (445), look left, find how much to extract (22). More fuss: 1st column is to descend, 2nd to ascend -- thus temporarily negate (flip bits) one of them. So, huge and unpleasantly looking new "tail" after main loop, in script below, is there to fix top-20 extraction. But in fact it adds almost nothing to consumed time. To improve speed, there are couple of tricks. (1) Where has neat ability to work in list assignment -- same mask for several client piddles. (2) Marking 1's as "BAD" on each iteration is redundant. Piddle can be told to treat any value as bad, automatically. Sadly, setting this "any" to "1" won't work here, because `$seqs & 1` would then result in something like `[BAD 0 BAD 0 BAD 0 ...]`, regardless of having BADs in `$seqs` already. Let's mark "2" as bad, so stopper value in sequence would now be 2 instead of 1: `[ BAD BAD 3 4 5 ... ] # initial $seqs [ 1 2 2 2 2 ... ] # initial $lengths` [download] Other than that, it's the same non-caching approach, with original clarity and simplicity somewhat spoiled by fixes/optimizations: use strict; use warnings; use feature 'say'; use PDL; use Time::HiRes 'time'; my $t = time; use constant MAX => 1e6; use constant TOP => MAX < 20 ? MAX : 20; my $seqs = 1 + sequence( longlong, MAX ); $seqs-> setbadat( 0 ); $seqs-> badvalue( 2 ); my $lengths = ones( short, MAX ); $lengths <<= 1; $lengths-> set( 0, 1 ); while ( any my $good_mask = $seqs-> isgood ) { my ( $seqs_odd, $lengths_odd_masked ) = where( $seqs, $lengths, $seqs & 1 ); $lengths_odd_masked ++; $lengths-> where( $good_mask ) ++; ( $seqs_odd = 3 ) ++; $seqs >>= 1; } my $sorted_i = $lengths-> qsorti; my $sorted = $lengths-> index( $sorted_i ); my $value = $sorted-> at( MAX - TOP ); my $pos = vsearch_insert_leftmost( $value, $sorted ); my $top_i = $sorted_i-> slice([ MAX - 1 , $pos ]); ( my $result = $lengths -> index( $top_i ) -> longlong -> bitnot -> cat( $top_i + 1 ) -> transpose -> qsortvec -> slice([], [ 0, TOP - 1 ]) )-> slice([ 0 ], []) -> inplace -> bitnot; say $result; say time - $t; __END__ [ [ 525 837799] [ 509 626331] ... [ 445 886953] [ 445 906175] [ 445 922524] [ 445 922525] ] 6.0809600353241 [download] So we have correct output and improved time. Good. Now to something more interesting -- let's add caching/looking-up. Because we are to use `$seqs` as index into `$lengths`, and indexing starts from 0, let's prepend a dummy 0th element. To kick-start indexing, and because value "2" is occupied to mark "BAD" i.e. sequence stopper, we'll add one more seed element to lengths. Further, lengths will now all start as BAD, and switched to computed values as we go: `[ BAD BAD BAD 3 4 5 6 ... ] # initial $seqs [ BAD 1 2 BAD 3 BAD BAD ... ] # initial $lengths` [download] (by the way, BAD in `$lengths` is still the default, for short, -32768) We'll also maintain `$current` lengths helper piddle, incremented by 1 or 2 depending on oddity mask of current sequences state. Observe, further, how `where` calls in list context are (over)-abused in code below. (I'm quite aware this code is no longer "a prose to read". Set MAX to 10 and dump primary piddles on each iteration to see what's going on. There are same 3. Other vars are masked views into them.) use strict; use warnings; use feature 'say'; use PDL; use Time::HiRes 'time'; my $t = time; use constant MAX => 1e6; use constant TOP => MAX < 20 ? MAX : 20; my $seqs = sequence( longlong, 1 + MAX ); $seqs-> setbadat( 0 ); $seqs-> setbadat( 1 ); $seqs-> badvalue( 2 ); my $lengths = ones( short, 1 + MAX ); $lengths-> inplace-> setvaltobad( 1 ); $lengths-> set( 1, 1 ); $lengths-> set( 2, 2 ); $lengths-> set( 4, 3 ); my $current = zeroes( short, 1 + MAX ); while ( any $seqs-> isgood ) { # sic my ( $seqs_odd, $current_odd_masked ) = where( $seqs, $current, $seqs & 1 ); $current_odd_masked ++; $current ++; ( $seqs_odd *= 3 ) ++; $seqs >>= 1; my ( $seqs_cap, $lengths_cap, $current_cap ) = where( $seqs, $lengths, $current, $seqs <= MAX ); my $lut = $lengths-> index( $seqs_cap ); # "_f" is for "finished" my ( $seqs_f, $lengths_f, $lut_f, $current_f ) = where( $seqs_cap, $lengths_cap, $lut, $current_cap, $lut-> isgood ); $lengths_f .= $lut_f + $current_f; $seqs_f .= 2; # i.e. BAD } $lengths-> badflag( 0 ); my $sorted_i = $lengths-> qsorti; my $sorted = $lengths-> index( $sorted_i ); my $value = $sorted-> at( MAX + 1 - TOP ); my $pos = vsearch_insert_leftmost( $value, $sorted ); my $top_i = $sorted_i-> slice([ MAX, $pos ]); ( my $result = $lengths -> index( $top_i ) -> longlong -> bitnot -> cat( $top_i ) -> transpose -> qsortvec -> slice([], [ 0, TOP - 1 ]) )-> slice([ 0 ], []) -> inplace -> bitnot; say $result; say time - $t; __END__ [ [ 525 837799] [ 509 626331] ... [ 445 886953] [ 445 906175] [ 445 922524] [ 445 922525] ] 2.88385105133057 [download] And that's (~2x faster) I'm afraid is as good as it will go. As I understand, cache hits are significantly more rare than with consequential element after element array processing. For comparison, with the same hardware, Laurent_R's final/polished caching solution runs at 1.63s here, if I disable use of "magic number" 400 in there ("magic" constants to crank up performance aren't fair:)), and at 0.88s otherwise. For 1e7 numbers, running time becomes ~65s, i.e. it gets impractical, like I said. Switching on parallel processing made no difference. Use of multiple cores can be observed for only ~first second, then "viewports" into piddles become increasingly fragmented, work can't be split. Maybe I did something wrong, and certainly someone can improve even if a little bit, but I'm glad this mini-project is finally off my shoulders.	[reply] [d/l] [select]
Re^2: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL fun) by marioroy (Prior) on Apr 21, 2020 at 21:02 UTC
Hi vr, Lots of PDL goodness. Wow! I do not know how this will perform unless taking a moment or two and give it a try. So here is a parallel version based on your 2nd example. use strict; use warnings; use feature 'say'; use PDL; use MCE::Flow; use MCE::Candy; use Time::HiRes 'time'; my $t = time; # PDL Quick Reference # https://www.perlmonks.org/?node_id=1214437 sub collatz_seq { my ( $chunk_id, $seq_beg, $seq_end ) = @_; my $max = $seq_end - $seq_beg + 2; my $top = $max < 20 ? $max : 20; my $seqs = pdl( longlong, 1, $seq_beg..$seq_end ); $seqs-> setbadat( 0 ); $seqs-> badvalue( 2 ); my $lengths = 1 + ones( short, $max ); $lengths-> set( 0, 1 ); while ( any my $good_mask = $seqs-> isgood ) { my ( $seqs_odd, $lengths_odd_masked ) = where( $seqs, $lengths, $seqs & 1 ); $lengths_odd_masked ++; $lengths-> where( $good_mask ) ++; ( $seqs_odd = 3 ) ++; $seqs >>= 1; } my $sorted_i = $lengths-> qsorti; my $sorted = $lengths-> index( $sorted_i ); my $value = $sorted-> at( $max - $top ); my $pos = vsearch_insert_leftmost( $value, $sorted ); my $top_i = $sorted_i-> slice([ $max - 1 , $pos ]); ( my $result = $lengths -> index( $top_i ) -> longlong -> bitnot -> cat( $top_i + 1 ) -> transpose -> qsortvec -> slice([], [ 0, $top - 1 ]) )-> slice([ 0 ], []) -> inplace -> bitnot; # From PDL to Perl: [ 0 1 ] becomes [ 1, 0 ], my $str = $result->string; $str =~ s/(\d+)\s+(\d+)(.)/$2,$1$3,/g; my $ret = eval $str; $_->[0] = $_->[0] + $seq_beg - 2 for @$ret; MCE->gather( $chunk_id, @$ret ); } my $size = shift \|\| 1e6; $size = 1e6 if $size < 1e6; # minimum $size = 1e9 if $size > 1e9; # maximum my $chunk_size = $size >> 5; my @seqs; mce_flow_s { max_workers => MCE::Util::get_ncpu(), chunk_size => $chunk_size > 100000 ? 100000 : $chunk_size, bounds_only => 1, gather => MCE::Candy::out_iter_array(\@seqs), }, sub { my ( $mce, $chunk_ref, $chunk_id ) = @_; collatz_seq( $chunk_id, @{ $chunk_ref } ); }, 2, $size; MCE::Flow->finish; @seqs = ( sort { $b->[1] <=> $a->[1]} @seqs )[ 0..19 ]; printf "Collatz(%5d) has sequence length of %3d steps\n", @$_ for @seqs; say {STDERR} time - $t; [download] Results:* I was expecting for mce_pdl2 using 1 core to be closer to vr_pdl2 than vr_pdl3. Maybe benefitting from CPU L2/L3 cache. Chunking seems to be the reason why mce_pdl2 (non-caching) ran nearly as fast as vr_pdl3 (caching). Below, includes 1 core testing for 1e7 with various chunk sizes. 1e7 testing vr_pdl3: 46.328s cache vr_pdl2: 1m16.058s non-cache mce_pdl2: 1m15.583s chunk_size => $size mce_pdl2: 1m03.709s chunk_size => $size >> 1 mce_pdl2: 52.352s chunk_size => $size >> 2 mce_pdl2: 51.323s chunk_size => $size >> 3 mce_pdl2: 49.396s chunk_size => $size >> 4 mce_pdl2: 48.195s chunk_size => $size >> 5 mce_pdl2: 48.369s chunk_size => $size >> 6 mce_pdl2: 48.501s chunk_size => $size >> 7 chunk_size => 300000 mce_pdl2: 48.195s 1 core mce_pdl2: 25.311s 2 cores mce_pdl2: 14.085s 4 cores mce_pdl2: 7.650s 8 cores mce_pdl2: 4.517s 16 cores mce_pdl2: 3.721s 32 cores chunk_size => 100000 mce_pdl2: 48.395s 1 core mce_pdl2: 25.402s 2 cores mce_pdl2: 13.163s 4 cores mce_pdl2: 6.860s 8 cores mce_pdl2: 3.850s 16 cores mce_pdl2: 2.347s 32 cores Output Collatz(8400511) has sequence length of 686 steps Collatz(8865705) has sequence length of 668 steps Collatz(6649279) has sequence length of 665 steps Collatz(9973919) has sequence length of 663 steps Collatz(6674175) has sequence length of 621 steps Collatz(7332399) has sequence length of 616 steps Collatz(7532665) has sequence length of 616 steps Collatz(5649499) has sequence length of 613 steps Collatz(8474249) has sequence length of 611 steps Collatz(6355687) has sequence length of 608 steps Collatz(8847225) has sequence length of 606 steps Collatz(9533531) has sequence length of 606 steps Collatz(6635419) has sequence length of 603 steps Collatz(9953129) has sequence length of 601 steps Collatz(7464846) has sequence length of 598 steps Collatz(7464847) has sequence length of 598 steps Collatz(3732423) has sequence length of 597 steps Collatz(5598635) has sequence length of 595 steps Collatz(8397953) has sequence length of 593 steps Collatz(6298465) has sequence length of 590 steps [download] 1e8 testing vr_pdl3: 9m44.667s cache vr_pdl2: 16m12.467s non-cache chunk_size => 300000 mce_pdl2: 9m06.078s 1 core mce_pdl2: 4m43.529s 2 cores mce_pdl2: 2m33.136s 4 cores mce_pdl2: 1m21.434s 8 cores mce_pdl2: 45.266s 16 cores mce_pdl2: 36.925s 32 cores chunk_size => 100000 mce_pdl2: 9m09.950s 1 core mce_pdl2: 4m39.677s 2 cores mce_pdl2: 2m24.230s 4 cores mce_pdl2: 1m13.353s 8 cores mce_pdl2: 37.923s 16 cores mce_pdl2: 20.099s 32 cores Output Collatz(63728127) has sequence length of 950 steps Collatz(95592191) has sequence length of 948 steps Collatz(96883183) has sequence length of 811 steps Collatz(86010015) has sequence length of 798 steps Collatz(98110761) has sequence length of 749 steps Collatz(73583070) has sequence length of 746 steps Collatz(73583071) has sequence length of 746 steps Collatz(36791535) has sequence length of 745 steps Collatz(55187303) has sequence length of 743 steps Collatz(56924955) has sequence length of 743 steps Collatz(82780955) has sequence length of 741 steps Collatz(85387433) has sequence length of 741 steps Collatz(63101607) has sequence length of 738 steps Collatz(64040575) has sequence length of 738 steps Collatz(93128574) has sequence length of 736 steps Collatz(93128575) has sequence length of 736 steps Collatz(94652411) has sequence length of 736 steps Collatz(96060863) has sequence length of 736 steps Collatz(46564287) has sequence length of 735 steps Collatz(69846431) has sequence length of 733 steps [download] Specifying 50000 for chunk size may run faster on 4/6/8 core machines. `8 cores mce_pdl2 1e8 16.660s chunk_size => 300000 mce_pdl2 1e8 10.471s chunk_size => 100000 mce_pdl2 1e8 9.773s chunk_size => 50000` [download] Regards, Mario	[reply] [d/l] [select]
Re^3: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by vr (Curate) on Apr 26, 2020 at 00:03 UTC
Sorry for delay with reply/update, first I got distracted with "let's try another optimization", then with other/unrelated things. I hope this topic is still warm :). My "caching" PDL solution (I'd prefer "looking-up" to "caching" or "memoization" in this case) is slow because, if vector is processed as a whole, cache hits happen less frequently i.e. after quite a few more useless steps. To illustrate with sequence from original challenge: 23 → 70 → 35 → 106 → 53 → 160 → 80 → 40 → 20 → 10 → 5 → 16 → 8 → 4 → 2 → 1 Number 70 reaches 35 in only one step but 35 itself is yet very far from completion. Simplifying, if 1..70 range was divided into e.g. 2 chunks processed independently one after another, then it would alleviate the problem. This "chunking" is very easy to add, just wrap the main loop into one external loop (labelled "CHUNKS"), which updates chunk boundaries and sets "views" into piddles, while internal loop (labelled "ITERATIONS") does exactly the same as in previous version. The `$current` piddle can now be of chunk size. The very first chunk includes dummy 0th element (size is CHUNK + 1), the final one can theoretically be as short as 1 single element. For 1e7 numbers, the computation time drops by factor of 3, from ~65 to ~21 seconds, and doesn't change much (variations stay at "noise" level) for chunk sizes ~2e4 .. ~2e5. Next, the idea was, that as soon as chunk is completed, e.g. chunk 11..20 with MAX = 100 and CHUNK = 10, it should be extremely cheap/easy to immediately get Collatz `$lengths` for even numbers in range 22..40, also marking these `$seqs` elements as BAD/completed. Though it would be almost as easily achieved "in due time", we don't need to consider/apply any masks at this early/immediate stage. Further, then it follows, we can simply set, in bulk and at the very beginning, each even `$seqs` element outside 1st chunk as BAD, even if CLs are unknown yet. 2 fragments marked with double '##' do exactly what previous paragraph says. They can be freely disabled for experiments; though the speed-up is very modest, just ~0.5s. Further "ideas"/considerations didn't result in palpable improvements (nor were pursued systematically). If it's cheap to shift-left indexes and increment CLs for each chunk as we go, what if we also populate and keep sparse (even only) CL LUT in range MAX..2MAX? Well, this range starts being populated relatively late, and if there was benefit of occasional additional cache hits, it seems to be cancelled out. No gain, no loss. The MAXLEN was introduced to check this. I tried to add masks/slices/bads for `$current`, so that e.g. the line `$current ++;` applies to valid subset only, especially since now all elements at even positions are dead weight from the very start, in chunks 2+. Well, unconditional "en masse" cheap operation appears to be faster, regardless of uselessly tackling "dead weight". I also considered variable chunk sizes (such as "begin with very short", etc.), but gave up. use strict; use warnings; use feature 'say'; use PDL; use List::Util; BEGIN { _min = \&List::Util::min; # collision _max = \&List::Util::max } # with PDL use constant MAX => 1e7; use constant TOP => _min( 20, MAX ); use constant CHUNK => _min( 8e4, MAX ); # but keep it even use constant MAXLEN => MAX * 1; # ?? # x(1..2) use Time::HiRes 'time'; my $t = time; my $seqs = sequence( longlong, 1 + MAX ); $seqs-> setbadat( 0 ); $seqs-> setbadat( 1 ); $seqs-> badvalue( 2 ); $seqs-> slice([ CHUNK + 2, MAX, 2]) .= 2 ## if CHUNK + 2 <= MAX; ## my $lengths = ones( short, 1 + MAXLEN ); $lengths-> inplace-> setvaltobad( 1 ); $lengths-> set( 1, 1 ); $lengths-> set( 2, 2 ); $lengths-> set( 4, 3 ); CHUNKS: for ( my $from = my $to = 0; $to != MAX; $from = $to + 1 ) { $to = _min( $from + CHUNK, MAX ); # "_c" is for "chunk" my $seqs_c = $seqs-> slice([ $from, $to ]); my $lengths_c = $lengths-> slice([ $from, $to ]); my $current = zeroes( short, nelem( $seqs_c )); ITERATIONS: while ( any $seqs_c-> isgood ) { my ( $seqs_c_odd, $current_odd_masked ) = where( $seqs_c, $current, $seqs_c & 1 ); $current_odd_masked ++; $current ++; ( $seqs_c_odd = 3 ) ++; $seqs_c >>= 1; my ( $seqs_cap, $lengths_cap, $current_cap ) = where( $seqs_c, $lengths_c, $current, $seqs_c <= MAXLEN ); my $lut = $lengths-> index( $seqs_cap ); # "_f" is for "finished" my ( $seqs_f, $lengths_f, $lut_f, $current_f ) = where( $seqs_cap, $lengths_cap, $lut, $current_cap, $lut-> isgood ); $lengths_f .= $lut_f + $current_f; $seqs_f .= 2; # i.e. BAD } # "_e" is for "at even positions, ahead" ## ## # my $from_e = _max( $from 2, $to ) + 2; # bug ## my $from_e = $from == 0 ? $to + 2 : $from * 2; # fixed ## my $to_e = _min( $to * 2, MAXLEN ); ## ## ( $lengths-> slice([ $from_e, $to_e, 2 ]) ## .= $lengths-> slice([ $from_e / 2, $to_e / 2 ])) ++ ## if $from_e <= MAXLEN ## } # same finale $lengths-> badflag( 0 ); $lengths = $lengths-> slice([ 1, MAX ]); my $sorted_i = $lengths-> qsorti; my $sorted = $lengths-> index( $sorted_i ); my $value = $sorted-> at( MAX - TOP ); my $pos = vsearch_insert_leftmost( $value, $sorted ); my $top_i = $sorted_i-> slice([ MAX - 1, $pos ]); ( my $result = $lengths -> index( $top_i ) -> longlong -> bitnot -> cat( $top_i + 1 ) -> transpose -> qsortvec -> slice([], [ 0, TOP - 1 ]) )-> slice([ 0 ], []) -> inplace -> bitnot; say $result; say time - $t; __END__ [download] Edit (bug fix). :(( With "dummy 0th element prepended", my intention for chunks was e.g. "0-10,11-20,21-30, ..., i.e. 1st is one longer. Then I fooled with "better presentation", from "infinite" while loop, to while loop with explicit condition, to for loop, with slightly different formulas for from,to,from_e,to_e, and messed up. Some even lengths aren't calculated, as seen with MAX and CHUNK e.g. 10 and 4. Easiest fix now is to leave chunks all equal (CHUNK+1); one LOC fixed (see "fixed"), above. Sorry.	[reply] [d/l] [select]
Re^4: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by marioroy (Prior) on Apr 26, 2020 at 07:46 UTC
Re^5: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by marioroy (Prior) on Apr 26, 2020 at 22:14 UTC
Some notes below your chosen depth have not been shown here
Re: Optimizing with Caching vs. Parallelizing (MCE::Map) by Laurent_R (Canon) on Apr 14, 2020 at 11:16 UTC
Hi Nick, for some reason, I somehow missed your very interesting post and the rest of this thread until just now. In fact, a couple of days after I submitted my solution to the Perl Weekly Challenge and posted my blog post you refer to above, I figured out that my caching strategy was in fact quite inefficient: the program doesn't need to store in the cache the full sequence, it would be enough to just store the number of its items. And that reduces considerably the overhead of the cache. I did not try this change in Perl so far, but I did it with the Raku solution. Changing the caching strategy made the Raku program 6 times faster. It seems the changes I made are quite similar to what vr suggested. These results are described in another blog that I haven't completed yet, but will be published hopefully soon. Now that I see that you have brought this up, I feel compelled to implement this change in the Perl version. I don't have time right now, but will come back with my updated version and the relevant timings soon. (Update: see below.) Thank you anyway for your very interesting post and for the just as interesting discussion it triggered.	[reply]
A reply falls below the community's threshold of quality. You may see it by logging in.


Don't ask to ask, just ask
	PerlMonks