Mon cher ami Laurent_R recently blogged about his solution to the "extra credit" problem in the Perl Weekly Challenge #54. He showed a solution using memoizing, or caching, to reduce the number of repeated calculations made by a program.
I wondered about the strategy. Obviously calculating the sequences for numbers up to 1,000,000 without some optimization would be painfully or maybe unworkably slow. But the task looks computationintensive, so I wanted to see whether more cycles would be more beneficial than caching.
Here is the solution presented by Laurent:
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
use Data::Dumper;
use constant MAX => 300000;
my %cache;
sub collatz_seq {
my $input = shift;
my $n = $input;
my @result;
while ($n != 1) {
if (exists $cache{$n}) {
push @result, @{$cache{$n}};
last;
} else {
my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2;
push @result, $new_n;
$cache{$n} = [$new_n, @{$cache{$new_n}}]
if defined ($cache{$new_n}) and $n < MAX;
$n = $new_n;
}
}
$cache{$input} = [@result] if $n < MAX;
return @result;
}
my @long_seqs;
for my $num (1..1000000) {
my @seq = ($num, collatz_seq $num);
push @long_seqs, [ $num, scalar @seq] if scalar @seq > 400;
}
@long_seqs = sort { $b>[1] <=> $a>[1]} @long_seqs;
say "$_>[0]: $_>[1]" for @long_seqs[0..19];
# say "@{$cache{$long_seqs[0][0]}}";
This runs on my system pretty quickly:
real 0m22.596s
user 0m21.530s
sys 0m1.045s
Next I ran the following version using mce_map_s from MCE::Map. mce_map_s is an implementation of the parallelized map functionality provided by MCE::Map, optimized for sequences. Each worker is handed only the beginning and end of the chunk of the sequence it will process, and workers communicate amongst themselves to keep track of the overall task. When using mce_map_s, pass only the beginning and end of the sequence to process (also, optionally, the step interval and format).
use strict; use warnings; use feature 'say';
use Data::Dumper;
use MCE::Map;
my @output = mce_map_s {
my $input = $_;
my $n = $input;
my @result = $input;
while ( $n != 1 ) {
$n = $n % 2 ? 3 * $n + 1 : $n / 2;
push @result, $n;
}
return [ $input, scalar @result ];
} 1, 1000000;
MCE::Map>finish;
@output = sort { $b>[1] <=> $a>[1] } @output;
say sprintf('%s : length %s', $_>[0], $_>[1]) for @output[0..19];
This program, with no caching, runs on my system about five times faster (I have a total of 12 cores):
real 0m4.322s
user 0m27.992s
sys 0m0.170s
Notably, reducing the number of workers to just two still ran the program in less than half the time than Laurent's singleprocess memoized version. Even running with one process, with no cache, was faster. This is no doubt due to the fact MCE uses chunking by default. Even with one worker the list of one million numbers was split by MCE into chunks of 8,000.
Next, I implemented Laurent's cache strategy, but using MCE::Shared::Hash. I wasn't really surprised that the program then ran much slower than either previous version. The reason, of course, is that this task pretty much only makes use of the CPU, so while throwing more cycles at it it a huge boost, sharing data among the workers  precisely because the task is almost 100% CPUbound  only slows them down. Modern CPUs are very fast at crunching numbers.
I was curious about how busy the cache was in this case, so I wrapped the calls to assign to and read from the hash in Laurent's program in a sub so I could count them. The wrappers look like:
my %cache;
my $sets = my $gets = 0;
sub cache_has { $gets++; exists $cache{$_[0]} }
sub cache_set { $sets++; $cache{$_[0]} = $_[1] }
sub cache_get { $gets++; $cache{$_[0]} }
The result:
Sets: 659,948
Gets: 16,261,635
That's a lot of back and forth.
So the moral of the story is that while caching is often useful when you are going to make the same calculations over and over, sometimes the cost of the caching exceeds the cost of just making the calculations repeatedly.
Hope this is of interest!
Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
by marioroy (Vicar) on Apr 06, 2020 at 05:55 UTC

Hi 1nickt,
One may simulate extra CPU time to better understand the benefit of caching. I converted Laurent_R's example to consume multiple CPU cores. Laurent's example incurs extra overhead due to calling a subroutine inside the loop and returning a full list. I changed the number of iterations from 1 million to 100 thousand now that simulating extra CPU cycles.
Nick's example
use strict;
use warnings;
use feature 'say';
use MCE::Util;
use MCE::Map max_workers => MCE::Util::get_ncpu();
use Time::HiRes 'usleep';
my @output = mce_map_s {
my $input = $_;
my $n = $input;
my @result = $input;
while ( $n != 1 ) {
$n = $n % 2 ? 3 * $n + 1 : $n / 2;
usleep 20; # simulate extra computation
push @result, $n;
}
return [ $input, scalar @result ];
} 1, 100000;
MCE::Map>finish;
@output = sort { $b>[1] <=> $a>[1] } @output;
say sprintf('%s : length %s', $_>[0], $_>[1]) for @output[0..19];
Laurent's example  no caching
use strict;
use warnings;
use feature qw/say/;
use MCE::Util;
use MCE::Map max_workers => MCE::Util::get_ncpu();
use Time::HiRes qw/usleep/;
sub collatz_seq {
my $n = shift;
my @result;
while ($n != 1) {
my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2;
usleep 20; # simulate extra computation
push @result, $new_n;
$n = $new_n;
}
return @result;
}
my @long_seqs = mce_map_s {
my $num = $_;
my @seq = ($num, collatz_seq $num);
return [ $num, scalar @seq ];
} 1, 100000;
MCE::Map>finish;
@long_seqs = sort { $b>[1] <=> $a>[1] } @long_seqs;
say "$_>[0] : length $_>[1]" for @long_seqs[0..19];
Laurent's example  with caching
use strict;
use warnings;
use feature qw/say/;
use constant MAX => 300000;
use MCE::Util;
use MCE::Map max_workers => MCE::Util::get_ncpu();
use Time::HiRes qw/usleep/;
my %cache;
sub collatz_seq {
my $input = shift;
my $n = $input;
my @result;
while ($n != 1) {
if (exists $cache{$n}) {
push @result, @{ $cache{$n} };
last;
}
my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2;
usleep 20; # simulate extra computation
push @result, $new_n;
$cache{$n} = [ $new_n, @{ $cache{$new_n} } ]
if $n < MAX && exists $cache{$new_n};
$n = $new_n;
}
$cache{$input} = \@result if $n < MAX;
return @result;
}
my @long_seqs = mce_map_s {
my $num = $_;
my @seq = ($num, collatz_seq $num);
return [ $num, scalar @seq ];
} 1, 100000;
MCE::Map>finish;
@long_seqs = sort { $b>[1] <=> $a>[1] } @long_seqs;
say "$_>[0] : length $_>[1]" for @long_seqs[0..19];
Program output
77031 : length 351
52527 : length 340
78791 : length 338
60975 : length 335
87087 : length 333
88059 : length 333
91463 : length 333
63387 : length 330
95081 : length 328
99067 : length 328
99721 : length 328
71310 : length 325
71311 : length 325
74791 : length 325
74793 : length 325
35655 : length 324
53483 : length 322
56095 : length 322
80225 : length 320
81159 : length 320
Time to run with 12 cores using a Linux VM
Notice the extra 4 seconds user time (likely function overhead inside loop), but not felt in real time (wall clock time).
Nick's example: 1m32.958s real, 0m56.561s user  no caching
Laurent's example: 1m33.454s real, 1m00.491s user  no caching
Laurent's example: 0m18.761s real, 0m20.990s user  caching
Caching is helpful here (5x faster) because I added sleep to simulate more CPU cycles. However, caching is not helpful when the overhead involved is greater than the computation itself as demonstrated by 1nickt's comparison above.
Regards, Mario
 [reply] [d/l] [select] 
Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
by davido (Cardinal) on Apr 05, 2020 at 21:47 UTC

Hi Nick.
I wonder if you could use an algorithm that could actually benefit both from parallel processing and from caching.
If you take your data set and interleaf it /8 or /12 (for your 12 cores), and then do the caching within each interleaf, you will have to regenerate the cache for each worker, but each worker maintains its own cache so that there's no periteration write back.
iteration 1 iteration 2 iteration 3 iteration 4
worker 1 1 9 17 25
worker 2 2 10 18 26
worker 3 3 11 19 27
worker 4 4 12 20 28
worker 5 5 13 21 29
worker 6 6 14 22 30
worker 7 7 15 23 31
worker 8 8 16 24 32
So worker 1 gets 1, 9, 17, 25, ... to work on. Worker 2 gets 2, 10, 18, 26. Worker 3 gets 3, 11, 19, 27. And so on.
Within each worker implement caching. If I were using Parallel::ForkManager I would have them each return a structure identifying the worker starting offset, and the sequences. That way you don't have to compose them back into a master structure at the end for output, you can sequence them roundrobin based on the starting offset of each.
I've sat down a few times today with the intent to give it a try, but keep getting interrupted before I get very far into it. So it's probably not going to happen.
Interleafing ought to allow each worker to build a cache pretty quickly.
 [reply] 

Hi davido,
I did indeed try that approach, and while I did not interleave the sequence, as I mentioned MCE chunks it up. (I don't think interleaving would really much affect the cache hits, as the caching tested stores the sequence for each number found, and the algorithm produces sequences that themselves include numbers greater than n, so such a boundary would be difficult to enforce.)
In testing, with each worker building and using its own cache for the chuck processed, the program ran almost three times slower than just letting the workers hammer the CPU. Again, I believe it's because in this case the overhead of caching outweighs its benefits.
The way forward always starts with a minimal test.
 [reply] [d/l] 
Re: Optimizing with Caching vs. Parallelizing (MCE::Map) (Better caching?)
by vr (Curate) on Apr 07, 2020 at 12:43 UTC

FWIW, here's a better singlethreaded caching algorithm, which at my machine with 4 puny little cores outperforms others I checked:
use strict;
use warnings;
use feature 'say';
use Data::Dump 'dd';
use Time::HiRes 'time';
my $t = time;
my %cache = ( 1 => [ 1, 1, undef ]);
#
# [ seq_length, start_num (same as key), next_num ]
#
sub _cache_collatz {
my $n = shift;
return if exists $cache{ $n };
my ( @tmp , $new_n );
while ( !exists $cache{ $n }) {
$new_n = $n % 2 ? 3 * $n + 1 : $n >> 1;
push @tmp, $cache{ $n } = [ $#tmp, $n, $new_n ];
$n = $new_n
}
my $d = $#tmp + $cache{ $n }[ 0 ];
$_> [ 0 ] += $d for @tmp
}
sub collatz {
my $n = shift;
_cache_collatz( $n );
my @a = $n;
push @a, $n while $n = $cache{ $n }[ 2 ];
return \@a;
}
## Demo
#
# dd collatz( 23 );
# dd \%cache;
# exit( 0 );
#
##
_cache_collatz( $_ ) for 1 .. 1e6;
say "@$_[ 1, 0 ]"
for ( sort { $b> [ 0 ] <=> $a> [ 0 ]}
@cache{ 1 .. 1e6 })[ 0 .. 19 ];
say time  $t;
__END__
837799 525
626331 509
939497 507
704623 504
910107 476
927003 476
511935 470
767903 468
796095 468
970599 458
546681 452
818943 450
820022 450
820023 450
410011 449
615017 447
886953 445
906175 445
922524 445
922525 445
6.44625306129456
For comparison:
Apparently, the last one will beat my "better caching" if run on 6 or more cores, but, like I said, I have only 4 :)
 [reply] [d/l] 

Hi vr,
Wow! Curiosity moved me to try with MCE. But first, I optimized Nick's code. Also, workers now send only the relevant data and not the entire list. This decreases the time needed for the manager process to output results.
Update 1: Slight optimization to Choroba's example (i.e. >> 1). Added collatz_compress.
Update 2: Added collatz_count.
Update 3: Improved collatz_count.
Update 4: Added Inline C times.
Demo choroba, 1nickt, compress and count all in one.
use strict;
use warnings;
use feature 'say';
use MCE::Flow;
sub collatz_choroba {
my ($n) = @_;
my @seq = $n;
push @seq, ( $seq[1] >> 1, 3 * $seq[1] + 1 )[ $seq[1] % 2 ]
while $seq[1] != 1;
return scalar @seq;
}
sub collatz_1nickt {
my ($n) = @_;
my @seq = $n;
push @seq, $n = $n % 2 ? 3 * $n + 1 : $n >> 1
while $n != 1;
return scalar @seq;
}
sub collatz_compress {
my ($n) = @_;
my @seq = $n;
# Notation and compressed dynamics (2:15 into the video)
# https://www.youtube.com/watch?v=t1I9uHF9X5Y
# consider the map
# C(x) = $n % 2 ? 3 * $n + 1 : $n / 2
# compress the map to the dynamically equivalent
# T(x) = $n % 2 ? (3 * $n + 1) / 2 : $n / 2
# compress loop iteration when odd sequence
push @seq, $n % 2 ? ( $n = 3 * $n + 1, $n = $n >> 1 )
: ( $n = $n >> 1 )
while $n != 1;
return scalar @seq;
}
sub collatz_count {
my ($n) = @_;
my $steps = 1;
# count using the T(x) notation
$n % 2 ? ( $steps += 2, $n = (3 * $n + 1) >> 1 )
: ( $steps += 1, $n = $n >> 1 )
while $n != 1;
return $steps;
}
#*collatz = \&collatz_choroba; # choose collatz here
#*collatz = \&collatz_1nickt;
#*collatz = \&collatz_compress;
*collatz = \&collatz_count;
my @sizes;
MCE::Flow>init(
max_workers => MCE::Util::get_ncpu(),
gather => \@sizes,
bounds_only => 1,
);
mce_flow_s sub {
my ($start, $end) = @{ $_[1] };
my @ret = map { [ $_, collatz($_) ] } $start .. $end;
MCE>gather( ( sort { $b>[1] <=> $a>[1] } @ret )[ 0..19 ] );
}, 1, 1e6;
MCE::Flow>finish;
say "@$_"
for ( sort { $b>[1] <=> $a>[1] } @sizes )[ 0..19 ];
Demo caching by vr
use strict;
use warnings;
use feature 'say';
use MCE::Flow;
my %cache = ( 1 => [ 1, 1, undef ]);
#
# [ seq_length, start_num (same as key), next_num ]
#
sub collatz_vr {
my $n = shift;
return if exists $cache{ $n };
my ( @tmp, $new_n );
while ( !exists $cache{ $n } ) {
$new_n = $n % 2 ? 3 * $n + 1 : $n >> 1;
push @tmp, $cache{ $n } = [ $#tmp, $n, $new_n ];
$n = $new_n
}
my $d = $#tmp + $cache{ $n }[ 0 ];
$_>[ 0 ] += $d for @tmp;
}
my @sizes;
mce_flow_s {
max_workers => MCE::Util::get_ncpu(),
gather => \@sizes,
bounds_only => 1,
}, sub {
my ($start, $end) = @{ $_[1] };
collatz_vr($_) for $start .. $end;
MCE>gather(
( sort { $b>[0] <=> $a>[0] } @cache{ $start .. $end } )[ 0 .
+. 19 ]
);
}, 1, 1e6;
MCE::Flow>finish;
say "@$_[ 1, 0 ]"
for ( sort { $b>[0] <=> $a>[0] } @sizes )[ 0 .. 19 ];
Run time on a 32 core machine  SMT disabled
The times include launching Perl, loading modules, spawning workers, reaping workers, and output (~ 0.100 seconds). See the next post for the Inline C demonstration.
max_workers => 1
collatz_vr 3.617s 1.0x
collatz_choroba 14.896s 1.0x
collatz_1nickt 12.264s 1.0x
collatz_compress 8.932s 1.0x
collatz_count 7.021s 1.0x
collatz_count_c1 0.741s 1.0x Inline C
collatz_count_c2 0.625s 1.0x Inline C
max_workers => 2
collatz_vr 2.707s 1.3x
collatz_choroba 7.538s 2.0x
collatz_1nickt 6.141s 2.0x
collatz_compress 4.490s 2.0x
collatz_count 3.537s 2.0x
collatz_count_c1 0.398s 1.9x Inline C
collatz_count_c2 0.339s 1.8x Inline C
max_workers => 4
collatz_vr 1.906s 1.9x
collatz_choroba 3.948s 3.8x
collatz_1nickt 3.274s 3.7x
collatz_compress 2.365s 3.8x
collatz_count 1.873s 3.7x
collatz_count_c1 0.235s 3.2x Inline C
collatz_count_c2 0.203s 3.1x Inline C
max_workers => 8
collatz_vr 1.458s 2.5x
collatz_choroba 1.995s 7.5x
collatz_1nickt 1.655s 7.4x
collatz_compress 1.209s 7.4x
collatz_count 0.961s 7.3x
collatz_count_c1 0.149s 5.0x Inline C
collatz_count_c2 0.132s 4.7x Inline C
max_workers => 16
collatz_vr 0.976s 3.7x
collatz_choroba 1.034s 14.4x
collatz_1nickt 0.858s 14.3x
collatz_compress 0.637s 14.0x
collatz_count 0.509s 13.8x
collatz_count_c1 0.114s 6.5x Inline C
collatz_count_c2 0.105s 6.0x Inline C
max_workers => 32
collatz_vr 0.837s 4.3x
collatz_choroba 0.593s 25.1x
collatz_1nickt 0.497s 24.7x
collatz_compress 0.379s 23.6x
collatz_count 0.310s 22.6x
collatz_count_c1 0.112s 6.6x Inline C
collatz_count_c2 0.108s 5.8x Inline C
Interesting. For vr's demo, every worker starts with an empty cache. Meaning that workers do not have cached results from prior chunks. This is the reason not scaling versus the noncache demonstrations.
collatz_count (2 cores) reaches collatz_vr (1 core).
collatz_count (4 cores) reaches collatz_vr (4 cores).
Kind regards, Mario
 [reply] [d/l] [select] 

Hi,
There was one thing left to try and that is using Inline::C. I captured the top 20 for 1e9 and 1e10.
Update 1: Set chunk_size to reduce IPC. Tried staying within CPU cache by keeping @ret small. 1e10 now completes in 4 minutes (20% improvement).
Update 2: Added collatz_count_c2 using GCC compiler intrinsics.
use strict;
use warnings;
use feature 'say';
use MCE::Flow;
use Inline C => Config =>
CCFLAGSEX => 'O2 fomitframepointer', clean_after_build => 0;
use Inline C => <<'END_OF_C_CODE';
#include <stdlib.h>
#include <stdint.h>
#if defined(_WIN32)
#define strtoull _strtoui64
#endif
void collatz_count_c1( SV* _n )
{
uint64_t n, steps = 1; /* include 1 */
Inline_Stack_Vars;
#ifdef __LP64__
n = SvUV( _n );
#else
n = strtoull( SvPV_nolen( _n ), NULL, 10 );
#endif
/* count using the T(x) notation */
do {
n % 2 ? ( steps += 2, n = (3 * n + 1) >> 1 )
: ( steps += 1, n = n >> 1 );
} while ( n != 1 );
Inline_Stack_Reset;
Inline_Stack_Push( sv_2mortal( newSVuv( steps ) ) );
Inline_Stack_Done;
}
void collatz_count_c2( SV* _n )
{
uint64_t n, steps = 1; /* include 1 */
Inline_Stack_Vars;
#ifdef __LP64__
n = SvUV( _n );
#else
n = strtoull( SvPV_nolen( _n ), NULL, 10 );
#endif
/* based on GCC compiler intrinsics demonstration by Alex Shirley
+*/
/* https://stackoverflow.com/questions/38114205/ccollatzconjectu
+reoptimization */
/* https://www.go4expert.com/articles/builtingccfunctionsbuilti
+nclzt29238 */
if ( n % 2 == 0 ) {
steps += __builtin_ctz(n); /* account for all evens */
n >>= __builtin_ctz(n); /* always returns an odd */
}
/* when we enter we're always working on an odd number */
do {
n = 3 * n + 1;
steps += __builtin_ctz(n) + 1; /* account for odd and even */
n >>= __builtin_ctz(n); /* always returns an odd */
} while ( n != 1 );
Inline_Stack_Reset;
Inline_Stack_Push( sv_2mortal( newSVuv( steps ) ) );
Inline_Stack_Done;
}
END_OF_C_CODE
sub collatz_count {
my ($n) = @_;
my $steps = 1; # count 1
# count using the T(x) notation
$n % 2 ? ( $steps += 2, $n = (3 * $n + 1) >> 1 )
: ( $steps += 1, $n = $n >> 1 )
while $n != 1;
return $steps;
}
no warnings 'once';
#*collatz = \&collatz_count; # choose collatz here
#*collatz = \&collatz_count_c1; # using the T(x) notation
*collatz = \&collatz_count_c2; # using compiler intrinsics
my $m = shift  1e6;
my ( @sizes, $chunk_size );
$chunk_size = int( $m / MCE::Util::get_ncpu() / 40 + 1 );
$chunk_size += 1 if $chunk_size % 2;
MCE::Flow>init(
max_workers => MCE::Util::get_ncpu(),
chunk_size => $chunk_size,
gather => \@sizes,
bounds_only => 1,
);
mce_flow_s sub {
my ( $start, $end ) = @{ $_[1] }; my @ret;
for ( $start .. $end ) {
push @ret, [ $_, collatz($_) ];
@ret = ( sort { $b>[1] <=> $a>[1] } @ret )[ 0..19 ]
if @ret > 100;
}
( @ret > 20 )
? MCE>gather( ( sort { $b>[1] <=> $a>[1] } @ret )[ 0..19 ]
+)
: MCE>gather( @ret );
}, 1, $m;
MCE::Flow>finish;
say "@$_"
for ( sort { $b>[1] <=> $a>[1] } @sizes )[ 0..19 ];
Output
1..1e9
collatz_count 6m10.877s 32 cores Perl
collatz_count_c1 11m20.134s 1 core Inline C
collatz_count_c2 8m56.683s 1 core Inline C
collatz_count_c1 0m23.443s 32 cores 29.0x
collatz_count_c2 0m18.507s 32 cores 29.0x
670617279 987
848749995 977
954843745 972
954843751 972
716132809 969
537099606 966
537099607 966
268549803 965
805649409 964
805649410 964
805649411 964
805649415 964
402824705 963
604237057 961
604237058 961
604237059 961
302118529 960
906355586 959
906355587 959
906355588 959
1..1e10, 32 cores, Inline C
collatz_count_c1 4m00.323s
collatz_count_c2 3m09.774s
9780657630 1133
9780657631 1133
4890328815 1132
7335493223 1130
6189322407 1122
9283983611 1120
8140184737 1094
6105138553 1091
9157707830 1089
9157707831 1089
4578853915 1088
6868280873 1086
5151210655 1083
7726815983 1081
9282648843 1058
6961986633 1055
5221489974 1052
5221489975 1052
2610744987 1051
7832234962 1050
1..1e11, 32 cores, Inline C
collatz_count_c1 40m50.177s
collatz_count_c2 32m17.319s
75128138247 1229
84519155529 1224
85833820801 1224
63389366646 1221
63389366647 1221
64375365601 1221
31694683323 1220
95084049969 1219
95084049970 1219
95084049971 1219
96563048402 1219
96563048403 1219
47542024985 1218
48281524201 1218
71313037476 1216
71313037477 1216
71313037478 1216
71313037479 1216
72422286302 1216
72422286303 1216
Look at C go!
Regards, Mario
 [reply] [d/l] [select] 

use Time::HiRes 'time';
my $t = time;
# the whole script here
say time  $t;
MCE::Flow>finish;
say time  $t;
__END__
1 worker:
6.01482510566711
7.76495599746704
2 workers:
4.12953305244446
7.07751798629761
4 workers:
3.33010196685791
8.4802930355072
1st measurement approximately matches your output, but 1.5  2 seconds per worker to shutdown doesn't look OK to me.
For vr's demo, every worker starts with an empty cache. Meaning that workers do not have cached results from prior chunks. This is the reason not scaling as well versus the noncache demonstrations
In other words, lots and lots of work is needlessly duplicated.
_cache_collatz( $_ ) for 1 .. 1e6;
say scalar %cache;
%cache = ();
_cache_collatz( $_ ) for 1 + 4e5 .. 1e6;
say scalar %cache;
__END__
2168611
2168611
So, effectively, in situation with e.g. 10 workers, 4 junior workers, filling cache for ranges up to 400_000, are free to slack off and not gather their results at all, and there's large amount of overlap in work of 6 senior workers, too. For now, I have no solid idea how to parallelize this algorithm efficiently.
 [reply] [d/l] [select] 


 [reply] 
Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
by choroba (Archbishop) on Apr 06, 2020 at 09:32 UTC

> Obviously calculating the sequences for numbers up to 1,000,000 without some optimization would be painfully or maybe unworkably slow
Interestingly, the difference is about 15%, see my blogpost about the same task.
map{substr$_>[0],$_>[1]0,1}[\*{},3],[[]],[ref qr1,,1],[{}],[sub{}^*ARGV,3]
 [reply] [d/l] 

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use MCE::Util;
use MCE::Map max_workers => MCE::Util::get_ncpu();
sub collatz {
my ($start) = @_;
my @seq = $start;
push @seq, ($seq[1] / 2, 3 * $seq[1] + 1)[$seq[1] % 2]
while $seq[1] != 1;
return @seq
}
my @sizes = mce_map_s { [$_, scalar collatz($_)] } 1, 1e6;
say "@$_" for reverse +(sort { $b>[1] <=> $a>[1] } @sizes)[0 .. 19];
Before and after on a 6 core machine with hyperthreading  12 logical cores.
Serial 48.170 seconds
Parallel 9.361 seconds
Regards, Mario
 [reply] [d/l] [select] 

Interesting! Does mce_map_s return the elements in the corresponding order? As you can see, it's not needed here, so if there's a faster method that doesn't care about the order, we can also switch to that.
map{substr$_>[0],$_>[1]0,1}[\*{},3],[[]],[ref qr1,,1],[{}],[sub{}^*ARGV,3]
 [reply] [d/l] 



Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
by rjt (Curate) on Apr 19, 2020 at 23:31 UTC

my @seqlen = (1,1); # Memoize sequence length
my $top = 20; # Report this many of the top sequences
my @top = [ 1,1 ]; # Top $top sequences
my $upper = 1e6; # Upper limit starting term
my $mintop = 0; # Lowest value in @top
GetOptions('top=i' => \$top, 'upper=i' => \$upper);
Here is the main loop:
for (my $start = 3; $start < $upper; $start += 2) {
my ($n, $len) = ($start, 0);
while (! defined $seqlen[$n]) {
$len += 1 + $n % 2;
$n = $n % 2 ? (3*$n + 1)/2 : $n / 2;
}
$len += $seqlen[$n];
$seqlen[$start] = $len if $start < $upper * 2; # Cache
top($start => $len) if $len > $mintop and $start <
+= $upper;
top($n * 2 => $seqlen[$n] + 1) if $n < $upper/2 and $seqlen[$n]
+> $mintop;
}
I avoid function call overhead by making sure everything is done iteratively. As another optimization, instead of simply doing 3n+1 for odd numbers, I do 3n+1/2, and increment the sequence length by two instead of one. And finally, I'm able to skip evennumbered starts with a little creative arithmetic with my second call to top().
I decided to ask people to output the top 20, because that presents an interesting minichallenge by itself. Maintaining it naively by calling sort on a million elements at the end takes longer than the above loop, and sorting a 20item list repeatedly is even worse. Maintaining essentially a priority queue is much faster:
sub top {
my ($n, $len) = @_;
my $idx = first { $top[$_][1] < $len } 0..$#top;
splice @top, $idx, 0, [ $n, $len ];
pop @top if @top > $top;
$mintop = $top[1][1];
}
The above sub is O(n), so it's not as good as a heap implementation, but it's only called when there is definitely a new element to be inserted, thanks to a bit of bookkeeping in $mintop, so I opted to keep it simple.
Performance:
real 0m0.848s
user 0m0.835s
sys 0m0.012s
Purely for crude CPU comparison purposes, Laurent's solution in Re: Optimizing with Caching vs. Parallelizing (MCE::Map) runs in 1.57 sec on the same (virtual) machine.
use strict; use warnings; omitted for brevity.
 [reply] [d/l] [select] 

Many thanks for providing this wonderful challenge! If all goes at the pace it's rolling now, the fun may well continue into the month of May! Therefore, a warning: some people are at danger of gaining (or loosing) more than what they have bargained for!:)
I decided to ask people to output the top 20, because that presents an interesting minichallenge by itself
Well said, dear rjt, well said. But it seems it were YOU, who fell into the trap that you so cunningly crafted for poor innocent learners! The top 20 out of million Collatz sequences has an insidious property: there are six "445" lengths in 1e6, but only four in top20 (so, all six are in top22). Which to extract? Aha!
Well, the challenge did not clearly state how to order numbers with the same Collatz lengths (CL). But I think it is reasonable to assume, that, since numbers with longer CL are rated better/closer to top, then smaller numbers among producing same CL are to be valued more. Like: "Look! This brave little number commendably creates as long CL as that huge number! And this undeservedly huge number has only managed to generate so puny CL! Loser!"
At least, there must be some consistency in arranging results, don't you agree? In other words, I think that if CL column descends, then numbers column must ascend (for equal CLs). See ordering for CL 450, too.
Well, dear Perl users, I happened to notice this, because in another my (unpublished) solution I had to endure great pain in arranging top20 properly. Just how many of you, looking at output of (almost all) scripts in this and related 2 threads, have noticed, that top 20 are neatly ordered? And yet, this "natural" result comes at no cost at all, with Perl! Just appreciate what you so ungratefully consume!:)
To illustrate, rjt, here's, in parallel, output of your script and Laurent_R's with marioroy fixes:
Collatz(837799) has sequence length of 525 steps  837799: 525
Collatz(626331) has sequence length of 509 steps  626331: 509
Collatz(939497) has sequence length of 507 steps  939497: 507
Collatz(704623) has sequence length of 504 steps  704623: 504
Collatz(910107) has sequence length of 476 steps  910107: 476
Collatz(927003) has sequence length of 476 steps  927003: 476
Collatz(511935) has sequence length of 470 steps  511935: 470
Collatz(767903) has sequence length of 468 steps  767903: 468
Collatz(796095) has sequence length of 468 steps  796095: 468
Collatz(970599) has sequence length of 458 steps  970599: 458
Collatz(546681) has sequence length of 452 steps  546681: 452
Collatz(820022) has sequence length of 450 steps  818943: 450
Collatz(818943) has sequence length of 450 steps  820022: 450
Collatz(820023) has sequence length of 450 steps  820023: 450
Collatz(410011) has sequence length of 449 steps  410011: 449
Collatz(615017) has sequence length of 447 steps  615017: 447
Collatz(922526) has sequence length of 445 steps  886953: 445
Collatz(922526) has sequence length of 445 steps  906175: 445
Collatz(886953) has sequence length of 445 steps  922524: 445
Collatz(906175) has sequence length of 445 steps  922525: 445
But wait... What's that??? The 922526 number is listed twice, on the left?? Is this... can't believe... is this because of infamous What Every Computer Scientist Should Know About FloatingPoint Arithmetic? Because someone:) decided to cut corners and resort to FP? Or is it for another reason? Didn't investigate yet. Just who would have thought that this would surface in so innocent task "calculate lengths of Collatz sequences". Wow! Great challenge!
Edit. No, of course it's not floating point issue. Algo is broken. CLs for odd numbers are cached, but for even numbers they are not. Some even numbers are never passed to &top (e.g. 922524), others are pumped into this subroutine several times. The 922526 hadn't been phased out from @top by odd numbers with longer CLs. With $top large enough, there are many even dupes.
 [reply] [d/l] 

Ha! Good eye. I didn't even notice the doubledup 922526. It's not a FP bug. Rather, it's a corner case thanks to how I combined the /2 optimization + memoization; certain even numbers get added to the p.queue twice. It can be fixed trivially by either removing the /2 optimization (simpler, ~5% penalty), or skipping seen numbers in the second call to top() (no measurable penalty, adds a variable).
As to your interpretation of the "top 20 arrangement," I like your discussion! We try to keep the task descriptions only quasiformal, to keep the challenge accessible to beginners, which is why you don't usually see these sorts of details specified like a requirements document. Meaning, many "minor" details are left to the discretion of the participants. The upshot of that is, if you submit a really weird interpretation, you'd probably net yourself a mildly amusing comment in my next review, at least. :)
use strict; use warnings; omitted for brevity.
 [reply] [d/l] [select] 

Hi rjt,
Thank you for this challenge. This consumed so much of my time in a great way. The reason is partly due to, "What if possible for many CPU cores?" But first made attempts for fast using 1 core. Below are the 3 progressive solutions, each one running faster.
Update: Added results from two machines.
Laurent's demonstration plus updates:
#!/usr/bin/env perl
use strict;
use warnings;
my $size = shift  1e6;
$size = 1e6 if $size < 1e6; # minimum
$size = 1e9 if $size > 1e9; # maximum
##
# Laurent's demonstration + updates
# https://www.perlmonks.org/?node_id=11115520
# https://www.perlmonks.org/?node_id=11115540
#
# Parallel solution
# https://www.perlmonks.org/?node_id=11115544
##
my @cache = (0, 1, 2);
my @seqs;
sub collatz_seq {
my $size = shift;
my ($n, $steps);
for my $input (2..$size) {
$n = $input, $steps = 0;
while ($n != 1) {
$steps += $cache[$n], last if defined $cache[$n];
$n % 2 ? ( $steps += 2, $n = (3 * $n + 1) >> 1 )
: ( $steps += 1, $n = $n >> 1 );
}
$cache[$input] = $steps if $input < $size;
push @seqs, [ $input, $steps ] if $steps > 400;
}
}
collatz_seq($size);
@seqs = ( sort { $b>[1] <=> $a>[1]} @seqs )[ 0..19 ];
printf "Collatz(%5d) has sequence length of %3d steps\n", @$_
for @seqs;
iM71's C++ demonstration converted to Perl plus updates:
#!/usr/bin/env perl
use strict;
use warnings;
my $size = shift  1e6;
$size = 1e6 if $size < 1e6; # minimum
$size = 1e9 if $size > 1e9; # maximum
##
# iM71's demonstration + applied T(x) notation and compression
# https://stackoverflow.com/a/55361008
# https://www.youtube.com/watch?v=t1I9uHF9X5Y (1 min into video)
#
# Parallel solution
# https://www.perlmonks.org/?node_id=11115780
##
my @cache = (0, 1, 2);
my @seqs;
sub collatz_seq {
my $size = shift;
my ($n, $steps);
for my $input (2..$size) {
$n = $input, $steps = 0;
$n % 2 ? ( $steps += 2, $n = (3 * $n + 1) >> 1 )
: ( $steps += 1, $n = $n >> 1 )
while $n != 1 && $n >= $input;
$cache[$input] = $steps += $cache[$n];
push @seqs, [ $input, $steps ] if $steps > 400;
}
}
collatz_seq($size);
@seqs = ( sort { $b>[1] <=> $a>[1]} @seqs )[ 0..19 ];
printf "Collatz(%5d) has sequence length of %3d steps\n", @$_
for @seqs;
Step counting using Inline C:
#!/usr/bin/env perl
use strict;
use warnings;
use Inline C => Config => CCFLAGSEX => 'O2 fomitframepointer';
use Inline C => <<'END_OF_C_CODE';
#include <stdint.h>
void num_steps_c( SV* _n, SV* _s )
{
uint64_t n, input;
int steps = 0;
n = input = SvUV(_n);
while ( n != 1 && n >= input ) {
n % 2 ? ( steps += 2, n = (3 * n + 1) >> 1 )
: ( steps += 1, n = n >> 1 );
}
sv_setuv(_n, n);
sv_setiv(_s, steps);
return;
}
END_OF_C_CODE
my $size = shift  1e6;
$size = 1e6 if $size < 1e6; # minimum
$size = 1e9 if $size > 1e9; # maximum
##
# iM71's demonstration + applied T(x) notation and compression
# https://stackoverflow.com/a/55361008
# https://www.youtube.com/watch?v=t1I9uHF9X5Y (1 min into video)
#
# Parallel solution
# https://www.perlmonks.org/?node_id=11115780
##
my @cache = (0, 1, 2);
my @seqs;
sub collatz_seq {
my $size = shift;
my ($n, $steps);
for my $input (2..$size) {
num_steps_c($n = $input, $steps);
$cache[$input] = $steps += $cache[$n];
push @seqs, [ $input, $steps ] if $steps > 400;
}
}
collatz_seq($size);
@seqs = ( sort { $b>[1] <=> $a>[1]} @seqs )[ 0..19 ];
printf "Collatz(%5d) has sequence length of %3d steps\n", @$_
for @seqs;
Results from two machines:
64bit VM:
rjt 0.903s
Laurent + updates 0.696s
iM71 + updates 0.602s
Step counting in C 0.273s (1st time involves compiling)
AMD 3970x:
rjt 0.635s
Laurent + updates 0.516s
iM71 + updates 0.467s
Step counting in C 0.191s (1st time involves compiling)
Collatz(837799) has sequence length of 525 steps
Collatz(626331) has sequence length of 509 steps
Collatz(939497) has sequence length of 507 steps
Collatz(704623) has sequence length of 504 steps
Collatz(910107) has sequence length of 476 steps
Collatz(927003) has sequence length of 476 steps
Collatz(511935) has sequence length of 470 steps
Collatz(767903) has sequence length of 468 steps
Collatz(796095) has sequence length of 468 steps
Collatz(970599) has sequence length of 458 steps
Collatz(546681) has sequence length of 452 steps
Collatz(818943) has sequence length of 450 steps
Collatz(820022) has sequence length of 450 steps
Collatz(820023) has sequence length of 450 steps
Collatz(410011) has sequence length of 449 steps
Collatz(615017) has sequence length of 447 steps
Collatz(886953) has sequence length of 445 steps
Collatz(906175) has sequence length of 445 steps
Collatz(922524) has sequence length of 445 steps
Collatz(922525) has sequence length of 445 steps
Regards, Mario
 [reply] [d/l] [select] 

This is great, Mario (and everyone else in this thread, for that matter)! The multicore work is fantastic. I'm very impressed by the level of interest and dedication this "little" question generated. Hopefully we can come up with a few more like it. (And anyone can suggest challenges, by the way.)
use strict; use warnings; omitted for brevity.
 [reply] [d/l] 

Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
by Laurent_R (Canon) on Apr 14, 2020 at 13:53 UTC

Hi dear fellow monks,
as a followup to my previous post, I implemented my new caching strategy, consisting in storing in the cache the length of the sequences rather than the full sequences.
On the computer where I'm running my tests right now, my original program has the following timing:
$ time perl collatz.pl
837799: 525
626331: 509
[lines omitted for brevity]
906175: 445
922524: 445
922525: 445
real 1m37,551s
user 1m9,375s
sys 0m21,031s
My laptop is obviously much slower than Nick's computer (where this program took 22 seconds).
This is now my first implementation with the new caching strategy:
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
use constant MAX => 1000000;
my %cache = (2 => 2);
sub collatz_seq {
my $input = shift;
my $n = $input;
my $result = 0;
while ($n != 1) {
if (exists $cache{$n}) {
$result += $cache{$n};
last;
} else {
my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2;
$result++;
$cache{$n} = $cache{$new_n} + 1
if defined $cache{$new_n} and $n < MAX;
$n = $new_n;
}
}
$cache{$input} = $result if $input < MAX;
return $result;
}
my @long_seqs;
for my $num (1..1000000) {
my $seq_length = collatz_seq $num;
push @long_seqs, [ $num, $seq_length ] if $seq_length > 400;
}
@long_seqs = sort { $b>[1] <=> $a>[1]} @long_seqs;
say "$_>[0]: $_>[1]" for @long_seqs[0..19];
This program produces the same outcome, but is nearly 3 times faster:
real 0m34,207s
user 0m34,108s
sys 0m0,124s
But we now end up with a cache having essentially one entry per input number in the 1..1_000_000 range. So, I thought, perhaps it might be better to use an array, rather than a hash, for the cache (accessing an array item should be faster than a hash lookup).
This is the code for this new implementation:
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
use constant MAX => 1000000;
my @cache = (0, 1, 2);
sub collatz_seq {
my $input = shift;
my $n = $input;
my $result = 0;
while ($n != 1) {
if (defined $cache[$n]) {
$result += $cache[$n];
last;
} else {
my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2;
$result++;
$cache[$n] = $cache[$new_n] + 1
if defined $cache[$new_n] and $n < MAX;
$n = $new_n;
}
}
$cache[$input] = $result if $input < MAX;
return $result;
}
my @long_seqs;
for my $num (1..1000000) {
my $seq_length = collatz_seq $num;
push @long_seqs, [ $num, $seq_length ] if $seq_length > 400;
}
@long_seqs = sort { $b>[1] <=> $a>[1]} @long_seqs;
say "$_>[0]: $_>[1]" for @long_seqs[0..19];
With this new implementation, we still obtain the same result, but the program is now more than 55 times faster than my original one (and almost 20 times faster than the solution using a hash for the cache):
$ time perl collatz3.pl
837799: 525
626331: 509
[Lines omitted for brevity]
922524: 445
922525: 445
real 0m1,755s
user 0m1,687s
sys 0m0,061s
I strongly suspected that using an array would be faster, but I frankly did not expect such a huge gain until I tested it.
So, it is true that throwing more CPU cores at the problem makes the solution faster (although to a limited extent with my computer that has only 4 cores). But using a better algorithm can often be a better solution.  [reply] [d/l] [select] 

Hi Laurent_R and fellow monks,
Nice speedup! I applied 3 optimizations in preparation for a followup post combining caching with parallelization. This was done because File::Map adds overhead (i.e. it will not be as fast as an array). So, I asked myself can anything be done to further improve the serial implementation. And if yes, is it helpful. I provide the run time for each stage at the end of this post. It turns out that improvements were possible and will be converting collatz3_d.pl to use File::Map for the parallel demonstration.
Update: Further improvements, Step 4.
1. Replaced division by 2.
$n >> 1;
2. Removed one level of branching.
while ($n != 1) {
$result += $cache[$n], last
if defined $cache[$n];
my $new_n = $n % 2 ? 3 * $n + 1 : $n >> 1;
$result++;
$cache[$n] = $cache[$new_n] + 1
if defined $cache[$new_n] and $n < $max;
$n = $new_n;
}
3. Then reduced the number of loop iterations. Credit for reducing the # of loop iterations was from watching Notation and compressed dynamics, one minute into it (i.e. the T(x) notation).
while ($n != 1) {
$result += $cache[$n], last
if defined $cache[$n];
$n % 2 ? ( $result += 2, $new_n = (3 * $n + 1) >> 1 )
: ( $result += 1, $new_n = $n >> 1 );
$cache[$n] = $cache[$new_n] + ($n % 2 ? 2 : 1)
if defined $cache[$new_n] and $n < $max;
$n = $new_n;
}
4. Finally, less caching.
while ($n != 1) {
$result += $cache[$n], last
if defined $cache[$n];
$n % 2 ? ( $result += 2, $new_n = (3 * $n + 1) >> 1 )
: ( $result += 1, $new_n = $n >> 1 );
$n = $new_n;
}
The final code optionally takes an argument.
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
my $max = shift  1e6;
$max = 1e6 if $max < 1e6;
my @cache = (0, 1, 2);
sub collatz_seq {
my $input = shift;
my $n = $input;
my $result = 0;
my $new_n;
while ($n != 1) {
$result += $cache[$n], last
if defined $cache[$n];
$n % 2 ? ( $result += 2, $new_n = (3 * $n + 1) >> 1 )
: ( $result += 1, $new_n = $n >> 1 );
$n = $new_n;
}
$cache[$input] = $result if $input < $max;
return $result;
}
my @long_seqs;
for my $num (1..$max) {
my $seq_length = collatz_seq $num;
push @long_seqs, [ $num, $seq_length ] if $seq_length > 400;
}
@long_seqs = sort { $b>[1] <=> $a>[1]} @long_seqs;
say "$_>[0]: $_>[1]" for @long_seqs[0..19];
Results:
$ time perl collatz3_a.pl 1e7
# Intel i7 laptop, Docker Container, Ubuntu + Perlbrew Perl 5.30.1
collatz3_a.pl 1e7 32.291s (a) original code, accepts argument
collatz3_b.pl 1e7 30.134s (b) a + replaced division with >> 1
collatz3_c.pl 1e7 28.503s (c) b + removed 1 level of branching
collatz3_d.pl 1e7 21.464s (d) c + reduced loop iterations
collatz3_e.pl 1e7 19.357s (e) d + less caching
# AMD 3970x, Docker Container, Ubuntu + Perlbrew Perl 5.30.1
collatz3_a.pl 1e7 13.130s (a) original code, accepts argument
collatz3_b.pl 1e7 12.394s (b) a + replaced division with >> 1
collatz3_c.pl 1e7 12.261s (c) b + removed 1 level of branching
collatz3_d.pl 1e7 9.170s (d) c + reduced loop iterations
collatz3_e.pl 1e7 7.681s (e) d + less caching
8400511: 686
8865705: 668
6649279: 665
9973919: 663
6674175: 621
7332399: 616
7532665: 616
5649499: 613
8474249: 611
6355687: 608
8847225: 606
9533531: 606
6635419: 603
9953129: 601
7464846: 598
7464847: 598
3732423: 597
5598635: 595
8397953: 593
6298465: 590
Regards, Mario
 [reply] [d/l] [select] 

Hi Mario,
thanks a lot for your comments and suggestions.
I did not expect such microoptimizations to bring such a significant performance improvement, that's interesting. Especially replacing the division by 2 by a bit shift is the type of thing that I had stopped doing decades ago, when I figured that the C compiler I was using at the time was doing this type of optimization at least as well and often better than I was able to do. Good to be reminded that the optimizations you suggested can also be quite useful. Thank you.
Following this very interesting thread and your new thread on MCE::Flow + Caching via File::Map, I made a new blog post on the Collatz sequence, trying to summarize some of the findings: http://blogs.perl.org/users/laurent_r/2020/04/revisitingthecollatzsequencepwc54.html. I did not try to explain your demonstration using File::Map for caching with parallel processing (your other thread) because I'm not sure to fully understand everything and was afraid of misrepresenting your work. People can follow the link and read your own words.
Thank you very much for your challenging ideas.
 [reply] [d/l] 

Re: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL fun)
by vr (Curate) on Apr 21, 2020 at 11:14 UTC

("long read":))
Here's an attempt to solve this Collatz challenge in purely vectorized fashion. As it (slowly) progressed, it soon became apparent the emerging result is somewhat impractical, merely a curiosity; now it's finished (i.e. I can't find how to improve it) and ready to amaze/amuse/frighten the public.
The "PDL" was mentioned in this (or 2 related) threads, but it served just as binary container, "set/at" having same role as "substr + pack/unpack", so it wasn't really "PDL", was it. This node sticks to original task as stated:
calculate the sequence length for all starting numbers up to 1000000 (1e6), and output the starting number and sequence length for the longest 20 sequences
The naive straightforward implementation is simple, it should read without much comment (as "prose":)):
use strict;
use warnings;
use feature 'say';
use PDL;
use Time::HiRes 'time';
my $t = time;
use constant MAX => 1e6;
use constant TOP => MAX < 20 ? MAX : 20;
my $seqs = 1 + sequence( longlong, MAX );
my $lengths = ones( short, MAX );
while ( any my $good_mask = $seqs> inplace
> setvaltobad( 1 )
> isgood ) {
my $odd_mask = $seqs & 1;
$lengths> where( $odd_mask ) ++;
$lengths> where( $good_mask ) ++;
( $seqs> where( $odd_mask ) *= 3 ) ++;
$seqs >>= 1;
}
my $top_i = $lengths> qsorti
> slice([ MAX  1, MAX  TOP ]);
say $lengths> index( $top_i )
> longlong
> cat( $top_i + 1 )
> transpose;
say time  $t;
__END__
[
[ 525 837799]
[ 509 626331]
...
[ 445 938143]
[ 445 906175]
[ 445 922525]
[ 445 922526]
]
7.98023009300232
Above, completed sequences are marked as "BAD", which is just agreedupon value, treated specially. (1) There are builtin methods to check for good/bad, it saves us comparisons while creating masks. (2) More important: BADs kind of taint whatever they interact with (cf. NaN in FP math), which is quite useful.
The timing is sloooooow, yet decent among noncaching solutions around here. Another issue stems from QuickSort being unstable (I recently ranted, in this thread, about neat ordering and "correct 20" extraction).
To fix the latter, there's qsortvec (and qsortveci companion) method to sort 2D array (as opposed to "qsort" for 1D, used above), i.e. 1st on 1st axis, then on 2nd. But here's dilemma: (1) build fullheight (2 x 1e6) array, qsortvec, extract top20. Possible, but, for speed, I'd prefer (2): qsort lengths (as above), extract "many enough" but close to 20, build small "2 x N" array, qsortvec, extract correct (and correctly arranged) top20.
For that, find value at MAX  TOP (445), look left, find how much to extract (22). More fuss: 1st column is to descend, 2nd to ascend  thus temporarily negate (flip bits) one of them. So, huge and unpleasantly looking new "tail" after main loop, in script below, is there to fix top20 extraction. But in fact it adds almost nothing to consumed time.
To improve speed, there are couple of tricks. (1) Where has neat ability to work in list assignment  same mask for several client piddles. (2) Marking 1's as "BAD" on each iteration is redundant. Piddle can be told to treat any value as bad, automatically. Sadly, setting this "any" to "1" won't work here, because $seqs & 1 would then result in something like [BAD 0 BAD 0 BAD 0 ...], regardless of having BADs in $seqs already. Let's mark "2" as bad, so stopper value in sequence would now be 2 instead of 1:
[ BAD BAD 3 4 5 ... ] # initial $seqs
[ 1 2 2 2 2 ... ] # initial $lengths
Other than that, it's the same noncaching approach, with original clarity and simplicity somewhat spoiled by fixes/optimizations:
use strict;
use warnings;
use feature 'say';
use PDL;
use Time::HiRes 'time';
my $t = time;
use constant MAX => 1e6;
use constant TOP => MAX < 20 ? MAX : 20;
my $seqs = 1 + sequence( longlong, MAX );
$seqs> setbadat( 0 );
$seqs> badvalue( 2 );
my $lengths = ones( short, MAX );
$lengths <<= 1;
$lengths> set( 0, 1 );
while ( any my $good_mask = $seqs> isgood ) {
my ( $seqs_odd, $lengths_odd_masked )
= where( $seqs, $lengths, $seqs & 1 );
$lengths_odd_masked ++;
$lengths> where( $good_mask ) ++;
( $seqs_odd *= 3 ) ++;
$seqs >>= 1;
}
my $sorted_i = $lengths> qsorti;
my $sorted = $lengths> index( $sorted_i );
my $value = $sorted> at( MAX  TOP );
my $pos = vsearch_insert_leftmost( $value, $sorted );
my $top_i = $sorted_i> slice([ MAX  1 , $pos ]);
( my $result = $lengths
> index( $top_i )
> longlong
> bitnot
> cat( $top_i + 1 )
> transpose
> qsortvec
> slice([], [ 0, TOP  1 ])
)> slice([ 0 ], [])
> inplace
> bitnot;
say $result;
say time  $t;
__END__
[
[ 525 837799]
[ 509 626331]
...
[ 445 886953]
[ 445 906175]
[ 445 922524]
[ 445 922525]
]
6.0809600353241
So we have correct output and improved time. Good.
Now to something more interesting  let's add caching/lookingup. Because we are to use $seqs as index into $lengths, and indexing starts from 0, let's prepend a dummy 0th element. To kickstart indexing, and because value "2" is occupied to mark "BAD" i.e. sequence stopper, we'll add one more seed element to lengths. Further, lengths will now all start as BAD, and switched to computed values as we go:
[ BAD BAD BAD 3 4 5 6 ... ] # initial $seqs
[ BAD 1 2 BAD 3 BAD BAD ... ] # initial $lengths
(by the way, BAD in $lengths is still the default, for short, 32768)
We'll also maintain $current lengths helper piddle, incremented by 1 or 2 depending on oddity mask of current sequences state. Observe, further, how where calls in list context are (over)abused in code below. (I'm quite aware this code is no longer "a prose to read". Set MAX to 10 and dump primary piddles on each iteration to see what's going on. There are same 3. Other vars are masked views into them.)
use strict;
use warnings;
use feature 'say';
use PDL;
use Time::HiRes 'time';
my $t = time;
use constant MAX => 1e6;
use constant TOP => MAX < 20 ? MAX : 20;
my $seqs = sequence( longlong, 1 + MAX );
$seqs> setbadat( 0 );
$seqs> setbadat( 1 );
$seqs> badvalue( 2 );
my $lengths = ones( short, 1 + MAX );
$lengths> inplace> setvaltobad( 1 );
$lengths> set( 1, 1 );
$lengths> set( 2, 2 );
$lengths> set( 4, 3 );
my $current = zeroes( short, 1 + MAX );
while ( any $seqs> isgood ) { # sic
my ( $seqs_odd, $current_odd_masked )
= where( $seqs, $current, $seqs & 1 );
$current_odd_masked ++;
$current ++;
( $seqs_odd *= 3 ) ++;
$seqs >>= 1;
my ( $seqs_cap, $lengths_cap, $current_cap )
= where( $seqs, $lengths, $current, $seqs <= MAX );
my $lut = $lengths> index( $seqs_cap );
# "_f" is for "finished"
my ( $seqs_f, $lengths_f, $lut_f, $current_f )
= where( $seqs_cap, $lengths_cap, $lut, $current_cap,
$lut> isgood );
$lengths_f .= $lut_f + $current_f;
$seqs_f .= 2; # i.e. BAD
}
$lengths> badflag( 0 );
my $sorted_i = $lengths> qsorti;
my $sorted = $lengths> index( $sorted_i );
my $value = $sorted> at( MAX + 1  TOP );
my $pos = vsearch_insert_leftmost( $value, $sorted );
my $top_i = $sorted_i> slice([ MAX, $pos ]);
( my $result = $lengths
> index( $top_i )
> longlong
> bitnot
> cat( $top_i )
> transpose
> qsortvec
> slice([], [ 0, TOP  1 ])
)> slice([ 0 ], [])
> inplace
> bitnot;
say $result;
say time  $t;
__END__
[
[ 525 837799]
[ 509 626331]
...
[ 445 886953]
[ 445 906175]
[ 445 922524]
[ 445 922525]
]
2.88385105133057
And that's (~2x faster) I'm afraid is as good as it will go. As I understand, cache hits are significantly more rare than with consequential element after element array processing. For comparison, with the same hardware, Laurent_R's final/polished caching solution runs at 1.63s here, if I disable use of "magic number" 400 in there ("magic" constants to crank up performance aren't fair:)), and at 0.88s otherwise.
For 1e7 numbers, running time becomes ~65s, i.e. it gets impractical, like I said. Switching on parallel processing made no difference. Use of multiple cores can be observed for only ~first second, then "viewports" into piddles become increasingly fragmented, work can't be split.
Maybe I did something wrong, and certainly someone can improve even if a little bit, but I'm glad this miniproject is finally off my shoulders.
 [reply] [d/l] [select] 

use strict;
use warnings;
use feature 'say';
use PDL;
use MCE::Flow;
use MCE::Candy;
use Time::HiRes 'time';
my $t = time;
# PDL Quick Reference
# https://www.perlmonks.org/?node_id=1214437
sub collatz_seq {
my ( $chunk_id, $seq_beg, $seq_end ) = @_;
my $max = $seq_end  $seq_beg + 2;
my $top = $max < 20 ? $max : 20;
my $seqs = pdl( longlong, 1, $seq_beg..$seq_end );
$seqs> setbadat( 0 );
$seqs> badvalue( 2 );
my $lengths = 1 + ones( short, $max );
$lengths> set( 0, 1 );
while ( any my $good_mask = $seqs> isgood ) {
my ( $seqs_odd, $lengths_odd_masked )
= where( $seqs, $lengths, $seqs & 1 );
$lengths_odd_masked ++;
$lengths> where( $good_mask ) ++;
( $seqs_odd *= 3 ) ++;
$seqs >>= 1;
}
my $sorted_i = $lengths> qsorti;
my $sorted = $lengths> index( $sorted_i );
my $value = $sorted> at( $max  $top );
my $pos = vsearch_insert_leftmost( $value, $sorted );
my $top_i = $sorted_i> slice([ $max  1 , $pos ]);
( my $result = $lengths
> index( $top_i )
> longlong
> bitnot
> cat( $top_i + 1 )
> transpose
> qsortvec
> slice([], [ 0, $top  1 ])
)> slice([ 0 ], [])
> inplace
> bitnot;
# From PDL to Perl: [ 0 1 ] becomes [ 1, 0 ],
my $str = $result>string;
$str =~ s/(\d+)\s+(\d+)(.*)/$2,$1$3,/g;
my $ret = eval $str;
$_>[0] = $_>[0] + $seq_beg  2 for @$ret;
MCE>gather( $chunk_id, @$ret );
}
my $size = shift  1e6;
$size = 1e6 if $size < 1e6; # minimum
$size = 1e9 if $size > 1e9; # maximum
my $chunk_size = $size >> 5;
my @seqs;
mce_flow_s {
max_workers => MCE::Util::get_ncpu(),
chunk_size => $chunk_size > 100000 ? 100000 : $chunk_size,
bounds_only => 1,
gather => MCE::Candy::out_iter_array(\@seqs),
}, sub {
my ( $mce, $chunk_ref, $chunk_id ) = @_;
collatz_seq( $chunk_id, @{ $chunk_ref } );
}, 2, $size;
MCE::Flow>finish;
@seqs = ( sort { $b>[1] <=> $a>[1]} @seqs )[ 0..19 ];
printf "Collatz(%5d) has sequence length of %3d steps\n", @$_
for @seqs;
say {*STDERR} time  $t;
Results:
I was expecting for mce_pdl2 using 1 core to be closer to vr_pdl2 than vr_pdl3. Maybe benefitting from CPU L2/L3 cache. Chunking seems to be the reason why mce_pdl2 (noncaching) ran nearly as fast as vr_pdl3 (caching). Below, includes 1 core testing for 1e7 with various chunk sizes.
1e7 testing
vr_pdl3: 46.328s cache
vr_pdl2: 1m16.058s noncache
mce_pdl2: 1m15.583s chunk_size => $size
mce_pdl2: 1m03.709s chunk_size => $size >> 1
mce_pdl2: 52.352s chunk_size => $size >> 2
mce_pdl2: 51.323s chunk_size => $size >> 3
mce_pdl2: 49.396s chunk_size => $size >> 4
mce_pdl2: 48.195s chunk_size => $size >> 5
mce_pdl2: 48.369s chunk_size => $size >> 6
mce_pdl2: 48.501s chunk_size => $size >> 7
chunk_size => 300000
mce_pdl2: 48.195s 1 core
mce_pdl2: 25.311s 2 cores
mce_pdl2: 14.085s 4 cores
mce_pdl2: 7.650s 8 cores
mce_pdl2: 4.517s 16 cores
mce_pdl2: 3.721s 32 cores
chunk_size => 100000
mce_pdl2: 48.395s 1 core
mce_pdl2: 25.402s 2 cores
mce_pdl2: 13.163s 4 cores
mce_pdl2: 6.860s 8 cores
mce_pdl2: 3.850s 16 cores
mce_pdl2: 2.347s 32 cores
Output
Collatz(8400511) has sequence length of 686 steps
Collatz(8865705) has sequence length of 668 steps
Collatz(6649279) has sequence length of 665 steps
Collatz(9973919) has sequence length of 663 steps
Collatz(6674175) has sequence length of 621 steps
Collatz(7332399) has sequence length of 616 steps
Collatz(7532665) has sequence length of 616 steps
Collatz(5649499) has sequence length of 613 steps
Collatz(8474249) has sequence length of 611 steps
Collatz(6355687) has sequence length of 608 steps
Collatz(8847225) has sequence length of 606 steps
Collatz(9533531) has sequence length of 606 steps
Collatz(6635419) has sequence length of 603 steps
Collatz(9953129) has sequence length of 601 steps
Collatz(7464846) has sequence length of 598 steps
Collatz(7464847) has sequence length of 598 steps
Collatz(3732423) has sequence length of 597 steps
Collatz(5598635) has sequence length of 595 steps
Collatz(8397953) has sequence length of 593 steps
Collatz(6298465) has sequence length of 590 steps
1e8 testing
vr_pdl3: 9m44.667s cache
vr_pdl2: 16m12.467s noncache
chunk_size => 300000
mce_pdl2: 9m06.078s 1 core
mce_pdl2: 4m43.529s 2 cores
mce_pdl2: 2m33.136s 4 cores
mce_pdl2: 1m21.434s 8 cores
mce_pdl2: 45.266s 16 cores
mce_pdl2: 36.925s 32 cores
chunk_size => 100000
mce_pdl2: 9m09.950s 1 core
mce_pdl2: 4m39.677s 2 cores
mce_pdl2: 2m24.230s 4 cores
mce_pdl2: 1m13.353s 8 cores
mce_pdl2: 37.923s 16 cores
mce_pdl2: 20.099s 32 cores
Output
Collatz(63728127) has sequence length of 950 steps
Collatz(95592191) has sequence length of 948 steps
Collatz(96883183) has sequence length of 811 steps
Collatz(86010015) has sequence length of 798 steps
Collatz(98110761) has sequence length of 749 steps
Collatz(73583070) has sequence length of 746 steps
Collatz(73583071) has sequence length of 746 steps
Collatz(36791535) has sequence length of 745 steps
Collatz(55187303) has sequence length of 743 steps
Collatz(56924955) has sequence length of 743 steps
Collatz(82780955) has sequence length of 741 steps
Collatz(85387433) has sequence length of 741 steps
Collatz(63101607) has sequence length of 738 steps
Collatz(64040575) has sequence length of 738 steps
Collatz(93128574) has sequence length of 736 steps
Collatz(93128575) has sequence length of 736 steps
Collatz(94652411) has sequence length of 736 steps
Collatz(96060863) has sequence length of 736 steps
Collatz(46564287) has sequence length of 735 steps
Collatz(69846431) has sequence length of 733 steps
Specifying 50000 for chunk size may run faster on 4/6/8 core machines.
8 cores
mce_pdl2 1e8 16.660s chunk_size => 300000
mce_pdl2 1e8 10.471s chunk_size => 100000
mce_pdl2 1e8 9.773s chunk_size => 50000
Regards, Mario
 [reply] [d/l] [select] 

Sorry for delay with reply/update, first I got distracted with "let's try another optimization", then with other/unrelated things. I hope this topic is still warm :).
My "caching" PDL solution (I'd prefer "lookingup" to "caching" or "memoization" in this case) is slow because, if vector is processed as a whole, cache hits happen less frequently i.e. after quite a few more useless steps. To illustrate with sequence from original challenge:
23 → 70 → 35 → 106 → 53 → 160 → 80 → 40 → 20 → 10 → 5 → 16 → 8 → 4 → 2 → 1
Number 70 reaches 35 in only one step but 35 itself is yet very far from completion. Simplifying, if 1..70 range was divided into e.g. 2 chunks processed independently one after another, then it would alleviate the problem.
This "chunking" is very easy to add, just wrap the main loop into one external loop (labelled "CHUNKS"), which updates chunk boundaries and sets "views" into piddles, while internal loop (labelled "ITERATIONS") does exactly the same as in previous version. The $current piddle can now be of chunk size. The very first chunk includes dummy 0th element (size is CHUNK + 1), the final one can theoretically be as short as 1 single element.
For 1e7 numbers, the computation time drops by factor of 3, from ~65 to ~21 seconds, and doesn't change much (variations stay at "noise" level) for chunk sizes ~2e4 .. ~2e5.
Next, the idea was, that as soon as chunk is completed, e.g. chunk 11..20 with MAX = 100 and CHUNK = 10, it should be extremely cheap/easy to immediately get Collatz $lengths for even numbers in range 22..40, also marking these $seqs elements as BAD/completed. Though it would be almost as easily achieved "in due time", we don't need to consider/apply any masks at this early/immediate stage. Further, then it follows, we can simply set, in bulk and at the very beginning, each even $seqs element outside 1st chunk as BAD, even if CLs are unknown yet.
2 fragments marked with double '##' do exactly what previous paragraph says. They can be freely disabled for experiments; though the speedup is very modest, just ~0.5s.
Further "ideas"/considerations didn't result in palpable improvements (nor were pursued systematically).
If it's cheap to shiftleft indexes and increment CLs for each chunk as we go, what if we also populate and keep sparse (even only) CL LUT in range MAX..2MAX? Well, this range starts being populated relatively late, and if there was benefit of occasional additional cache hits, it seems to be cancelled out. No gain, no loss. The MAXLEN was introduced to check this.
I tried to add masks/slices/bads for $current, so that e.g. the line $current ++; applies to valid subset only, especially since now all elements at even positions are dead weight from the very start, in chunks 2+. Well, unconditional "en masse" cheap operation appears to be faster, regardless of uselessly tackling "dead weight".
I also considered variable chunk sizes (such as "begin with very short", etc.), but gave up.
use strict;
use warnings;
use feature 'say';
use PDL;
use List::Util;
BEGIN { *_min = \&List::Util::min; # collision
*_max = \&List::Util::max } # with PDL
use constant MAX => 1e7;
use constant TOP => _min( 20, MAX );
use constant CHUNK => _min( 8e4, MAX ); # but keep it even
use constant MAXLEN => MAX * 1; # ?? # x(1..2)
use Time::HiRes 'time';
my $t = time;
my $seqs = sequence( longlong, 1 + MAX );
$seqs> setbadat( 0 );
$seqs> setbadat( 1 );
$seqs> badvalue( 2 );
$seqs> slice([ CHUNK + 2, MAX, 2]) .= 2 ##
if CHUNK + 2 <= MAX; ##
my $lengths = ones( short, 1 + MAXLEN );
$lengths> inplace> setvaltobad( 1 );
$lengths> set( 1, 1 );
$lengths> set( 2, 2 );
$lengths> set( 4, 3 );
CHUNKS:
for ( my $from = my $to = 0; $to != MAX; $from = $to + 1 ) {
$to = _min( $from + CHUNK, MAX );
# "_c" is for "chunk"
my $seqs_c = $seqs> slice([ $from, $to ]);
my $lengths_c = $lengths> slice([ $from, $to ]);
my $current = zeroes( short, nelem( $seqs_c ));
ITERATIONS:
while ( any $seqs_c> isgood ) {
my ( $seqs_c_odd, $current_odd_masked )
= where( $seqs_c, $current, $seqs_c & 1 );
$current_odd_masked ++;
$current ++;
( $seqs_c_odd *= 3 ) ++;
$seqs_c >>= 1;
my ( $seqs_cap, $lengths_cap, $current_cap )
= where( $seqs_c, $lengths_c, $current,
$seqs_c <= MAXLEN );
my $lut = $lengths> index( $seqs_cap );
# "_f" is for "finished"
my ( $seqs_f, $lengths_f, $lut_f, $current_f )
= where( $seqs_cap, $lengths_cap, $lut, $current_cap,
$lut> isgood );
$lengths_f .= $lut_f + $current_f;
$seqs_f .= 2; # i.e. BAD
}
# "_e" is for "at even positions, ahead" ##
##
# my $from_e = _max( $from * 2, $to ) + 2; # bug ##
my $from_e = $from == 0 ? $to + 2 : $from * 2; # fixed ##
my $to_e = _min( $to * 2, MAXLEN ); ##
##
( $lengths> slice([ $from_e, $to_e, 2 ]) ##
.= $lengths> slice([ $from_e / 2, $to_e / 2 ])) ++ ##
if $from_e <= MAXLEN ##
}
# same finale
$lengths> badflag( 0 );
$lengths = $lengths> slice([ 1, MAX ]);
my $sorted_i = $lengths> qsorti;
my $sorted = $lengths> index( $sorted_i );
my $value = $sorted> at( MAX  TOP );
my $pos = vsearch_insert_leftmost( $value, $sorted );
my $top_i = $sorted_i> slice([ MAX  1, $pos ]);
( my $result = $lengths
> index( $top_i )
> longlong
> bitnot
> cat( $top_i + 1 )
> transpose
> qsortvec
> slice([], [ 0, TOP  1 ])
)> slice([ 0 ], [])
> inplace
> bitnot;
say $result;
say time  $t;
__END__
Edit (bug fix). :(( With "dummy 0th element prepended", my intention for chunks was e.g. "010,1120,2130, ..., i.e. 1st is one longer. Then I fooled with "better presentation", from "infinite" while loop, to while loop with explicit condition, to for loop, with slightly different formulas for from,to,from_e,to_e, and messed up. Some even lengths aren't calculated, as seen with MAX and CHUNK e.g. 10 and 4. Easiest fix now is to leave chunks all equal (CHUNK+1); one LOC fixed (see "fixed"), above. Sorry.
 [reply] [d/l] [select] 


Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
by Laurent_R (Canon) on Apr 14, 2020 at 11:16 UTC

Hi Nick,
for some reason, I somehow missed your very interesting post and the rest of this thread until just now.
In fact, a couple of days after I submitted my solution to the Perl Weekly Challenge and posted my blog post you refer to above, I figured out that my caching strategy was in fact quite inefficient: the program doesn't need to store in the cache the full sequence, it would be enough to just store the number of its items. And that reduces considerably the overhead of the cache.
I did not try this change in Perl so far, but I did it with the Raku solution. Changing the caching strategy made the Raku program 6 times faster. It seems the changes I made are quite similar to what vr suggested. These results are described in another blog that I haven't completed yet, but will be published hopefully soon.
Now that I see that you have brought this up, I feel compelled to implement this change in the Perl version. I don't have time right now, but will come back with my updated version and the relevant timings soon. (Update: see below.)
Thank you anyway for your very interesting post and for the just as interesting discussion it triggered.  [reply] 
A reply falls below the community's threshold of quality. You may see it by logging in. 

