
### Optimizing with Caching vs. Parallelizing (MCE::Map)

by 1nickt (Canon) on Apr 05, 2020 at 15:17 UTC

Mon cher ami Laurent_R recently blogged about his solution to the "extra credit" problem in the Perl Weekly Challenge #54. He showed a solution using memoizing, or caching, to reduce the number of repeated calculations made by a program.

I wondered about the strategy. Obviously calculating the sequences for numbers up to 1,000,000 without some optimization would be painfully or maybe unworkably slow. But the task looks computation-intensive, so I wanted to see whether more cycles would be more beneficial than caching.
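For readers who haven't seen the task: a Collatz sequence repeatedly halves even numbers and maps odd n to 3n + 1 until it reaches 1, and the challenge asks for the starting values below 1,000,000 with the longest sequences (the scripts below report the top 20). A minimal sketch, in Python only to pin down the counting convention used throughout this thread (the length includes both the start and the final 1):

```python
def collatz_len(n):
    """Number of terms in the Collatz sequence starting at n,
    counting n itself and the final 1."""
    steps = 1
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps

# 6 -> 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1 is 9 terms
print(collatz_len(6))       # 9
print(collatz_len(837799))  # 525, the longest below one million
```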

Here is the solution presented by Laurent:

```
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
use Data::Dumper;
use constant MAX => 300000;

my %cache;

sub collatz_seq {
    my $input = shift;
    my $n = $input;
    my @result;
    while ($n != 1) {
        if (exists $cache{$n}) {
            push @result, @{$cache{$n}};
            last;
        } else {
            my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2;
            push @result, $new_n;
            $cache{$n} = [$new_n, @{$cache{$new_n}}]
                if defined ($cache{$new_n}) and $n < MAX;
            $n = $new_n;
        }
    }
    $cache{$input} = [@result] if $n < MAX;
    return @result;
}

my @long_seqs;
for my $num (1..1000000) {
    my @seq = ($num, collatz_seq $num);
    push @long_seqs, [ $num, scalar @seq ] if scalar @seq > 400;
}

@long_seqs = sort { $b->[1] <=> $a->[1] } @long_seqs;
say "$_->[0]: $_->[1]" for @long_seqs[0..19];
# say "@{$cache{$long_seqs[0][0]}}";
```

This runs on my system pretty quickly:

```
real    0m22.596s
user    0m21.530s
sys     0m1.045s
```

Next I ran the following version using mce_map_s from MCE::Map. mce_map_s is an implementation of the parallelized map functionality provided by MCE::Map, optimized for sequences. Each worker is handed only the beginning and end of the chunk of the sequence it will process, and workers communicate amongst themselves to keep track of the overall task. When using mce_map_s, pass only the beginning and end of the sequence to process (also, optionally, the step interval and format).

```
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
use MCE::Map;

my @output = mce_map_s {
    my $input = $_;
    my $n = $input;

    my @result = $input;

    while ( $n != 1 ) {
        $n = $n % 2 ? 3 * $n + 1 : $n / 2;
        push @result, $n;
    }

    return [ $input, scalar @result ];
} 1, 1000000;

MCE::Map->finish;

@output = sort { $b->[1] <=> $a->[1] } @output;

say sprintf('%s : length %s', $_->[0], $_->[1]) for @output[0..19];
```

This program, with no caching, runs on my system about five times faster (I have a total of 12 cores):

```
real    0m4.322s
user    0m27.992s
sys     0m0.170s
```

Notably, reducing the number of workers to just two still ran the program in less than half the time of Laurent's single-process memoized version. Even running with one worker, with no cache, was faster. This is no doubt due to the fact that MCE uses chunking by default: even with one worker, the list of one million numbers was split by MCE into chunks of 8,000.
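To see why chunking matters even for a single worker, consider the dispatch cost: instead of queueing a million individual items, the manager hands out one pair of bounds per chunk. A rough sketch of the idea (Python, using the chunk size of 8,000 mentioned above):

```python
def chunk_bounds(lo, hi, chunk_size):
    """Split the inclusive range lo..hi into (start, end) bound pairs."""
    bounds = []
    start = lo
    while start <= hi:
        end = min(start + chunk_size - 1, hi)
        bounds.append((start, end))
        start = end + 1
    return bounds

bounds = chunk_bounds(1, 1_000_000, 8_000)
# A worker receives one (start, end) pair per request: 125 dispatches
# instead of 1,000,000 individual items.
```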

Next, I implemented Laurent's cache strategy, but using MCE::Shared::Hash. I wasn't really surprised that the program then ran much slower than either previous version. The reason, of course, is that this task makes use of almost nothing but the CPU, so while throwing more cycles at it is a huge boost, sharing data among the workers - precisely because the task is almost 100% CPU-bound - only slows them down. Modern CPUs are very fast at crunching numbers.

I was curious about how busy the cache was in this case, so I wrapped the calls to assign to and read from the hash in Laurent's program in a sub so I could count them. The wrappers look like:

```
my %cache;
my $sets = my $gets = 0;
sub cache_has { $gets++; exists $cache{$_[0]} }
sub cache_set { $sets++; $cache{$_[0]} = $_[1] }
sub cache_get { $gets++; $cache{$_[0]} }
```

The result:

```
Sets:    659,948
Gets: 16,261,635
```

That's a lot of back and forth.

So the moral of the story is that while caching is often useful when you are going to make the same calculations over and over, sometimes the cost of the caching exceeds the cost of just making the calculations repeatedly.

Hope this is of interest!

Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
by marioroy (Prior) on Apr 06, 2020 at 05:55 UTC

Hi 1nickt,

One may simulate extra CPU time to better understand the benefit of caching. I converted Laurent_R's example to consume multiple CPU cores. Laurent's example incurs extra overhead due to calling a subroutine inside the loop and returning a full list. I changed the number of iterations from 1 million to 100 thousand now that we are simulating extra CPU cycles.

Nick's example

```
use strict;
use warnings;
use feature 'say';
use MCE::Util;
use MCE::Map max_workers => MCE::Util::get_ncpu();
use Time::HiRes 'usleep';

my @output = mce_map_s {
    my $input = $_;
    my $n = $input;

    my @result = $input;

    while ( $n != 1 ) {
        $n = $n % 2 ? 3 * $n + 1 : $n / 2;
        usleep 20; # simulate extra computation
        push @result, $n;
    }

    return [ $input, scalar @result ];
} 1, 100000;

MCE::Map->finish;

@output = sort { $b->[1] <=> $a->[1] } @output;

say sprintf('%s : length %s', $_->[0], $_->[1]) for @output[0..19];
```

Laurent's example - no caching

```
use strict;
use warnings;
use feature qw/say/;
use MCE::Util;
use MCE::Map max_workers => MCE::Util::get_ncpu();
use Time::HiRes qw/usleep/;

sub collatz_seq {
    my $n = shift;
    my @result;
    while ($n != 1) {
        my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2;
        usleep 20; # simulate extra computation
        push @result, $new_n;
        $n = $new_n;
    }
    return @result;
}

my @long_seqs = mce_map_s {
    my $num = $_;
    my @seq = ($num, collatz_seq $num);
    return [ $num, scalar @seq ];
} 1, 100000;

MCE::Map->finish;

@long_seqs = sort { $b->[1] <=> $a->[1] } @long_seqs;
say "$_->[0] : length $_->[1]" for @long_seqs[0..19];
```

Laurent's example - with caching

```
use strict;
use warnings;
use feature qw/say/;
use constant MAX => 300000;
use MCE::Util;
use MCE::Map max_workers => MCE::Util::get_ncpu();
use Time::HiRes qw/usleep/;

my %cache;

sub collatz_seq {
    my $input = shift;
    my $n = $input;
    my @result;
    while ($n != 1) {
        if (exists $cache{$n}) {
            push @result, @{ $cache{$n} };
            last;
        }
        my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2;
        usleep 20; # simulate extra computation
        push @result, $new_n;
        $cache{$n} = [ $new_n, @{ $cache{$new_n} } ]
            if $n < MAX && exists $cache{$new_n};
        $n = $new_n;
    }
    $cache{$input} = \@result if $n < MAX;
    return @result;
}

my @long_seqs = mce_map_s {
    my $num = $_;
    my @seq = ($num, collatz_seq $num);
    return [ $num, scalar @seq ];
} 1, 100000;

MCE::Map->finish;

@long_seqs = sort { $b->[1] <=> $a->[1] } @long_seqs;
say "$_->[0] : length $_->[1]" for @long_seqs[0..19];
```

Program output

```
77031 : length 351
52527 : length 340
78791 : length 338
60975 : length 335
87087 : length 333
88059 : length 333
91463 : length 333
63387 : length 330
95081 : length 328
99067 : length 328
99721 : length 328
71310 : length 325
71311 : length 325
74791 : length 325
74793 : length 325
35655 : length 324
53483 : length 322
56095 : length 322
80225 : length 320
81159 : length 320
```

Time to run with 12 cores using a Linux VM

Notice the extra 4 seconds user time (likely function overhead inside loop), but not felt in real time (wall clock time).

```
   Nick's example:  1m32.958s real, 0m56.561s user - no caching
Laurent's example:  1m33.454s real, 1m00.491s user - no caching
Laurent's example:  0m18.761s real, 0m20.990s user -    caching
```

Caching is helpful here (5x faster) because I added sleep to simulate more CPU cycles. However, caching is not helpful when the overhead involved is greater than the computation itself as demonstrated by 1nickt's comparison above.

Regards, Mario

Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
by davido (Cardinal) on Apr 05, 2020 at 21:47 UTC

Hi Nick.

I wonder if you could use an algorithm that could actually benefit both from parallel processing and from caching.

If you take your data set and interleave it by 8 or by 12 (for your 12 cores), and then do the caching within each interleaved slice, each worker will have to build its own cache from scratch, but because each worker maintains its own cache there's no per-iteration write back.

```
           iteration 1    iteration 2    iteration 3    iteration 4
worker 1       1               9              17              25
worker 2       2              10              18              26
worker 3       3              11              19              27
worker 4       4              12              20              28
worker 5       5              13              21              29
worker 6       6              14              22              30
worker 7       7              15              23              31
worker 8       8              16              24              32
```

So worker 1 gets 1, 9, 17, 25, ... to work on. Worker 2 gets 2, 10, 18, 26. Worker 3 gets 3, 11, 19, 27. And so on.

Within each worker implement caching. If I were using Parallel::ForkManager I would have them each return a structure identifying the worker starting offset, and the sequences. That way you don't have to compose them back into a master structure at the end for output, you can sequence them round-robin based on the starting offset of each.

I've sat down a few times today with the intent to give it a try, but keep getting interrupted before I get very far into it. So it's probably not going to happen.

Interleaving ought to allow each worker to build a cache pretty quickly.
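The interleaved assignment described above can be sketched in a few lines (Python here; worker numbering as in the table above):

```python
def interleaved(worker, n_workers, upper):
    """Starting numbers a given worker receives under round-robin dispatch."""
    return list(range(worker, upper + 1, n_workers))

# With 8 workers over 1..32, matching the table above:
print(interleaved(1, 8, 32))  # [1, 9, 17, 25]
print(interleaved(2, 8, 32))  # [2, 10, 18, 26]
```

Each worker would then keep a private cache keyed by n while walking its own slice, so no cross-worker writes are needed.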

Dave

Hi davido,

I did indeed try that approach, and while I did not interleave the sequence, as I mentioned MCE chunks it up. (I don't think interleaving would much affect the cache hits, as the caching tested stores the sequence for each number found, and the algorithm produces sequences that themselves include numbers greater than n, so such a boundary would be difficult to enforce.)

In testing, with each worker building and using its own cache for the chunk processed, the program ran almost three times slower than just letting the workers hammer the CPU. Again, I believe it's because in this case the overhead of caching outweighs its benefits.

The way forward always starts with a minimal test.
Re: Optimizing with Caching vs. Parallelizing (MCE::Map) (Better caching?)
by vr (Curate) on Apr 07, 2020 at 12:43 UTC

FWIW, here's a better single-threaded caching algorithm, which at my machine with 4 puny little cores outperforms others I checked:

```
use strict;
use warnings;
use feature 'say';
use Data::Dump 'dd';

use Time::HiRes 'time';
my $t = time;

my %cache = ( 1 => [ 1, 1, undef ] );
#
# [ seq_length, start_num (same as key), next_num ]
#

sub _cache_collatz {
    my $n = shift;
    return if exists $cache{ $n };
    my ( @tmp, $new_n );
    while ( !exists $cache{ $n } ) {
        $new_n = $n % 2 ? 3 * $n + 1 : $n >> 1;
        push @tmp, $cache{ $n } = [ -$#tmp, $n, $new_n ];
        $n = $new_n;
    }
    my $d = $#tmp + $cache{ $n }[ 0 ];
    $_->[ 0 ] += $d for @tmp;
}

sub collatz {
    my $n = shift;
    _cache_collatz( $n );
    my @a = $n;
    push @a, $n while $n = $cache{ $n }[ 2 ];
    return \@a;
}

## Demo
#
# dd collatz( 23 );
# dd \%cache;
# exit( 0 );
#
##

_cache_collatz( $_ ) for 1 .. 1e6;

say "@$_[ 1, 0 ]"
    for ( sort { $b->[ 0 ] <=> $a->[ 0 ] }
          @cache{ 1 .. 1e6 } )[ 0 .. 19 ];
say time - $t;

__END__

837799 525
626331 509
939497 507
704623 504
910107 476
927003 476
511935 470
767903 468
796095 468
970599 458
546681 452
818943 450
820022 450
820023 450
410011 449
615017 447
886953 445
906175 445
922524 445
922525 445
6.44625306129456
```

For comparison:

Apparently, the last one will beat my "better caching" if run on 6 or more cores, but, like I said, I have only 4 :)

Hi vr,

Wow! Curiosity moved me to try with MCE. But first, I optimized Nick's code. Also, workers now send only the relevant data and not the entire list. This decreases the time needed for the manager process to output results.

Update 1: Slight optimization to Choroba's example (i.e. >> 1). Added collatz_compress.
Update 3: Improved collatz_count.
Update 4: Added Inline C times.

Demo of the choroba, 1nickt, compress, and count variants, all in one.

```
use strict;
use warnings;
use feature 'say';
use MCE::Flow;

sub collatz_choroba {
    my ($n) = @_;
    my @seq = $n;

    push @seq, ( $seq[-1] >> 1, 3 * $seq[-1] + 1 )[ $seq[-1] % 2 ]
        while $seq[-1] != 1;

    return scalar @seq;
}

sub collatz_1nickt {
    my ($n) = @_;
    my @seq = $n;

    push @seq, $n = $n % 2 ? 3 * $n + 1 : $n >> 1
        while $n != 1;

    return scalar @seq;
}

sub collatz_compress {
    my ($n) = @_;
    my @seq = $n;

    # Notation and compressed dynamics (2:15 into the video)

    # consider the map
    # C(x) = $n % 2 ? 3 * $n + 1 : $n / 2

    # compress the map to the dynamically equivalent
    # T(x) = $n % 2 ? (3 * $n + 1) / 2 : $n / 2

    # compress loop iteration when odd sequence
    push @seq, $n % 2 ? ( $n = 3 * $n + 1, $n = $n >> 1 )
                      : ( $n = $n >> 1 )
        while $n != 1;

    return scalar @seq;
}

sub collatz_count {
    my ($n) = @_;
    my $steps = 1;

    # count using the T(x) notation
    $n % 2 ? ( $steps += 2, $n = (3 * $n + 1) >> 1 )
           : ( $steps += 1, $n = $n >> 1 )
        while $n != 1;

    return $steps;
}

#*collatz = \&collatz_choroba;   # choose collatz here
#*collatz = \&collatz_1nickt;
#*collatz = \&collatz_compress;
*collatz = \&collatz_count;

my @sizes;

MCE::Flow->init(
    max_workers => MCE::Util::get_ncpu(),
    gather      => \@sizes,
    bounds_only => 1,
);

mce_flow_s sub {
    my ($start, $end) = @{ $_[1] };
    my @ret = map { [ $_, collatz($_) ] } $start .. $end;
    MCE->gather( ( sort { $b->[1] <=> $a->[1] } @ret )[ 0..19 ] );
}, 1, 1e6;

MCE::Flow->finish;

say "@$_"
    for ( sort { $b->[1] <=> $a->[1] } @sizes )[ 0..19 ];
```

Demo caching by vr

```
use strict;
use warnings;
use feature 'say';
use MCE::Flow;

my %cache = ( 1 => [ 1, 1, undef ] );
#
# [ seq_length, start_num (same as key), next_num ]
#

sub collatz_vr {
    my $n = shift;
    return if exists $cache{ $n };

    my ( @tmp, $new_n );
    while ( !exists $cache{ $n } ) {
        $new_n = $n % 2 ? 3 * $n + 1 : $n >> 1;
        push @tmp, $cache{ $n } = [ -$#tmp, $n, $new_n ];
        $n = $new_n;
    }

    my $d = $#tmp + $cache{ $n }[ 0 ];
    $_->[ 0 ] += $d for @tmp;
}

my @sizes;

mce_flow_s {
    max_workers => MCE::Util::get_ncpu(),
    gather      => \@sizes,
    bounds_only => 1,
}, sub {
    my ($start, $end) = @{ $_[1] };
    collatz_vr($_) for $start .. $end;
    MCE->gather(
        ( sort { $b->[0] <=> $a->[0] } @cache{ $start .. $end } )[ 0 .. 19 ]
    );
}, 1, 1e6;

MCE::Flow->finish;

say "@$_[ 1, 0 ]"
    for ( sort { $b->[0] <=> $a->[0] } @sizes )[ 0 .. 19 ];
```

Run time on a 32 core machine - SMT disabled

The times include launching Perl, loading modules, spawning workers, reaping workers, and output (~ 0.100 seconds). See the next post for the Inline C demonstration.

```
max_workers => 1
collatz_vr         3.617s    1.0x
collatz_choroba   14.896s    1.0x
collatz_1nickt    12.264s    1.0x
collatz_compress   8.932s    1.0x
collatz_count      7.021s    1.0x
collatz_count_c1   0.741s    1.0x  Inline C
collatz_count_c2   0.625s    1.0x  Inline C

max_workers => 2
collatz_vr         2.707s    1.3x
collatz_choroba    7.538s    2.0x
collatz_1nickt     6.141s    2.0x
collatz_compress   4.490s    2.0x
collatz_count      3.537s    2.0x
collatz_count_c1   0.398s    1.9x  Inline C
collatz_count_c2   0.339s    1.8x  Inline C

max_workers => 4
collatz_vr         1.906s    1.9x
collatz_choroba    3.948s    3.8x
collatz_1nickt     3.274s    3.7x
collatz_compress   2.365s    3.8x
collatz_count      1.873s    3.7x
collatz_count_c1   0.235s    3.2x  Inline C
collatz_count_c2   0.203s    3.1x  Inline C

max_workers => 8
collatz_vr         1.458s    2.5x
collatz_choroba    1.995s    7.5x
collatz_1nickt     1.655s    7.4x
collatz_compress   1.209s    7.4x
collatz_count      0.961s    7.3x
collatz_count_c1   0.149s    5.0x  Inline C
collatz_count_c2   0.132s    4.7x  Inline C

max_workers => 16
collatz_vr         0.976s    3.7x
collatz_choroba    1.034s   14.4x
collatz_1nickt     0.858s   14.3x
collatz_compress   0.637s   14.0x
collatz_count      0.509s   13.8x
collatz_count_c1   0.114s    6.5x  Inline C
collatz_count_c2   0.105s    6.0x  Inline C

max_workers => 32
collatz_vr         0.837s    4.3x
collatz_choroba    0.593s   25.1x
collatz_1nickt     0.497s   24.7x
collatz_compress   0.379s   23.6x
collatz_count      0.310s   22.6x
collatz_count_c1   0.112s    6.6x  Inline C
collatz_count_c2   0.108s    5.8x  Inline C
```

Interesting. For vr's demo, every worker starts with an empty cache, meaning that workers do not have cached results from prior chunks. This is the reason it does not scale as well as the non-cache demonstrations.

collatz_count (2 cores) reaches collatz_vr (1 core).
collatz_count (4 cores) reaches collatz_vr (4 cores).

Kind regards, Mario

Hi,

There was one thing left to try and that is using Inline::C. I captured the top 20 for 1e9 and 1e10.

Update 1: Set chunk_size to reduce IPC. Tried staying within CPU cache by keeping @ret small. 1e10 now completes in 4 minutes (20% improvement).

Update 2: Added collatz_count_c2 using GCC compiler intrinsics.

```
use strict;
use warnings;
use feature 'say';
use MCE::Flow;

use Inline C => Config =>
    CCFLAGSEX => '-O2 -fomit-frame-pointer', clean_after_build => 0;

use Inline C => <<'END_OF_C_CODE';

#include <stdlib.h>
#include <stdint.h>

#if defined(_WIN32)
#define strtoull _strtoui64
#endif

void collatz_count_c1( SV* _n )
{
    uint64_t n, steps = 1;  /* include 1 */
    Inline_Stack_Vars;

#ifdef __LP64__
    n = SvUV( _n );
#else
    n = strtoull( SvPV_nolen( _n ), NULL, 10 );
#endif

    /* count using the T(x) notation */
    do {
        n % 2 ? ( steps += 2, n = (3 * n + 1) >> 1 )
              : ( steps += 1, n = n >> 1 );
    } while ( n != 1 );

    Inline_Stack_Reset;
    Inline_Stack_Push( sv_2mortal( newSVuv( steps ) ) );
    Inline_Stack_Done;
}

void collatz_count_c2( SV* _n )
{
    uint64_t n, steps = 1;  /* include 1 */
    Inline_Stack_Vars;

#ifdef __LP64__
    n = SvUV( _n );
#else
    n = strtoull( SvPV_nolen( _n ), NULL, 10 );
#endif

    /* based on GCC compiler intrinsics demonstration by Alex Shirley */
    /* https://stackoverflow.com/questions/38114205/c-collatz-conjecture-optimization */
    /* https://www.go4expert.com/articles/builtin-gcc-functions-builtinclz-t29238 */

    if ( n % 2 == 0 ) {
        steps += __builtin_ctz(n); /* account for all evens */
        n >>= __builtin_ctz(n);    /* always returns an odd */
    }

    /* when we enter we're always working on an odd number */
    do {
        n = 3 * n + 1;
        steps += __builtin_ctz(n) + 1; /* account for odd and even */
        n >>= __builtin_ctz(n);        /* always returns an odd */
    } while ( n != 1 );

    Inline_Stack_Reset;
    Inline_Stack_Push( sv_2mortal( newSVuv( steps ) ) );
    Inline_Stack_Done;
}

END_OF_C_CODE

sub collatz_count {
    my ($n) = @_;
    my $steps = 1;  # count 1

    # count using the T(x) notation
    $n % 2 ? ( $steps += 2, $n = (3 * $n + 1) >> 1 )
           : ( $steps += 1, $n = $n >> 1 )
        while $n != 1;

    return $steps;
}

no warnings 'once';

#*collatz = \&collatz_count;     # choose collatz here
#*collatz = \&collatz_count_c1;  # using the T(x) notation
*collatz = \&collatz_count_c2;   # using compiler intrinsics

my $m = shift || 1e6;
my ( @sizes, $chunk_size );

$chunk_size  = int( $m / MCE::Util::get_ncpu() / 40 + 1 );
$chunk_size += 1 if $chunk_size % 2;

MCE::Flow->init(
    max_workers => MCE::Util::get_ncpu(),
    chunk_size  => $chunk_size,
    gather      => \@sizes,
    bounds_only => 1,
);

mce_flow_s sub {
    my ( $start, $end ) = @{ $_[1] };
    my @ret;

    for ( $start .. $end ) {
        push @ret, [ $_, collatz($_) ];
        @ret = ( sort { $b->[1] <=> $a->[1] } @ret )[ 0..19 ]
            if @ret > 100;
    }

    ( @ret > 20 )
        ? MCE->gather( ( sort { $b->[1] <=> $a->[1] } @ret )[ 0..19 ] )
        : MCE->gather( @ret );
}, 1, $m;

MCE::Flow->finish;

say "@$_"
    for ( sort { $b->[1] <=> $a->[1] } @sizes )[ 0..19 ];
```
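The trailing-zero trick in collatz_count_c2 is language-neutral: after 3n + 1 the result is always even, so stripping all trailing zero bits at once replaces a whole run of halving iterations. A quick cross-check of that equivalence, sketched in Python, where `(n & -n).bit_length() - 1` plays the role of `__builtin_ctz`:

```python
def count_naive(n):
    """Step count per the thread's convention: include the start and the final 1."""
    steps = 1
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps

def count_ctz(n):
    """Same count, stripping runs of trailing zeros in one shot."""
    steps = 1
    tz = (n & -n).bit_length() - 1   # count trailing zeros
    steps += tz                      # account for all leading evens
    n >>= tz                         # now odd
    while n != 1:
        n = 3 * n + 1                # always even here
        tz = (n & -n).bit_length() - 1
        steps += tz + 1              # one odd step plus tz halvings
        n >>= tz                     # odd again
    return steps

assert all(count_ctz(i) == count_naive(i) for i in range(1, 10_000))
```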

Output

```
1..1e9

collatz_count      6m10.877s  32 cores  Perl

collatz_count_c1  11m20.134s   1 core   Inline C
collatz_count_c2   8m56.683s   1 core   Inline C

collatz_count_c1   0m23.443s  32 cores  29.0x
collatz_count_c2   0m18.507s  32 cores  29.0x

670617279 987
848749995 977
954843745 972
954843751 972
716132809 969
537099606 966
537099607 966
268549803 965
805649409 964
805649410 964
805649411 964
805649415 964
402824705 963
604237057 961
604237058 961
604237059 961
302118529 960
906355586 959
906355587 959
906355588 959

1..1e10, 32 cores, Inline C

collatz_count_c1   4m00.323s
collatz_count_c2   3m09.774s

9780657630 1133
9780657631 1133
4890328815 1132
7335493223 1130
6189322407 1122
9283983611 1120
8140184737 1094
6105138553 1091
9157707830 1089
9157707831 1089
4578853915 1088
6868280873 1086
5151210655 1083
7726815983 1081
9282648843 1058
6961986633 1055
5221489974 1052
5221489975 1052
2610744987 1051
7832234962 1050

1..1e11, 32 cores, Inline C

collatz_count_c1  40m50.177s
collatz_count_c2  32m17.319s

75128138247 1229
84519155529 1224
85833820801 1224
63389366646 1221
63389366647 1221
64375365601 1221
31694683323 1220
95084049969 1219
95084049970 1219
95084049971 1219
96563048402 1219
96563048403 1219
47542024985 1218
48281524201 1218
71313037476 1216
71313037477 1216
71313037478 1216
71313037479 1216
72422286302 1216
72422286303 1216
```

Look at C go!

Regards, Mario

Hi marioroy,

I think I'm having an issue with MCE on Windows here, timing your code (collatz_vr):

```
use Time::HiRes 'time';
my $t = time;

# the whole script here

say time - $t;
MCE::Flow->finish;
say time - $t;

__END__

1 worker:
6.01482510566711
7.76495599746704

2 workers:
4.12953305244446
7.07751798629761

4 workers:
3.33010196685791
8.4802930355072
```

The 1st measurement approximately matches your output, but 1.5 - 2 seconds per worker to shut down doesn't look OK to me.

> For vr's demo, every worker starts with an empty cache. Meaning that workers do not have cached results from prior chunks. This is the reason not scaling as well versus the non-cache demonstrations

In other words, lots and lots of work is needlessly duplicated.

```
_cache_collatz( $_ ) for 1 .. 1e6;
say scalar %cache;

%cache = ();
_cache_collatz( $_ ) for 1 + 4e5 .. 1e6;
say scalar %cache;

__END__

2168611
2168611
```

So, effectively, in a situation with e.g. 10 workers, the 4 junior workers, filling the cache for ranges up to 400_000, are free to slack off and not gather their results at all, and there's a large amount of overlap in the work of the 6 senior workers, too. For now, I have no solid idea how to parallelize this algorithm efficiently.

Impressive. ++

The way forward always starts with a minimal test.
Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
by choroba (Cardinal) on Apr 06, 2020 at 09:32 UTC
> Obviously calculating the sequences for numbers up to 1,000,000 without some optimization would be painfully or maybe unworkably slow

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

Hi choroba,

I tried using multiple cores for your demonstration. Perl has this capability. So little effort and practically transparent. :)

```
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use MCE::Util;
use MCE::Map max_workers => MCE::Util::get_ncpu();

sub collatz {
    my ($start) = @_;
    my @seq = $start;
    push @seq, ($seq[-1] / 2, 3 * $seq[-1] + 1)[$seq[-1] % 2]
        while $seq[-1] != 1;
    return @seq
}

my @sizes = mce_map_s { [$_, scalar collatz($_)] } 1, 1e6;

say "@$_" for reverse +(sort { $b->[1] <=> $a->[1] } @sizes)[0 .. 19];
```

Before and after on a 6 core machine with hyper-threading - 12 logical cores.

```
Serial    48.170 seconds
Parallel   9.361 seconds
```

Regards, Mario

Interesting! Does mce_map_s return the elements in the corresponding order? As you can see, it's not needed here, so if there's a faster method that doesn't care about the order, we can also switch to that.

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
by rjt (Curate) on Apr 19, 2020 at 23:31 UTC

Great discussions! I'm sorry I missed all this; I haven't had much time to visit here recently.

I'm the one who contributed the task, so I have some interest in this beyond my own personal fascination with the Collatz sequence.

My own solution uses memoization and a couple of other optimizations. The full code is at https://github.com/manwar/perlweeklychallenge-club/blob/master/challenge-054/ryan-thompson/perl/ch-2.pl, and my "review" of my solution is at https://perlweeklychallenge.org/blog/review-challenge-054/#ryan-thompson2. That page also contains links to, and reviews of, every solution submitted by the participants in the challenge that week. None of them used MCE::Map, which is a shame!

Data structures and bookkeeping were key to good performance on this one, for me:

```
my @seqlen = (-1,1);    # Memoize sequence length
my $top    = 20;        # Report this many of the top sequences
my @top    = [ -1,-1 ]; # Top $top sequences
my $upper  = 1e6;       # Upper limit starting term
my $mintop = 0;         # Lowest value in @top

GetOptions('top=i' => \$top, 'upper=i' => \$upper);
```

Here is the main loop:

```
for (my $start = 3; $start < $upper; $start += 2) {
    my ($n, $len) = ($start, 0);
    while (! defined $seqlen[$n]) {
        $len += 1 + $n % 2;
        $n = $n % 2 ? (3*$n + 1)/2 : $n / 2;
    }
    $len += $seqlen[$n];
    $seqlen[$start] = $len if $start < $upper * 2; # Cache
    top($start => $len)            if $len > $mintop  and $start <= $upper;
    top($n * 2 => $seqlen[$n] + 1) if $n < $upper/2 and $seqlen[$n] > $mintop;
}
```

I avoid function call overhead by making sure everything is done iteratively. As another optimization, instead of simply doing 3n+1 for odd numbers, I do (3n+1)/2 and increment the sequence length by two instead of one. And finally, I'm able to skip even-numbered starts with a little creative arithmetic in my second call to top().

I decided to ask people to output the top 20, because that presents an interesting mini-challenge by itself. Maintaining it naively by calling sort on a million elements at the end takes longer than the above loop, and sorting a 20-item list repeatedly is even worse. Maintaining essentially a priority queue is much faster:

```
sub top {
    my ($n, $len) = @_;

    my $idx = first { $top[$_][1] < $len } 0..$#top;
    splice @top, $idx, 0, [ $n, $len ];

    pop @top if @top > $top;
    $mintop = $top[-1][1];
}
```

The above sub is O(n), so it's not as good as a heap implementation, but it's only called when there is definitely a new element to be inserted, thanks to a bit of bookkeeping in $mintop, so I opted to keep it simple.
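For comparison, the heap variant alluded to above would look something like this sketch (Python's heapq as a stand-in; a size-k min-heap makes each insertion O(log k) instead of the O(k) scan-and-splice):

```python
import heapq

def top_k(pairs, k=20):
    """Return the k (length, n) pairs with the largest lengths, descending."""
    heap = []                       # min-heap; smallest kept length at heap[0]
    for n, length in pairs:
        if len(heap) < k:
            heapq.heappush(heap, (length, n))
        elif length > heap[0][0]:   # same role as the $mintop check
            heapq.heapreplace(heap, (length, n))
    return sorted(heap, reverse=True)

print(top_k([(1, 5), (2, 9), (3, 1), (4, 7)], k=2))  # [(9, 2), (7, 4)]
```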

Performance:

```
real    0m0.848s
user    0m0.835s
sys     0m0.012s
```

Purely for crude CPU comparison purposes, Laurent's solution in Re: Optimizing with Caching vs. Parallelizing (MCE::Map) runs in 1.57 sec on the same (virtual) machine.

use strict; use warnings; omitted for brevity.

Many thanks for providing this wonderful challenge! If all goes at the pace it's rolling now, the fun may well continue into the month of May! Therefore, a warning: some people are in danger of gaining (or losing) more than they bargained for! :)

> I decided to ask people to output the top 20, because that presents an interesting mini-challenge by itself

Well said, dear rjt, well said. But it seems it was YOU who fell into the trap you so cunningly crafted for poor innocent learners! The top 20 out of a million Collatz sequences has an insidious property: there are six numbers with length "445" below 1e6, but only four slots left for them in the top 20 (all six would fit in a top 22). Which four to extract? Aha!

Well, the challenge did not clearly state how to order numbers with the same Collatz sequence length (CL). But I think it is reasonable to assume that, since numbers with longer CLs are rated better/closer to the top, smaller numbers among those producing the same CL are to be valued more. Like: "Look! This brave little number commendably creates as long a CL as that huge number! And that undeservedly huge number has only managed to generate such a puny CL! Loser!"

At least, there must be some consistency in arranging results, don't you agree? In other words, I think that if the CL column descends, then the numbers column must ascend (for equal CLs). See the ordering for CL 450, too.
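That consistent arrangement - length descending, starting number ascending on ties - costs exactly one composite sort key. A sketch (Python; the sample rows are taken from the output above):

```python
rows = [(820022, 450), (818943, 450), (837799, 525), (820023, 450)]

# Sort by sequence length descending, then starting number ascending on ties.
ordered = sorted(rows, key=lambda r: (-r[1], r[0]))

print(ordered)  # [(837799, 525), (818943, 450), (820022, 450), (820023, 450)]
```

In Perl the equivalent comparator would be `sort { $b->[1] <=> $a->[1] or $a->[0] <=> $b->[0] }`.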

Well, dear Perl users, I happened to notice this because in another of my (unpublished) solutions I had to endure great pain arranging the top 20 properly. Just how many of you, looking at the output of (almost all) the scripts in this and the 2 related threads, have noticed that the top 20 come out neatly ordered? And yet this "natural" result comes at no cost at all with Perl! Just appreciate what you so ungratefully consume! :)

To illustrate, rjt, here's, in parallel, output of your script and Laurent_R's with marioroy fixes:

```
Collatz(837799) has sequence length of 525 steps | 837799: 525
Collatz(626331) has sequence length of 509 steps | 626331: 509
Collatz(939497) has sequence length of 507 steps | 939497: 507
Collatz(704623) has sequence length of 504 steps | 704623: 504
Collatz(910107) has sequence length of 476 steps | 910107: 476
Collatz(927003) has sequence length of 476 steps | 927003: 476
Collatz(511935) has sequence length of 470 steps | 511935: 470
Collatz(767903) has sequence length of 468 steps | 767903: 468
Collatz(796095) has sequence length of 468 steps | 796095: 468
Collatz(970599) has sequence length of 458 steps | 970599: 458
Collatz(546681) has sequence length of 452 steps | 546681: 452
Collatz(820022) has sequence length of 450 steps | 818943: 450
Collatz(818943) has sequence length of 450 steps | 820022: 450
Collatz(820023) has sequence length of 450 steps | 820023: 450
Collatz(410011) has sequence length of 449 steps | 410011: 449
Collatz(615017) has sequence length of 447 steps | 615017: 447
Collatz(922526) has sequence length of 445 steps | 886953: 445
Collatz(922526) has sequence length of 445 steps | 906175: 445
Collatz(886953) has sequence length of 445 steps | 922524: 445
Collatz(906175) has sequence length of 445 steps | 922525: 445
```

But wait... What's that??? The 922526 number is listed twice, on the left?? Is this... can't believe... is this because of infamous What Every Computer Scientist Should Know About Floating-Point Arithmetic? Because someone:) decided to cut corners and resort to FP? Or is it for another reason? Didn't investigate yet. Just who would have thought that this would surface in so innocent task "calculate lengths of Collatz sequences". Wow! Great challenge!

Edit. No, of course it's not a floating point issue. The algo is broken. CLs for odd numbers are cached, but for even numbers they are not. Some even numbers are never passed to &top (e.g. 922524), others are pumped into the subroutine several times. The 922526 hadn't been phased out from @top by odd numbers with longer CLs. With $top large enough, there are many even dupes.

Ha! Good eye. I didn't even notice the doubled-up 922526. It's not a FP bug. Rather, it's a corner case thanks to how I combined the /2 optimization + memoization; certain even numbers get added to the p.queue twice. It can be fixed trivially by either removing the /2 optimization (simpler, ~5% penalty), or skipping seen numbers in the second call to top() (no measurable penalty, adds a variable).
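The second fix - skipping already-seen numbers in the second call to top() - amounts to one guard set. A hypothetical sketch (Python; `top` stands in for the priority-queue insert, and all names here are illustrative, not rjt's actual code):

```python
def make_top(k=20):
    """Top-k collector that ignores repeat submissions of the same start."""
    seen, best = set(), []
    def top(n, length):
        if n in seen:          # guard: each start may enter only once
            return
        seen.add(n)
        best.append((length, n))
        best.sort(reverse=True)
        del best[k:]           # keep only the k largest lengths
    return top, best

top, best = make_top(k=2)
top(922526, 445); top(922526, 445); top(837799, 525)
print([n for _, n in best])  # [837799, 922526] -- the duplicate was dropped
```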

As to your interpretation of the "top 20 arrangement," I like your discussion! We try to keep the task descriptions only quasi-formal, to keep the challenge accessible to beginners, which is why you don't usually see these sorts of details specified like a requirements document. Meaning, many "minor" details are left to the discretion of the participants. The upshot of that is, if you submit a really weird interpretation, you'd probably net yourself a mildly amusing comment in my next review, at least. :-)

use strict; use warnings; omitted for brevity.

Hi rjt,

Thank you for this challenge. It consumed a lot of my time, in a great way. Partly because of the question, "What is possible with many CPU cores?" But first I made attempts at a fast single-core solution. Below are the 3 progressive solutions, each one running faster than the last.

Update: Added results from two machines.

```#!/usr/bin/env perl
use strict;
use warnings;

my $size = shift || 1e6;

$size = 1e6 if $size < 1e6;  # minimum
$size = 1e9 if $size > 1e9;  # maximum

##
#   https://www.perlmonks.org/?node_id=11115520
#   https://www.perlmonks.org/?node_id=11115540
#
# Parallel solution
#   https://www.perlmonks.org/?node_id=11115544
##

my @cache = (0, 1, 2);
my @seqs;

sub collatz_seq {
    my $size = shift;
    my ($n, $steps);
    for my $input (2..$size) {
        $n = $input, $steps = 0;
        while ($n != 1) {
            $steps += $cache[$n], last if defined $cache[$n];
            $n % 2 ? ( $steps += 2, $n = (3 * $n + 1) >> 1 )
                   : ( $steps += 1, $n = $n >> 1 );
        }
        $cache[$input] = $steps if $input < $size;
        push @seqs, [ $input, $steps ] if $steps > 400;
    }
}

collatz_seq($size);
@seqs = ( sort { $b->[1] <=> $a->[1] } @seqs )[ 0..19 ];

printf "Collatz(%5d) has sequence length of %3d steps\n", @$_
    for @seqs;

iM71's C++ demonstration converted to Perl plus updates:

```#!/usr/bin/env perl
use strict;
use warnings;

my $size = shift || 1e6;

$size = 1e6 if $size < 1e6;  # minimum
$size = 1e9 if $size > 1e9;  # maximum

##
# iM71's demonstration + applied T(x) notation and compression
#   https://stackoverflow.com/a/55361008
#   https://www.youtube.com/watch?v=t1I9uHF9X5Y (1 min into video)
#
# Parallel solution
#   https://www.perlmonks.org/?node_id=11115780
##

my @cache = (0, 1, 2);
my @seqs;

sub collatz_seq {
    my $size = shift;
    my ($n, $steps);
    for my $input (2..$size) {
        $n = $input, $steps = 0;
        $n % 2 ? ( $steps += 2, $n = (3 * $n + 1) >> 1 )
               : ( $steps += 1, $n = $n >> 1 )
            while $n != 1 && $n >= $input;

        $cache[$input] = $steps += $cache[$n];
        push @seqs, [ $input, $steps ] if $steps > 400;
    }
}

collatz_seq($size);
@seqs = ( sort { $b->[1] <=> $a->[1] } @seqs )[ 0..19 ];

printf "Collatz(%5d) has sequence length of %3d steps\n", @$_
    for @seqs;

Step counting using Inline C:

```#!/usr/bin/env perl
use strict;
use warnings;

use Inline C => Config => CCFLAGSEX => '-O2 -fomit-frame-pointer';
use Inline C => <<'END_OF_C_CODE';

#include <stdint.h>

void num_steps_c( SV* _n, SV* _s )
{
    uint64_t n, input;
    int steps = 0;

    n = input = SvUV(_n);

    while ( n != 1 && n >= input ) {
        n % 2 ? ( steps += 2, n = (3 * n + 1) >> 1 )
              : ( steps += 1, n = n >> 1 );
    }

    sv_setuv(_n, n);
    sv_setiv(_s, steps);

    return;
}

END_OF_C_CODE

my $size = shift || 1e6;

$size = 1e6 if $size < 1e6;  # minimum
$size = 1e9 if $size > 1e9;  # maximum

##
# iM71's demonstration + applied T(x) notation and compression
#   https://stackoverflow.com/a/55361008
#   https://www.youtube.com/watch?v=t1I9uHF9X5Y (1 min into video)
#
# Parallel solution
#   https://www.perlmonks.org/?node_id=11115780
##

my @cache = (0, 1, 2);
my @seqs;

sub collatz_seq {
    my $size = shift;
    my ($n, $steps);
    for my $input (2..$size) {
        num_steps_c($n = $input, $steps);
        $cache[$input] = $steps += $cache[$n];
        push @seqs, [ $input, $steps ] if $steps > 400;
    }
}

collatz_seq($size);
@seqs = ( sort { $b->[1] <=> $a->[1] } @seqs )[ 0..19 ];

printf "Collatz(%5d) has sequence length of %3d steps\n", @$_
    for @seqs;

Results from two machines:

```64-bit VM:
rjt   0.903s
Step counting in C   0.273s (1st time involves compiling)

AMD 3970x:
rjt   0.635s
Step counting in C   0.191s (1st time involves compiling)

Collatz(837799) has sequence length of 525 steps
Collatz(626331) has sequence length of 509 steps
Collatz(939497) has sequence length of 507 steps
Collatz(704623) has sequence length of 504 steps
Collatz(910107) has sequence length of 476 steps
Collatz(927003) has sequence length of 476 steps
Collatz(511935) has sequence length of 470 steps
Collatz(767903) has sequence length of 468 steps
Collatz(796095) has sequence length of 468 steps
Collatz(970599) has sequence length of 458 steps
Collatz(546681) has sequence length of 452 steps
Collatz(818943) has sequence length of 450 steps
Collatz(820022) has sequence length of 450 steps
Collatz(820023) has sequence length of 450 steps
Collatz(410011) has sequence length of 449 steps
Collatz(615017) has sequence length of 447 steps
Collatz(886953) has sequence length of 445 steps
Collatz(906175) has sequence length of 445 steps
Collatz(922524) has sequence length of 445 steps
Collatz(922525) has sequence length of 445 steps

Regards, Mario

This is great, Mario (and everyone else in this thread, for that matter)! The multicore work is fantastic. I'm very impressed by the level of interest and dedication this "little" question generated. Hopefully we can come up with a few more like it. (And anyone can suggest challenges, by the way.)

use strict; use warnings; omitted for brevity.
Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
by Laurent_R (Canon) on Apr 14, 2020 at 13:53 UTC
Hi dear fellow monks,

as a follow-up to my previous post, I implemented my new caching strategy, consisting in storing in the cache the length of the sequences rather than the full sequences.

On the computer where I'm running my tests right now, my original program has the following timing:

```$ time perl collatz.pl
837799: 525
626331: 509
[lines omitted for brevity]
906175: 445
922524: 445
922525: 445

real    1m37,551s
user    1m9,375s
sys     0m21,031s
My laptop is obviously much slower than Nick's computer (where this program took 22 seconds).

This is now my first implementation with the new caching strategy:

```#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
use constant MAX => 1000000;

my %cache = (2 => 2);

sub collatz_seq {
    my $input = shift;
    my $n = $input;
    my $result = 0;
    while ($n != 1) {
        if (exists $cache{$n}) {
            $result += $cache{$n};
            last;
        } else {
            my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2;
            $result++;
            $cache{$n} = $cache{$new_n} + 1
                if defined $cache{$new_n} and $n < MAX;
            $n = $new_n;
        }
    }
    $cache{$input} = $result if $input < MAX;
    return $result;
}

my @long_seqs;
for my $num (1..1000000) {
    my $seq_length = collatz_seq $num;
    push @long_seqs, [ $num, $seq_length ] if $seq_length > 400;
}

@long_seqs = sort { $b->[1] <=> $a->[1] } @long_seqs;
say "$_->[0]: $_->[1]" for @long_seqs[0..19];
This program produces the same outcome, but is nearly 3 times faster:
```real    0m34,207s
user    0m34,108s
sys     0m0,124s
But we now end up with a cache having essentially one entry per input number in the 1..1_000_000 range. So, I thought, perhaps it might be better to use an array, rather than a hash, for the cache (accessing an array item should be faster than a hash lookup).

This is the code for this new implementation:

```#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
use constant MAX => 1000000;

my @cache = (0, 1, 2);

sub collatz_seq {
    my $input = shift;
    my $n = $input;
    my $result = 0;
    while ($n != 1) {
        if (defined $cache[$n]) {
            $result += $cache[$n];
            last;
        } else {
            my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2;
            $result++;
            $cache[$n] = $cache[$new_n] + 1
                if defined $cache[$new_n] and $n < MAX;
            $n = $new_n;
        }
    }
    $cache[$input] = $result if $input < MAX;
    return $result;
}

my @long_seqs;
for my $num (1..1000000) {
    my $seq_length = collatz_seq $num;
    push @long_seqs, [ $num, $seq_length ] if $seq_length > 400;
}

@long_seqs = sort { $b->[1] <=> $a->[1] } @long_seqs;
say "$_->[0]: $_->[1]" for @long_seqs[0..19];
With this new implementation, we still obtain the same result, but the program is now more than 55 times faster than my original one (and almost 20 times faster than the solution using a hash for the cache):
```$ time perl collatz3.pl
837799: 525
626331: 509
[Lines omitted for brevity]
922524: 445
922525: 445

real    0m1,755s
user    0m1,687s
sys     0m0,061s
I strongly suspected that using an array would be faster, but I frankly did not expect such a huge gain until I tested it.
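The "cache the length, not the sequence" idea is easy to reproduce outside Perl. Here is a minimal cross-language sketch in Python (names and sizes are mine, not from the thread), with a flat list playing the role of the Perl array cache; only the starting number is cached here, a simplification of Laurent's version, which also caches intermediate values:

```python
MAX = 100_000  # smaller than the thread's 1e6 so the sketch runs instantly

cache = [0] * (MAX + 1)    # flat array cache, like Perl's @cache
cache[1], cache[2] = 1, 2  # seed: "length" counts terms, including start and final 1

def collatz_len(start):
    """Term count of the Collatz sequence, memoized by starting value."""
    n, steps = start, 0
    while True:
        if n <= MAX and cache[n]:      # known tail: reuse its cached length
            steps += cache[n]
            break
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    if start <= MAX:
        cache[start] = steps
    return steps

lengths = [(n, collatz_len(n)) for n in range(1, MAX + 1)]
top = sorted(lengths, key=lambda t: -t[1])[:20]
```

A dict cache gives the same answers; the flat list just trades memory for lookup speed, which is essentially the gain measured above when moving from %cache to @cache.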

So, it is true that throwing more CPU cores at the problem makes the solution faster (although to a limited extent with my computer that has only 4 cores). But using a better algorithm can often be a better solution.

Hi Laurent_R and fellow monks,

Nice speedup! I applied 3 optimizations in preparation for a follow-up post combining caching with parallelization. This was done because File::Map adds overhead (i.e. it will not be as fast as an array). So I asked myself whether anything could be done to further improve the serial implementation and, if so, whether it would be helpful. I provide the run time for each stage at the end of this post. It turns out that improvements were possible, and I will be converting collatz3_d.pl to use File::Map for the parallel demonstration.

Update: Further improvements, Step 4.

1. Replaced division by 2.

```$n >> 1;

2. Removed one level of branching.

```    while ($n != 1) {
        $result += $cache[$n], last
            if defined $cache[$n];

        my $new_n = $n % 2 ? 3 * $n + 1 : $n >> 1;
        $result++;
        $cache[$n] = $cache[$new_n] + 1
            if defined $cache[$new_n] and $n < $max;

        $n = $new_n;
    }

3. Then reduced the number of loop iterations. Credit for this goes to the video Notation and compressed dynamics, one minute in (i.e. the T(x) notation): 3n + 1 is always even for odd n, so the halving that follows can be fused into the same iteration.

```    while ($n != 1) {
        $result += $cache[$n], last
            if defined $cache[$n];

        $n % 2 ? ( $result += 2, $new_n = (3 * $n + 1) >> 1 )
               : ( $result += 1, $new_n = $n >> 1 );

        $cache[$n] = $cache[$new_n] + ($n % 2 ? 2 : 1)
            if defined $cache[$new_n] and $n < $max;

        $n = $new_n;
    }

4. Finally, less caching.

```    while ($n != 1) {
        $result += $cache[$n], last
            if defined $cache[$n];

        $n % 2 ? ( $result += 2, $new_n = (3 * $n + 1) >> 1 )
               : ( $result += 1, $new_n = $n >> 1 );

        $n = $new_n;
    }
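The effect of step 3 is easiest to see side by side. A small Python sketch (mine, not from the post) of both step counters, relying only on the fact that 3n + 1 is even whenever n is odd:

```python
def steps_plain(n):
    """One Collatz transition per loop iteration."""
    s = 0
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        s += 1
    return s

def steps_compressed(n):
    """T(x) form: fuse the halving that always follows an odd step."""
    s = 0
    while n != 1:
        if n % 2:
            n = (3 * n + 1) >> 1  # two transitions, one iteration
            s += 2
        else:
            n >>= 1
            s += 1
    return s
```

Both return identical totals; the compressed form simply executes fewer loop iterations, which is where the collatz3_d.pl gain comes from.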

The final code optionally takes an argument.

```#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;

my $max = shift || 1e6;
$max = 1e6 if $max < 1e6;

my @cache = (0, 1, 2);

sub collatz_seq {
    my $input = shift;
    my $n = $input;
    my $result = 0;
    my $new_n;

    while ($n != 1) {
        $result += $cache[$n], last
            if defined $cache[$n];

        $n % 2 ? ( $result += 2, $new_n = (3 * $n + 1) >> 1 )
               : ( $result += 1, $new_n = $n >> 1 );

        $n = $new_n;
    }

    $cache[$input] = $result if $input < $max;
    return $result;
}

my @long_seqs;
for my $num (1..$max) {
    my $seq_length = collatz_seq $num;
    push @long_seqs, [ $num, $seq_length ] if $seq_length > 400;
}

@long_seqs = sort { $b->[1] <=> $a->[1] } @long_seqs;
say "$_->[0]: $_->[1]" for @long_seqs[0..19];

Results:

```$ time perl collatz3_a.pl 1e7

# Intel i7 laptop, Docker Container, Ubuntu + Perlbrew Perl 5.30.1

collatz3_a.pl 1e7   32.291s  (a) original code, accepts argument
collatz3_b.pl 1e7   30.134s  (b) a + replaced division with >> 1
collatz3_c.pl 1e7   28.503s  (c) b + removed 1 level of branching
collatz3_d.pl 1e7   21.464s  (d) c + reduced loop iterations
collatz3_e.pl 1e7   19.357s  (e) d + less caching

# AMD 3970x, Docker Container, Ubuntu + Perlbrew Perl 5.30.1

collatz3_a.pl 1e7   13.130s  (a) original code, accepts argument
collatz3_b.pl 1e7   12.394s  (b) a + replaced division with >> 1
collatz3_c.pl 1e7   12.261s  (c) b + removed 1 level of branching
collatz3_d.pl 1e7    9.170s  (d) c + reduced loop iterations
collatz3_e.pl 1e7    7.681s  (e) d + less caching

8400511: 686
8865705: 668
6649279: 665
9973919: 663
6674175: 621
7332399: 616
7532665: 616
5649499: 613
8474249: 611
6355687: 608
8847225: 606
9533531: 606
6635419: 603
9953129: 601
7464846: 598
7464847: 598
3732423: 597
5598635: 595
8397953: 593
6298465: 590

Regards, Mario

Hi Mario,

I did not expect such micro-optimizations to bring such a significant performance improvement; that's interesting. Replacing the division by 2 with a bit shift, especially, is the type of thing I had stopped doing decades ago, when I figured that the C compiler I was using at the time was doing this type of optimization at least as well as, and often better than, I could. Good to be reminded that the optimizations you suggested can still be quite useful. Thank you.

Thank you very much for your challenging ideas.

Re: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL fun)
by vr (Curate) on Apr 21, 2020 at 11:14 UTC

Here's an attempt to solve this Collatz challenge in purely vectorized fashion. As it (slowly) progressed, it soon became apparent the emerging result is somewhat impractical, merely a curiosity; now it's finished (i.e. I can't find how to improve it) and ready to amaze/amuse/frighten the public.

PDL was mentioned in this (or 2 related) threads, but it served just as a binary container, with "set/at" playing the same role as "substr + pack/unpack", so it wasn't really PDL, was it. This node sticks to the original task as stated:

calculate the sequence length for all starting numbers up to 1000000 (1e6), and output the starting number and sequence length for the longest 20 sequences

The naive straightforward implementation is simple, it should read without much comment (as "prose":)):

```use strict;
use warnings;
use feature 'say';
use PDL;

use Time::HiRes 'time';
my $t = time;

use constant MAX => 1e6;
use constant TOP => MAX < 20 ? MAX : 20;

my $seqs = 1 + sequence( longlong, MAX );
$seqs-> badvalue( 1 );   # a sequence is complete when it reaches 1 (reconstructed line)

my $lengths = ones( short, MAX );

while ( any my $good_mask = $seqs-> isgood ) {
    my $odd_mask = $seqs & 1;           # BADs taint the mask, so finished elements stay excluded
    $lengths-> where( $good_mask ) ++;  # +1 step for every active sequence (reconstructed line)
    $lengths-> where( $odd_mask ) ++;   # fused odd step (3n+1)/2 counts twice (reconstructed line)
    ( $seqs-> where( $odd_mask ) *= 3 ) ++;
    $seqs >>= 1;
}

my $top_i = $lengths-> qsorti
                     -> slice([ MAX - 1, MAX - TOP ]);

say $lengths-> index( $top_i )
            -> longlong
            -> cat( $top_i + 1 )
            -> transpose;
say time - $t;

__END__

[
 [   525 837799]
 [   509 626331]
 ...
 [   445 938143]
 [   445 906175]
 [   445 922525]
 [   445 922526]
]

7.98023009300232

Above, completed sequences are marked as "BAD", which is just an agreed-upon value, treated specially. (1) There are built-in methods to check for good/bad, which saves us comparisons while creating masks. (2) More importantly, BADs kind of taint whatever they interact with (cf. NaN in FP math), which is quite useful.

The timing is sloooooow, yet decent among non-caching solutions around here. Another issue stems from QuickSort being unstable (I recently ranted, in this thread, about neat ordering and "correct 20" extraction).

To fix the latter, there's the qsortvec method (and its qsortveci companion) to sort a 2D array (as opposed to qsort for 1D, used above), i.e. first on the 1st axis, then on the 2nd. But here's the dilemma: (1) build a full-height (2 x 1e6) array, qsortvec it, extract the top 20. Possible, but, for speed, I'd prefer (2): qsort the lengths (as above), extract "enough" entries close to 20, build a small 2 x N array, qsortvec it, extract the correct (and correctly arranged) top 20.

For that, find the value at MAX - TOP (445), look left, and find how much to extract (22). More fuss: the 1st column is to descend, the 2nd to ascend -- thus temporarily negate (flip the bits of) one of them. So the huge and unpleasant-looking new "tail" after the main loop, in the script below, is there to fix the top-20 extraction. But in fact it adds almost nothing to the consumed time.

```[ BAD BAD 3 4 5 ... ]       # initial $seqs
[ 1 2 2 2 2 ... ]           # initial $lengths

Other than that, it's the same non-caching approach, with original clarity and simplicity somewhat spoiled by fixes/optimizations:

```use strict;
use warnings;
use feature 'say';
use PDL;

use Time::HiRes 'time';
my $t = time;

use constant MAX => 1e6;
use constant TOP => MAX < 20 ? MAX : 20;

my $seqs = 1 + sequence( longlong, MAX );
$seqs-> badvalue( 2 );                      # any element equal to 2 counts as BAD (reconstructed line)
$seqs-> inplace-> setbadif( $seqs <= 2 );   # 1 and 2 are pre-completed (reconstructed line)

my $lengths = ones( short, MAX );
$lengths <<= 1;
$lengths-> set( 0, 1 );

while ( any my $good_mask = $seqs-> isgood ) {

    my ( $seqs_odd, $lengths_odd )
        = where( $seqs, $lengths, $seqs & 1 );

    $lengths-> where( $good_mask ) ++;  # +1 step for every active sequence (reconstructed line)
    $lengths_odd ++;                    # fused odd step counts twice (reconstructed line)
    ( $seqs_odd *= 3 ) ++;
    $seqs >>= 1;
}

my $sorted_i = $lengths-> qsorti;
my $sorted   = $lengths-> index( $sorted_i );
my $value    = $sorted-> at( MAX - TOP );
my $pos      = vsearch_insert_leftmost( $value, $sorted );
my $top_i    = $sorted_i-> slice([ MAX - 1, $pos ]);

( my $result = $lengths
    -> index( $top_i )
    -> longlong
    -> bitnot
    -> cat( $top_i + 1 )
    -> transpose
    -> qsortvec
    -> slice([], [ 0, TOP - 1 ])

)-> slice([ 0 ], [])
 -> inplace
 -> bitnot;

say $result;
say time - $t;

__END__

[
 [   525 837799]
 [   509 626331]
 ...
 [   445 886953]
 [   445 906175]
 [   445 922524]
 [   445 922525]
]

6.0809600353241

So we have correct output and improved time. Good.

Now to something more interesting -- let's add caching/looking-up. Because we are to use $seqs as an index into $lengths, and indexing starts from 0, let's prepend a dummy 0th element. To kick-start indexing, and because the value "2" is occupied to mark "BAD" i.e. the sequence stopper, we'll add one more seed element to lengths. Further, lengths will now all start as BAD, and be switched to computed values as we go:

```[ BAD BAD BAD 3 4 5 6 ... ]     # initial $seqs

(by the way, BAD in $lengths is still the default for short, -32768)

We'll also maintain a $current lengths-helper piddle, incremented by 1 or 2 depending on the oddity mask of the current sequences state. Observe, further, how where calls in list context are (over-)abused in the code below. (I'm quite aware this code is no longer "a prose to read". Set MAX to 10 and dump the primary piddles on each iteration to see what's going on. There are the same 3; other vars are masked views into them.)

```use strict;
use warnings;
use feature 'say';
use PDL;

use Time::HiRes 'time';
my $t = time;

use constant MAX => 1e6;
use constant TOP => MAX < 20 ? MAX : 20;

my $seqs = sequence( longlong, 1 + MAX );
$seqs-> badvalue( 2 );                      # "2" is the agreed-upon BAD marker (reconstructed line)
$seqs-> inplace-> setbadif( $seqs < 3 );    # dummy 0th element, 1 and 2 start completed (reconstructed line)

my $lengths = ones( short, 1 + MAX );
$lengths-> inplace-> setbadif( $lengths == 1 );   # all lengths start as BAD (reconstructed line)
$lengths-> set( 1, 1 );
$lengths-> set( 2, 2 );
$lengths-> set( 4, 3 );

my $current = zeroes( short, 1 + MAX );

while ( any $seqs-> isgood ) {      # sic

    my ( $seqs_odd, $current_odd )
        = where( $seqs, $current, $seqs & 1 );

    $current ++;
    $current_odd ++;                    # fused odd step counts twice (reconstructed line)
    ( $seqs_odd *= 3 ) ++;
    $seqs >>= 1;

    my ( $seqs_cap, $lengths_cap, $current_cap )
        = where( $seqs, $lengths, $current, $seqs <= MAX );

    my $lut = $lengths-> index( $seqs_cap );

    # "_f" is for "finished"

    my ( $seqs_f, $lengths_f, $lut_f, $current_f )
        = where( $seqs_cap, $lengths_cap, $lut, $current_cap,
                 $lut-> isgood );

    $lengths_f .= $lut_f + $current_f;
    $seqs_f    .= 2;                    # i.e. BAD
}

my $sorted_i = $lengths-> qsorti;
my $sorted   = $lengths-> index( $sorted_i );
my $value    = $sorted-> at( MAX + 1 - TOP );
my $pos      = vsearch_insert_leftmost( $value, $sorted );
my $top_i    = $sorted_i-> slice([ MAX, $pos ]);

( my $result = $lengths
    -> index( $top_i )
    -> longlong
    -> bitnot
    -> cat( $top_i )
    -> transpose
    -> qsortvec
    -> slice([], [ 0, TOP - 1 ])

)-> slice([ 0 ], [])
 -> inplace
 -> bitnot;

say $result;
say time - $t;

__END__

[
 [   525 837799]
 [   509 626331]
 ...
 [   445 886953]
 [   445 906175]
 [   445 922524]
 [   445 922525]
]

2.88385105133057

And that (~2x faster), I'm afraid, is as good as it will get. As I understand it, cache hits are significantly more rare than with sequential element-after-element array processing. For comparison, on the same hardware, Laurent_R's final/polished caching solution runs at 1.63s here if I disable the use of the "magic number" 400 ("magic" constants to crank up performance aren't fair :)), and at 0.88s otherwise.

For 1e7 numbers, the running time becomes ~65s, i.e. it gets impractical, like I said. Switching on parallel processing made no difference. Use of multiple cores can be observed for only about the first second; then "viewports" into the piddles become increasingly fragmented and the work can't be split.

Maybe I did something wrong, and certainly someone can improve even if a little bit, but I'm glad this mini-project is finally off my shoulders.

Hi vr,

Lots of PDL goodness. Wow! I did not know how this would perform without taking a moment or two to give it a try. So here is a parallel version based on your 2nd example.

```use strict;
use warnings;
use feature 'say';
use PDL;
use MCE::Flow;
use MCE::Candy;

use Time::HiRes 'time';
my $t = time;

# PDL Quick Reference
# https://www.perlmonks.org/?node_id=1214437

sub collatz_seq {
    my ( $chunk_id, $seq_beg, $seq_end ) = @_;
    my $max = $seq_end - $seq_beg + 2;
    my $top = $max < 20 ? $max : 20;

    my $seqs = pdl( longlong, 1, $seq_beg..$seq_end );
    $seqs-> badvalue( 2 );                      # any element equal to 2 counts as BAD (reconstructed line)
    $seqs-> inplace-> setbadif( $seqs <= 2 );   # 1 and 2 are pre-completed (reconstructed line)

    my $lengths = 1 + ones( short, $max );
    $lengths-> set( 0, 1 );

    while ( any my $good_mask = $seqs-> isgood ) {

        my ( $seqs_odd, $lengths_odd )
            = where( $seqs, $lengths, $seqs & 1 );

        $lengths-> where( $good_mask ) ++;  # +1 step for every active sequence (reconstructed line)
        $lengths_odd ++;                    # fused odd step counts twice (reconstructed line)
        ( $seqs_odd *= 3 ) ++;
        $seqs >>= 1;
    }

    my $sorted_i = $lengths-> qsorti;
    my $sorted   = $lengths-> index( $sorted_i );
    my $value    = $sorted-> at( $max - $top );
    my $pos      = vsearch_insert_leftmost( $value, $sorted );
    my $top_i    = $sorted_i-> slice([ $max - 1, $pos ]);

    ( my $result = $lengths
        -> index( $top_i )
        -> longlong
        -> bitnot
        -> cat( $top_i + 1 )
        -> transpose
        -> qsortvec
        -> slice([], [ 0, $top - 1 ])

    )-> slice([ 0 ], [])
     -> inplace
     -> bitnot;

    # From PDL to Perl: [ 0 1 ] becomes [ 1, 0 ],
    my $str = $result->string;
    $str =~ s/(\d+)\s+(\d+)(.*)/$2,$1$3,/g;

    my $ret = eval $str;
    $_->[0] = $_->[0] + $seq_beg - 2 for @$ret;

    MCE->gather( $chunk_id, @$ret );
}

my $size = shift || 1e6;
$size = 1e6 if $size < 1e6;  # minimum
$size = 1e9 if $size > 1e9;  # maximum

my $chunk_size = $size >> 5;
my @seqs;

mce_flow_s {
    max_workers => MCE::Util::get_ncpu(),
    chunk_size  => $chunk_size > 100000 ? 100000 : $chunk_size,
    bounds_only => 1,
    gather      => MCE::Candy::out_iter_array(\@seqs),
}, sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    collatz_seq( $chunk_id, @{ $chunk_ref } );
}, 2, $size;

MCE::Flow->finish;

@seqs = ( sort { $b->[1] <=> $a->[1] } @seqs )[ 0..19 ];

printf "Collatz(%5d) has sequence length of %3d steps\n", @$_
    for @seqs;

say {*STDERR} time - $t;

Results:

I was expecting mce_pdl2 using 1 core to be closer to vr_pdl2 than to vr_pdl3, maybe benefitting from CPU L2/L3 cache. Chunking seems to be the reason why mce_pdl2 (non-caching) ran nearly as fast as vr_pdl3 (caching). Below are results including 1-core testing for 1e7 with various chunk sizes.

```1e7 testing
vr_pdl3:    46.328s  cache
vr_pdl2:  1m16.058s  non-cache
mce_pdl2:  1m15.583s  chunk_size => \$size
mce_pdl2:  1m03.709s  chunk_size => \$size >> 1
mce_pdl2:    52.352s  chunk_size => \$size >> 2
mce_pdl2:    51.323s  chunk_size => \$size >> 3
mce_pdl2:    49.396s  chunk_size => \$size >> 4
mce_pdl2:    48.195s  chunk_size => \$size >> 5
mce_pdl2:    48.369s  chunk_size => \$size >> 6
mce_pdl2:    48.501s  chunk_size => \$size >> 7

chunk_size => 300000
mce_pdl2:    48.195s   1 core
mce_pdl2:    25.311s   2 cores
mce_pdl2:    14.085s   4 cores
mce_pdl2:     7.650s   8 cores
mce_pdl2:     4.517s  16 cores
mce_pdl2:     3.721s  32 cores

chunk_size => 100000
mce_pdl2:    48.395s   1 core
mce_pdl2:    25.402s   2 cores
mce_pdl2:    13.163s   4 cores
mce_pdl2:     6.860s   8 cores
mce_pdl2:     3.850s  16 cores
mce_pdl2:     2.347s  32 cores

Output
Collatz(8400511) has sequence length of 686 steps
Collatz(8865705) has sequence length of 668 steps
Collatz(6649279) has sequence length of 665 steps
Collatz(9973919) has sequence length of 663 steps
Collatz(6674175) has sequence length of 621 steps
Collatz(7332399) has sequence length of 616 steps
Collatz(7532665) has sequence length of 616 steps
Collatz(5649499) has sequence length of 613 steps
Collatz(8474249) has sequence length of 611 steps
Collatz(6355687) has sequence length of 608 steps
Collatz(8847225) has sequence length of 606 steps
Collatz(9533531) has sequence length of 606 steps
Collatz(6635419) has sequence length of 603 steps
Collatz(9953129) has sequence length of 601 steps
Collatz(7464846) has sequence length of 598 steps
Collatz(7464847) has sequence length of 598 steps
Collatz(3732423) has sequence length of 597 steps
Collatz(5598635) has sequence length of 595 steps
Collatz(8397953) has sequence length of 593 steps
Collatz(6298465) has sequence length of 590 steps
```1e8 testing
vr_pdl3:  9m44.667s  cache
vr_pdl2: 16m12.467s  non-cache

chunk_size => 300000
mce_pdl2:  9m06.078s   1 core
mce_pdl2:  4m43.529s   2 cores
mce_pdl2:  2m33.136s   4 cores
mce_pdl2:  1m21.434s   8 cores
mce_pdl2:    45.266s  16 cores
mce_pdl2:    36.925s  32 cores

chunk_size => 100000
mce_pdl2:  9m09.950s   1 core
mce_pdl2:  4m39.677s   2 cores
mce_pdl2:  2m24.230s   4 cores
mce_pdl2:  1m13.353s   8 cores
mce_pdl2:    37.923s  16 cores
mce_pdl2:    20.099s  32 cores

Output
Collatz(63728127) has sequence length of 950 steps
Collatz(95592191) has sequence length of 948 steps
Collatz(96883183) has sequence length of 811 steps
Collatz(86010015) has sequence length of 798 steps
Collatz(98110761) has sequence length of 749 steps
Collatz(73583070) has sequence length of 746 steps
Collatz(73583071) has sequence length of 746 steps
Collatz(36791535) has sequence length of 745 steps
Collatz(55187303) has sequence length of 743 steps
Collatz(56924955) has sequence length of 743 steps
Collatz(82780955) has sequence length of 741 steps
Collatz(85387433) has sequence length of 741 steps
Collatz(63101607) has sequence length of 738 steps
Collatz(64040575) has sequence length of 738 steps
Collatz(93128574) has sequence length of 736 steps
Collatz(93128575) has sequence length of 736 steps
Collatz(94652411) has sequence length of 736 steps
Collatz(96060863) has sequence length of 736 steps
Collatz(46564287) has sequence length of 735 steps
Collatz(69846431) has sequence length of 733 steps

Specifying 50000 for chunk size may run faster on 4/6/8 core machines.

```8 cores
mce_pdl2 1e8  16.660s  chunk_size => 300000
mce_pdl2 1e8  10.471s  chunk_size => 100000
mce_pdl2 1e8   9.773s  chunk_size =>  50000

Regards, Mario

Sorry for the delay with the reply/update; first I got distracted with "let's try another optimization", then with other/unrelated things. I hope this topic is still warm :).

My "caching" PDL solution (I'd prefer "looking-up" to "caching" or "memoization" in this case) is slow because, if the vector is processed as a whole, cache hits happen less frequently, i.e. after quite a few more useless steps. To illustrate with the sequence from the original challenge:

23 → 70 → 35 → 106 → 53 → 160 → 80 → 40 → 20 → 10 → 5 → 16 → 8 → 4 → 2 → 1

Number 70 reaches 35 in only one step, but 35 itself is still very far from completion. Simplifying, if the 1..70 range were divided into e.g. 2 chunks processed independently one after another, it would alleviate the problem: by the time the second chunk runs, lengths for the whole first chunk are already known.
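The "useless steps" can actually be counted. A Python sketch (mine, not from the post) compares running every chain all the way to 1 against stopping at the first value below the start — the latter is guaranteed a cache hit when numbers are filled in ascending order, which is the situation chunking approximates for the vectorized code:

```python
N = 10_000
L = [0, 1]            # L[n] = term count; ascending fill guarantees hits below n
full_iters = stop_iters = 0

for n in range(2, N + 1):
    m, s = n, 0
    while m >= n:                         # stop at the first dip below the start
        m = 3 * m + 1 if m % 2 else m // 2
        s += 1
        stop_iters += 1
    L.append(s + L[m])                    # tail below n is already cached

    m = n                                 # the same chain, run all the way to 1
    while m != 1:
        m = 3 * m + 1 if m % 2 else m // 2
        full_iters += 1
```

The early-stop total is a small fraction of the full total; a whole-vector pass sits near the "full" end of that range, while chunked passes move toward the "stop" end.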

This "chunking" is very easy to add: just wrap the main loop in one external loop (labelled "CHUNKS"), which updates chunk boundaries and sets "views" into the piddles, while the internal loop (labelled "ITERATIONS") does exactly the same as in the previous version. The $current piddle can now be of chunk size. The very first chunk includes the dummy 0th element (its size is CHUNK + 1); the final one can theoretically be as short as 1 single element.

For 1e7 numbers, the computation time drops by a factor of 3, from ~65 to ~21 seconds, and doesn't change much (variations stay at "noise" level) for chunk sizes ~2e4 .. ~2e5.

Next, the idea was that as soon as a chunk is completed, e.g. chunk 11..20 with MAX = 100 and CHUNK = 10, it should be extremely cheap/easy to immediately get Collatz $lengths for the even numbers in range 22..40 (an even 2n takes exactly one halving step to reach n), also marking these $seqs elements as BAD/completed. Though it would be almost as easily achieved "in due time", we don't need to consider/apply any masks at this early/immediate stage. Further, it then follows that we can simply set, in bulk and at the very beginning, each even $seqs element outside the 1st chunk to BAD, even if the CLs are unknown yet.
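The bulk step for even positions rests on one identity: an even 2n halves to n in one step, so its sequence is exactly one term longer. A quick Python check (mine, not from the post):

```python
def collatz_len(n):
    """Term count of the Collatz sequence starting at n."""
    s = 1
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        s += 1
    return s

# L(2n) = L(n) + 1 for every n -- the identity the '##' fragments exploit in bulk.
ok = all(collatz_len(2 * n) == collatz_len(n) + 1 for n in range(1, 1000))
```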

The 2 fragments marked with double '##' do exactly what the previous paragraph says. They can be freely disabled for experiments, though the speed-up is very modest, just ~0.5s.

Further "ideas"/considerations didn't result in palpable improvements (nor were pursued systematically).

If it's cheap to shift-left indexes and increment CLs for each chunk as we go, what if we also populate and keep a sparse (even-only) CL LUT in the range MAX..2*MAX? Well, this range starts being populated relatively late, and if there was a benefit from occasional additional cache hits, it seems to be cancelled out. No gain, no loss. The MAXLEN constant was introduced to check this.

I tried to add masks/slices/bads for $current, so that e.g. the line $current ++; applies to the valid subset only, especially since all elements at even positions are dead weight from the very start in chunks 2+. Well, the unconditional "en masse" cheap operation appears to be faster, regardless of uselessly tackling the "dead weight".

I also considered variable chunk sizes (such as "begin with very short", etc.), but gave up.

```use strict;
use warnings;
use feature 'say';
use PDL;

use List::Util;
BEGIN { *_min = \&List::Util::min;          # collision
        *_max = \&List::Util::max }         # with PDL

use constant MAX    => 1e7;
use constant TOP    => _min( 20, MAX );
use constant CHUNK  => _min( 8e4, MAX );    # but keep it even
use constant MAXLEN => MAX * 1;             # ?? # x(1..2)

use Time::HiRes 'time';
my $t = time;

my $seqs = sequence( longlong, 1 + MAX );
$seqs-> badvalue( 2 );                      # "2" is the agreed-upon BAD marker (reconstructed line)
$seqs-> inplace-> setbadif( $seqs < 3 );    # dummy 0th element, 1 and 2 start completed (reconstructed line)

$seqs-> slice([ CHUNK + 2, MAX, 2 ]) .= 2       ##
    if CHUNK + 2 <= MAX;                        ##

my $lengths = ones( short, 1 + MAXLEN );
$lengths-> inplace-> setbadif( $lengths == 1 ); # all lengths start as BAD (reconstructed line)
$lengths-> set( 1, 1 );
$lengths-> set( 2, 2 );
$lengths-> set( 4, 3 );

CHUNKS:
for ( my $from = my $to = 0; $to != MAX; $from = $to + 1 ) {
    $to = _min( $from + CHUNK, MAX );

    # "_c" is for "chunk"

    my $seqs_c    = $seqs->    slice([ $from, $to ]);
    my $lengths_c = $lengths-> slice([ $from, $to ]);
    my $current   = zeroes( short, nelem( $seqs_c ));

    ITERATIONS:
    while ( any $seqs_c-> isgood ) {

        my ( $seqs_c_odd, $current_odd )
            = where( $seqs_c, $current, $seqs_c & 1 );

        $current ++;
        $current_odd ++;                    # fused odd step counts twice (reconstructed line)
        ( $seqs_c_odd *= 3 ) ++;
        $seqs_c >>= 1;

        my ( $seqs_cap, $lengths_cap, $current_cap )
            = where( $seqs_c, $lengths_c, $current,
                     $seqs_c <= MAXLEN );

        my $lut = $lengths-> index( $seqs_cap );

        # "_f" is for "finished"

        my ( $seqs_f, $lengths_f, $lut_f, $current_f )
            = where( $seqs_cap, $lengths_cap, $lut, $current_cap,
                     $lut-> isgood );

        $lengths_f .= $lut_f + $current_f;
        $seqs_f    .= 2;                    # i.e. BAD
    }

    # "_e" is for "at even positions, ahead"                    ##
                                                                ##
#   my $from_e = _max( $from * 2, $to ) + 2;        # bug       ##
    my $from_e = $from == 0 ? $to + 2 : $from * 2;  # fixed     ##
    my $to_e   = _min( $to * 2, MAXLEN );                       ##
                                                                ##
    ( $lengths-> slice([ $from_e, $to_e, 2 ])                   ##
        .= $lengths-> slice([ $from_e / 2, $to_e / 2 ])) ++     ##
        if $from_e <= MAXLEN                                    ##
}

# same finale

$lengths = $lengths-> slice([ 1, MAX ]);

my $sorted_i = $lengths-> qsorti;
my $sorted   = $lengths-> index( $sorted_i );
my $value    = $sorted-> at( MAX - TOP );
my $pos      = vsearch_insert_leftmost( $value, $sorted );
my $top_i    = $sorted_i-> slice([ MAX - 1, $pos ]);

( my $result = $lengths
    -> index( $top_i )
    -> longlong
    -> bitnot
    -> cat( $top_i + 1 )
    -> transpose
    -> qsortvec
    -> slice([], [ 0, TOP - 1 ])

)-> slice([ 0 ], [])
 -> inplace
 -> bitnot;

say $result;
say time - $t;

__END__
Edit (bug fix). :(( With the "dummy 0th element prepended", my intention for chunks was e.g. "0-10, 11-20, 21-30, ...", i.e. the 1st is one longer. Then I fooled with "better presentation", going from an "infinite" while loop, to a while loop with an explicit condition, to a for loop, with slightly different formulas for from, to, from_e, to_e, and messed up. Some even lengths weren't calculated, as seen with MAX and CHUNK set to e.g. 10 and 4. The easiest fix now is to leave the chunks all equal (CHUNK + 1); one LOC fixed (see "fixed") above. Sorry.
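The fixed boundary formulas can be checked mechanically. A Python rendering (mine) of the for-loop header and the from_e/to_e lines, asserting that the even look-ahead ranges tile all even positions above the first chunk exactly once (taking MAXLEN = MAX, as in the script):

```python
MAX, CHUNK = 1000, 40        # CHUNK kept even, as the Perl comment requires

covered = []
frm = to = 0
while to != MAX:
    to = min(frm + CHUNK, MAX)                # chunk is [frm, to], all equal size
    frm_e = to + 2 if frm == 0 else frm * 2   # the "fixed" line
    to_e = min(to * 2, MAX)
    if frm_e <= MAX:
        covered.extend(range(frm_e, to_e + 1, 2))
    frm = to + 1
```

Every even position from CHUNK + 2 up to MAX is produced once, in order, with no gaps and no duplicates — which is exactly what the buggy `_max( $from * 2, $to ) + 2` formula failed to guarantee.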

Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
by Laurent_R (Canon) on Apr 14, 2020 at 11:16 UTC
Hi Nick,

for some reason, I somehow missed your very interesting post and the rest of this thread until just now.

In fact, a couple of days after I submitted my solution to the Perl Weekly Challenge and posted my blog post you refer to above, I figured out that my caching strategy was in fact quite inefficient: the program doesn't need to store in the cache the full sequence, it would be enough to just store the number of its items. And that reduces considerably the overhead of the cache.

I did not try this change in Perl so far, but I did it with the Raku solution. Changing the caching strategy made the Raku program 6 times faster. It seems the changes I made are quite similar to what vr suggested. These results are described in another blog that I haven't completed yet, but will be published hopefully soon.

Now that I see that you have brought this up, I feel compelled to implement this change in the Perl version. I don't have time right now, but will come back with my updated version and the relevant timings soon. (Update: see below.)

Thank you anyway for your very interesting post and for the just as interesting discussion it triggered.

Node Type: perlmeditation [id://11115088]
Approved by marto
Front-paged by marto