http://www.perlmonks.org?node_id=1208610

biosub has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone,

I'm a bit of a newbie in Perl; I'm a biologist who codes for bioinformatics and I don't come across Perl that often. But here I am...

I'm trying to implement parallelization in someone else's code that searches through a very big parameter space (i.e. all the possible combinations of the possible values of the defined parameters).
The whole script (here's just a minimal example; the computation after the loops is excluded) takes around 4 days to run, and I need to run it with several different settings... hence the need for parallelization.

I managed to use Parallel::ForkManager for the rest of the code, but I'm stuck with these nested loops. Here's the code:

use strict;
use warnings;
use Parallel::ForkManager;

my $fork_manager = Parallel::ForkManager->new( 32 );
my @p = split( ",", "1,0.6,0.4,0.1,0.6,0,0.4,0.4,1,0.5,1,1,1" );
my @param_arr = ();

$fork_manager->run_on_finish( sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump,
        $data_structure_reference) = @_;         # retrieve data structure from child
    if (defined($data_structure_reference)) {    # if children pass a string (they all should here, and it should be the $temp_out)
        my $out = ${$data_structure_reference};
        push @param_arr, $out;                   # append string passed by child to array
    } else {                                     # issue warning if not having output as expected
        print "No message received from child process $pid!\n";
    };
} );

# my $fm_outer_1 = new Parallel::ForkManager -> new ( 6 )

# loop through desired combinations
for ($p[0]=0; $p[0]<=1; $p[0]+=0.2){
for ($p[1]=0; $p[1]<=1; $p[1]+=0.2){
for ($p[2]=0; $p[2]<=1; $p[2]+=0.2){
for ($p[3]=0; $p[3]<=1; $p[3]+=0.2){
for ($p[4]=0; $p[4]<=1; $p[4]+=0.2){
for ($p[5]=0; $p[5]<=1; $p[5]+=0.2){
for ($p[6]=0; $p[6]<=1; $p[6]+=0.2){
for ($p[7]=0; $p[7]<=1; $p[7]+=0.2){
for ($p[8]=0; $p[8]<=1; $p[8]+=0.2){
for ($p[9]=0; $p[9]<=1; $p[9]+=0.2){
    # $fm_outer_1 -> start and next;
    for ($p[10]=0; $p[10]<=1; $p[10]+=0.2){
        $fork_manager->start and next;
        $p[11]=1;
        $p[12]=1;
        #-------------
        $temp_out = "$p[0]\t$p[1]\t$p[2]\t$p[3]\t$p[4]\t$p[5]\t$p[6]\t$p[7]\t$p[8]\t$p[9]\t$p[10]\t$p[11]\t$p[12]\n";
        #-------------
        $fork_manager->finish(0, \$temp_out)     # return $temp_out to parent
    };
    # $fm_outer_1 -> finish()
};};};};};};};};};};

$fork_manager->wait_all_children();
# $fm_outer_1 -> wait_all_children();

# fun with @param_arr..
The problem is that with the forking inside the innermost loop it only starts at most 6 processes, because with the current settings each loop runs 6 times. While an improvement, this is not what I'm aiming for.
I tried putting the forking higher in the loop structure: it does run more processes, but it only returns the last combination of the loop.
I tried nested forking as suggested here https://www.perlmonks.org/?node_id=973304 (the commented-out code with the $fm_outer_1 variable), but it only issues an endless repetition of the warning:
Cannot start another process while you are in the child process at /usr/.../ForkManager.pm line 467
and a few interspersed warnings of children not returning anything to the parent.
I tried setting up data retrieval as for the $fork_manager and shuffling around the positions of the wait_all_children() calls, but without luck.

I resort to the Monks for help in either writing proper parallelization (I'm running on a 64-core server, so I'd like to use the power that's there ;) ) or, if one exists, pointing me to some package/module that can do the same thing in a cleaner way than writing a bunch of nested loops.

I hope everything is clear!
Thanks!!!

Replies are listed 'Best First'.
Re: Parallelization of multiple nested loops
by BrowserUk (Patriarch) on Feb 07, 2018 at 13:44 UTC

    Starting a new process to construct a single string and then pass it back to the parent (and its parent and its parent...) for accumulation is crazy -- and as you know, slow.

    This code writes the same data to a file in less than 1 hour:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];
    use Algorithm::Combinatorics qw[ variations_with_repetition ];

    my $start = time;
    my @in = qw[ 0.0 0.2 0.4 0.6 0.8 1.0 ];
    my $iter = variations_with_repetition( \@in, 11 );
    my $c = 0;
    while( $_ = $iter->next ) {
        print join "\t", @$_, 1, 1;
        ++$c;
    }
    printf STDERR "Took: %f seconds [$c]\n", time() - $start;
    __END__
    C:\test>1208610 | wc -l
    Took: 3444.838110 seconds [362797056]
    362797056

    And you can read it from that file, one line at a time to avoid blowing your memory, in about 5 minutes.
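    For instance, a minimal sketch of that streaming read (the file name params.txt is just a placeholder):

    use strict;
    use warnings;

    open my $fh, '<', 'params.txt' or die "Can't open params.txt: $!\n";
    while (my $line = <$fh>) {
        chomp $line;
        my @p = split /\t/, $line;   # the 13 tab-separated parameter values
        # ... run the real computation with @p here ...
    }
    close $fh;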


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
      This works like a charm! 20 min on my computer :) Thanks a LOT
Re: Parallelization of multiple nested loops
by Eily (Monsignor) on Feb 07, 2018 at 10:22 UTC

    I tried putting the forking higher in the loop structure: it does run more procesess but it only returns the last combination of the loop.
    Well of course it does, because you overwrite $temp_out every time. Or, seen another way: each child will only finish once, so you can only expect one value out of it. One way you could make this work is by having the children return arrays instead of strings.

    And by the way, your $temp_out isn't declared anywhere, so with strict this means that your program doesn't compile. Without strict, it is a global(ish) variable, and since every child writes to it, this makes your script a little confusing. Of course you could check whether, with ForkManager, each child has its own copy of the variable, but it's just better practice to make it clear that they each have their own version. So create a new array for each child:

    for my $outer (1..10) {
        $fork_manager->start() and next;
        {
            my @output;
            for my $inner (1..10) {
                push @output, "[$outer] [$inner]";
            }
            $fork_manager->finish(0, \@output);
        }
    }
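    On the parent side, the run_on_finish callback then dereferences an array instead of a scalar; a minimal sketch of the matching change:

    $fork_manager->run_on_finish(sub {
        my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data) = @_;
        if (defined $data) {
            push @param_arr, @{$data};   # flatten the child's array into the parent's list
        } else {
            print "No message received from child process $pid!\n";
        }
    });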

    Edit: also, do think about benchmarking the result. The data sent from child to parent goes through the hard drive (where accesses are slower than CPU operations), so depending on how heavy the processing in the inner loop is, using all available cores might not be the best solution (if only because they all access the same drive, so one will have to wait for another to finish before writing).

Re: Parallelization of multiple nested loops
by salva (Canon) on Feb 07, 2018 at 11:51 UTC
    I would like to add that usually you should avoid using non-integer numbers as control variables in loops. The float representation used by the computer may introduce rounding errors and break your logic.

    For instance, on my computer:

    for ($i = 0; $i < 1; $i += 0.2) {}; say $i; #==> 1.0
    for ($i = 0; $i < 2; $i += 0.2) {}; say $i; #==> 2.2

    So, the first loop works right, but the second one unexpectedly runs an extra iteration!
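    The underlying cause is that 0.2 has no exact binary representation, so the accumulated sum drifts; a one-liner demonstration:

    $ perl -E 'say 0.2 + 0.2 + 0.2 == 0.6 ? "equal" : "not equal"'
    not equal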

    The right approach in these cases is to use an intermediate integer variable:

    for ($ix = 0; $ix < 10; $ix++) {
        my $i = $ix * 0.2;
        ...
    }

      salva, biosub:

      I'd suggest something like:

      # customize to suit
      my @param_space = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0);
      for my $i (@param_space) {
          for my $j (@param_space) {
              ...
          }
      }

      This clarifies the code a little and might remove a few operations from the optree. It makes it simple to change the distribution (if desired), and you can use more arrays if you want to treat some parameters slightly differently.

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

Re: Parallelization of multiple nested loops
by QM (Parson) on Feb 07, 2018 at 11:53 UTC
    You could also rework the nested loops into permutations or globs, and just have a single loop with an iterator.

    I would naively suggest glob, but for your example, it eats up memory, and probably doesn't finish:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # dummy values for @p, to get the last values not in a loop
    my @p = 0..12;

    # a single value
    # (this can be generalized, as parameters to a function to generate this string)
    my $glob_token = '{0.0,0.2,0.4,0.6,0.8,1.0}';

    # join into multiple value string
    # (must use non-whitespace, as the builtin glob treats whitespace special)
    my $glob_parm = join(",", ($glob_token) x 11);

    # Loop
    for my $parm_list (glob($glob_parm)) {
        # now fix the commas
        $parm_list =~ s/,/\t/g;
        my $temp_out = "$parm_list\t$p[11]\t$p[12]\n";
        print $temp_out;
    }

    A similar solution using a combination iterator would work. I'll have to see if I can find my array-of-arrays permutation iterator.
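    In the meantime, here's a minimal sketch of such an iterator (a hand-rolled odometer over an array of value-arrays; the names are illustrative):

    sub make_iter {
        my @lists = @_;              # one array-ref of values per parameter
        my @idx   = (0) x @lists;
        my $done  = 0;
        return sub {
            return if $done;
            my @combo = map { $lists[$_][ $idx[$_] ] } 0 .. $#lists;
            # advance the rightmost index, carrying like an odometer
            for (my $i = $#idx; $i >= 0; $i--) {
                last if ++$idx[$i] <= $#{ $lists[$i] };
                $idx[$i] = 0;
                $done = 1 if $i == 0;
            }
            return \@combo;
        };
    }

    my $iter = make_iter( ( [ 0.0, 0.2, 0.4, 0.6, 0.8, 1.0 ] ) x 11 );
    while ( my $combo = $iter->() ) {
        # one combination per call, as an array-ref
    }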

    Update: This node on this thread should help.

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: Parallelization of multiple nested loops
by salva (Canon) on Feb 07, 2018 at 10:51 UTC
    You are probably forking too much.

    Instead of forking in the innermost loop, do it at the second or third level, so that instead of launching 6**11 little processes, you launch just a few hundred.

    Update: Oh, sorry, I didn't read your post fully. It seems you have already tried to do that. If your problem is getting the results back, the simplest solution in my experience is to have every process write its part of the computation into a file and then have a last stage where all the partial outputs are merged. In most cases this is also good enough in terms of computational cost.
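    A minimal sketch of that write-then-merge pattern (forking at the third level, so 6**3 = 216 jobs; the file names are illustrative):

    use strict;
    use warnings;
    use Parallel::ForkManager;

    my $pm = Parallel::ForkManager->new(32);

    for my $job (0 .. 215) {               # e.g. one job per (p0, p1, p2) combination
        $pm->start and next;
        open my $fh, '>', "part.$job.tmp" or die "Can't write part.$job.tmp: $!\n";
        # ... compute this slice of the parameter space, printing to $fh ...
        close $fh;
        $pm->finish(0);
    }
    $pm->wait_all_children;

    # last stage: merge the partial outputs
    open my $out, '>', 'merged.txt' or die "Can't write merged.txt: $!\n";
    for my $job (0 .. 215) {
        open my $in, '<', "part.$job.tmp" or die "Can't read part.$job.tmp: $!\n";
        print {$out} $_ while <$in>;
        close $in;
        unlink "part.$job.tmp";
    }
    close $out;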

    The alternative (which is what your code is really doing under the hood by using the on-finish hooks) is to have the slave processes serialize the partial results and pipe them to the master process, which then merges all of them. You avoid the cost of writing and reading the intermediate data through the file system, but on the other hand everything has to be in RAM.

      This contradicts the OP's observation:

      The problem is that with the forking inside the last loop it only starts max 6 processes
      I did not notice that at first, but reading your post made me realize there might be something wrong here. Maybe the inner loop is so fast that the first child finishes before the seventh even starts? That would make the loop a pretty bad candidate for parallelisation. Then again, this might be true of this test code, but not of the actual program.

        There may be at most 6 processes running in parallel. But my numbers refer to the total number of processes created and destroyed during the complete computation.

        My point is that starting and stopping processes has a not-insignificant overhead. If you start too many, the total cost of forking may become a significant percentage of the total cost of the computation.
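        A quick way to see that overhead (a sketch using only core modules; the iteration count is arbitrary):

        use strict;
        use warnings;
        use Time::HiRes qw( time );

        my $start = time;
        for (1 .. 1000) {
            my $pid = fork() // die "fork failed: $!";
            exit 0 if $pid == 0;    # child does no work at all
            waitpid($pid, 0);
        }
        printf "1000 bare fork/exit cycles took %.3f seconds\n", time() - $start;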

Re: Parallelization of multiple nested loops
by pryrt (Abbot) on Feb 07, 2018 at 17:31 UTC

    There have been a lot of good replies, but I'm going to recommend making things as generic as possible: the chances are, you're going to want to do something similar again later, but maybe with a different number of parameters, or maybe one or more parameter uses a different set of values. Thus, wrap it in a function where you can programmatically change those.

    Note: with hardcoded loops, like you show, if you want to change the number of parameters, you need to copy/paste more nested loops (and fix the indentation). And if you want to vary 11 of the 13 parameters on one run, but only 5 of the 13 parameters on the next run, then you have to remove (or comment out) some of the levels of nesting. That is not easy for on-the-fly changes.

    If all the parameters want to take on the same values, then BrowserUK's solution is excellent (I need to learn to think in terms of true iterators and Algorithm::Combinatorics for such problems). But if one or more of the parameters needs a different set of values, then I don't see it being easily extensible (updated phrasing).

    When I see loops nested more than about 2 deep, especially when all the work is only in the deepest nest, and doubly especially when the iterators are themselves an array, I think of converting it to one external loop on an iterator index, and one internal loop to generate the parameter array. After the parameter array is generated, but inside the iterator loop, then do the processing.

    So, combining those, here's a suggested implementation, where the action taken is just printing the results, either to STDOUT or to a file.
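    A minimal sketch of that shape (the function name and the value lists are illustrative, assumed rather than taken verbatim from the original writeup):

    use strict;
    use warnings;

    # One value list per varied parameter; add or remove lists to change
    # how many parameters are swept, or give one parameter its own list.
    sub sweep_params {
        my ($out_fh, @value_lists) = @_;
        my $total = 1;
        $total *= scalar(@$_) for @value_lists;

        # external loop: one iterator index per combination
        for my $n (0 .. $total - 1) {
            # internal loop: decode the index into a parameter array (mixed radix)
            my ($rem, @p) = ($n);
            for my $list (@value_lists) {
                push @p, $list->[ $rem % @$list ];
                $rem = int( $rem / @$list );
            }
            print {$out_fh} join("\t", @p, 1, 1), "\n";
        }
    }

    my @steps = map { $_ * 0.2 } 0 .. 5;
    sweep_params( \*STDOUT, ( \@steps ) x 11 );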

    While thinking about it after I tested, I would actually add another parameter to the function, one which expects a CODEREF; that way you could run $function->(@p) in place of the print statement, and make it even more generic.

Re: Parallelization of multiple nested loops
by ikegami (Patriarch) on Feb 07, 2018 at 23:50 UTC

    You are actually making your program slower by forking! You are doing so little in the children that deserializing the results from the child is slower than actually generating them in the first place! You need to do a lot more work per child.

    You had the right idea about including more levels of looping in the child.

    use Algorithm::Loops      qw( NestedLoops );
    use Parallel::ForkManager qw( );

    my $num_jobs_log6 = 4;    # 6*6*6*6 jobs.
    my $num_workers   = 64;   # 20.25 jobs/worker (on average).

    my $pm = Parallel::ForkManager->new($num_workers);

    my @param_arr;
    $pm->run_on_finish(sub {
        my ($pid, $exit_code, $ident, $exit_signal, undef, $data) = @_;
        die("Child killed by signal $exit_signal\n") if $exit_signal;
        die("Child returned error $exit_code\n")     if $exit_code;
        push @param_arr, @$data;
    });

    my @ps = map { $_ * 0.2 } 0..5;

    my $outer_iter = NestedLoops([ ( \@ps ) x $num_jobs_log6 ]);
    while (my @p_outer = $outer_iter->()) {
        my $p_outer = join("\t", @p_outer);

        $pm->start() and next;

        my @child_output;
        my $inner_iter = NestedLoops([
            ( \@ps ) x ( 11 - $num_jobs_log6 ),
            [ 1 ],
            [ 1 ],
        ]);
        while (my @p_inner = $inner_iter->()) {
            push @child_output, $p_outer . "\t" . join("\t", @p_inner) . "\n";
        }

        $pm->finish(0, \@child_output);
    }
    $pm->wait_all_children();

    I'm still not convinced that this will be faster than doing it all in one process. You do more work in the child, but you also increase the amount of time needed to deserialize the result of the child.

    I think you are optimizing the wrong part of your code! If this part of the code is really slowing you down, just pre-generate the values in a file and load that file into the array.

    my @param_arr;
    {
        open(my $fh, '<', $qfn)
            or die("Can't open \"$qfn\": $!\n");
        @param_arr = <$fh>;
    }
Re: Parallelization of multiple nested loops
by ikegami (Patriarch) on Feb 07, 2018 at 23:20 UTC

    You shouldn't use non-integral step amounts, as you accumulate floating-point error each pass through the loop.

    For example,

    $ perl -E'for (my $i=0; $i<10; $i+=1) { say $i }' | wc -l
    10
    $ perl -E'for (my $i=0; $i<1; $i+=0.1) { say $i }' | wc -l
    11

    Replace

    for ($p[0]=0; $p[0]<=1; $p[0]+=0.2){

    with

    for (0..5) { $p[0] = $_*0.2;

    or with

    my @ps = map { $_ * 0.2 } 0..5;

    for (@ps) {
        $p[0] = $_;

    The second approach allows us to do the following:

    use Algorithm::Loops qw( NestedLoops );

    my @ps = map { $_ * 0.2 } 0..5;

    my $iter = NestedLoops([ ( \@ps ) x 11 ]);
    while (my @p = $iter->()) {
        my $temp_out = join("\t", @p, 1, 1)."\n";
        ...
    }
Re: Parallelization of multiple nested loops
by Anonymous Monk on Feb 07, 2018 at 11:44 UTC

    I'm trying to implement parallelization in someone else code that searches through a very big parameter space (a.k.a. all the possible combinations of the possible values of the defined parameters).
    Is it possible that you are solving an optimization problem? I.e., could you define a loss function which would show how far you are from your goal? If that's the case, take a look at existing optimization libraries such as NLopt, which take that function and arrive at the solution by faster means than space-exhausting search. They do make assumptions about the "good behaviour" of said function (i.e. it should decrease reasonably as you approach the solution), and some of them require partial derivatives of this function with respect to all optimized variables (but sometimes you can use a numeric finite-difference approximation instead).
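    For example, a loss gradient can be approximated numerically by central differences (a generic sketch, not tied to any particular library):

    # approximate d(loss)/d(p_i) for every parameter by central differences
    sub numeric_gradient {
        my ($loss, $params, $h) = @_;
        $h //= 1e-4;                      # finite-difference step size
        my @grad;
        for my $i (0 .. $#$params) {
            my @plus  = @$params;  $plus[$i]  += $h;
            my @minus = @$params;  $minus[$i] -= $h;
            push @grad, ( $loss->(\@plus) - $loss->(\@minus) ) / (2 * $h);
        }
        return \@grad;
    }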

    Unfortunately, there are no NLopt bindings for Perl right now (which may make most of my post moot). Twice unfortunately, most optimizing algorithms are single-threaded by design (but there are exceptions). But if you write your task in C or C++ (Fortran isn't likely to be the language of choice for bioinformatics) you could use OpenMP to launch several parallel optimizing threads with different settings.

    NB: It is usually considered a good thing to parallelize the outermost loops, not the innermost ones.

Re: Parallelization of multiple nested loops
by marioroy (Prior) on Feb 09, 2018 at 13:46 UTC

    Hi biosub,

    I tried a second attempt after not liking my initial one. The demonstrations that follow run on machines with 32 GiB of RAM, minimally.

    6 workers: 3.2x faster than Algorithm::Combinatorics on machines with 6 real-cores

    use strict;
    use warnings;

    # Run on UNIX machines with 32+ GiB of RAM.
    # Otherwise, remove the -use_dev_shm argument.
    # Beware, consumes 14 GiB in temp dir.

    use MCE::Signal qw[ $tmp_dir -use_dev_shm ];
    use Time::HiRes qw[ time ];
    use MCE::Loop;
    use MCE::Shared;

    die "Not UNIX OS\n" if $^O eq 'MSWin32';

    # usage: script.pl > out

    my $start    = time;
    my $c_shared = MCE::Shared->scalar(0);

    MCE::Loop::init { max_workers => 6, chunk_size => 1 };

    # loop through desired combinations
    mce_loop_s {
        my ($p0,$c) = ($_,0);
        my ($p1,$p2,$p3,$p4,$p5,$p6,$p7,$p8,$p9,$p10);
        open my $fh, ">", "$tmp_dir/".MCE->chunk_id();
        for ($p1=0; $p1<=1; $p1+=0.2){
        for ($p2=0; $p2<=1; $p2+=0.2){
        for ($p3=0; $p3<=1; $p3+=0.2){
        for ($p4=0; $p4<=1; $p4+=0.2){
        for ($p5=0; $p5<=1; $p5+=0.2){
        for ($p6=0; $p6<=1; $p6+=0.2){
        for ($p7=0; $p7<=1; $p7+=0.2){
        for ($p8=0; $p8<=1; $p8+=0.2){
        for ($p9=0; $p9<=1; $p9+=0.2){
        for ($p10=0; $p10<=1; $p10+=0.2){
            #-------------
            print $fh "$p0\t$p1\t$p2\t$p3\t$p4\t$p5\t$p6\t$p7\t$p8\t$p9\t$p10\t1\t1\n";
            ++$c;
            #-------------
        }}}}}}}}}}
        close $fh;
        $c_shared->incrby($c);
    } 0, 1, 0.2;   # p0: seq_beg, seq_end, seq_step

    MCE::Loop::finish();
    system("cat $tmp_dir/[1-6]; rm -fr $tmp_dir");

    printf STDERR "Took: %0.3f seconds [%ld]\n", time() - $start, $c_shared->get();

    36 workers: use-case for a 64-way box (32 real-cores + 32 hyper-threads)

    This involves nested parallel loops, which are possible using MCE. The shared counter variable increments fine no matter how many levels deep; locking is handled automatically via the OO interface.

    use strict;
    use warnings;

    # Run on UNIX machines with 32+ GiB of RAM.
    # Otherwise, remove the -use_dev_shm argument.
    # Beware, consumes 14 GiB in temp dir.

    use MCE::Signal qw[ $tmp_dir -use_dev_shm ];
    use Time::HiRes qw[ time ];
    use MCE::Loop;
    use MCE::Shared;

    die "Not UNIX OS\n" if $^O eq 'MSWin32';

    # usage: script.pl > out

    my $start    = time;
    my $c_shared = MCE::Shared->scalar(0);

    MCE::Loop::init { max_workers => 6, chunk_size => 1 };

    # loop through desired combinations
    mce_loop_s {
        my $p0 = $_;

        MCE::Loop::init { max_workers => 6, chunk_size => 1 };
        $tmp_dir .= "/".MCE->chunk_id();
        mkdir $tmp_dir;

        mce_loop_s {
            my ($p1,$c) = ($_,0);
            my ($p2,$p3,$p4,$p5,$p6,$p7,$p8,$p9,$p10);
            open my $fh, ">", "$tmp_dir/".MCE->chunk_id();
            for ($p2=0; $p2<=1; $p2+=0.2){
            for ($p3=0; $p3<=1; $p3+=0.2){
            for ($p4=0; $p4<=1; $p4+=0.2){
            for ($p5=0; $p5<=1; $p5+=0.2){
            for ($p6=0; $p6<=1; $p6+=0.2){
            for ($p7=0; $p7<=1; $p7+=0.2){
            for ($p8=0; $p8<=1; $p8+=0.2){
            for ($p9=0; $p9<=1; $p9+=0.2){
            for ($p10=0; $p10<=1; $p10+=0.2){
                #-------------
                print $fh "$p0\t$p1\t$p2\t$p3\t$p4\t$p5\t$p6\t$p7\t$p8\t$p9\t$p10\t1\t1\n";
                ++$c;
                #-------------
            }}}}}}}}}
            close $fh;
            $c_shared->incrby($c);
        } 0, 1, 0.2;   # p1: seq_beg, seq_end, seq_step

        MCE::Loop::finish();
    } 0, 1, 0.2;   # p0: seq_beg, seq_end, seq_step

    MCE::Loop::finish();
    system("cat $tmp_dir/$_/[1-6]; rm -fr $tmp_dir/$_") for 1..6;

    printf STDERR "Took: %0.3f seconds [%ld]\n", time() - $start, $c_shared->get();

    Results: taken from a 4.2 GHz machine with 8 real-cores, hyper-threads disabled

    Combinatorics : 459.752 seconds
    6 workers     : 145.695 seconds
    36 workers    : 109.134 seconds   <- my CPU has 8 cores

    Consuming 32 real-cores and a little more is possible on a 64-way box. Afterwards, one may use MCE or a parallel module of choice to process the output file in parallel.
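    For that post-processing step, reading the output file in parallel might look something like this (a sketch; "out" stands for the merged output file above):

    use MCE::Loop;

    MCE::Loop::init { max_workers => 8, use_slurpio => 1 };

    # each worker receives a slurped chunk of the file
    mce_loop_f {
        my ($mce, $slurp_ref, $chunk_id) = @_;
        for my $line ( split /\n/, ${ $slurp_ref } ) {
            my @p = split /\t/, $line;
            # ... process one parameter combination ...
        }
    } "out";

    MCE::Loop::finish();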

    Disclaimer: My Linux box is tuned to 4.2 GHz on all 8 cores. This is not common. What to take from this is that nested parallel loops are possible with care. On Linux, /dev/shm is beneficial for temporary storage.

    Regards, Mario

      I tried repetition using Inline C. To ensure Inline C does not clobber file handles with MCE's IPC handles, I open file handles in Perl and pass file descriptors to C.

      6 workers: 6.2x faster than Algorithm::Combinatorics on machines with 6 real-cores

      use strict;
      use warnings;

      die "Not UNIX OS\n" if $^O eq 'MSWin32';

      # usage: script.pl > out

      use Inline 'C' => Config => CCFLAGSEX => '-O2';
      use Inline 'C' => <<'END_C';
      #include <stdio.h>

      unsigned long c_repetition (int fd, float p0)
      {
          unsigned long c = 0;
          FILE *stream = fdopen(fd, "wb");
          float p1,p2,p3,p4,p5,p6,p7,p8,p9,p10;

          for (p1=0; p1<=1; p1+=0.2){
          for (p2=0; p2<=1; p2+=0.2){
          for (p3=0; p3<=1; p3+=0.2){
          for (p4=0; p4<=1; p4+=0.2){
          for (p5=0; p5<=1; p5+=0.2){
          for (p6=0; p6<=1; p6+=0.2){
          for (p7=0; p7<=1; p7+=0.2){
          for (p8=0; p8<=1; p8+=0.2){
          for (p9=0; p9<=1; p9+=0.2){
          for (p10=0; p10<=1; p10+=0.2){
              //-------------
              fprintf(stream, "%0.1f\t%0.1f\t%0.1f\t%0.1f\t%0.1f\t%0.1f\t%0.1f\t%0.1f\t%0.1f\t%0.1f\t%0.1f\t1.0\t1.0\n",
                  p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10);
              c++;
              //-------------
          }}}}}}}}}}

          fflush(stream);
          fclose(stream);

          return c;
      }
      END_C

      # Run on UNIX machines with 48+ GiB of RAM.
      # Otherwise, remove the -use_dev_shm argument.
      # Beware, consumes 18 GiB in temp dir.

      use MCE::Signal qw[ $tmp_dir -use_dev_shm ];
      use Time::HiRes qw[ time ];
      use MCE::Loop;
      use MCE::Shared;

      my $start    = time;
      my $c_shared = MCE::Shared->scalar(0);

      MCE::Loop::init { max_workers => 6, chunk_size => 1 };

      # loop through desired combinations
      mce_loop_s {
          my $p0 = $_;
          open my $fh, ">", "$tmp_dir/".MCE->chunk_id();
          my $c = c_repetition(fileno($fh), $p0);
          close $fh;
          $c_shared->incrby($c);
      } 0.0, 1.0, 0.2, '%0.1f';   # p0: seq_beg, seq_end, seq_step, format

      MCE::Loop::finish();
      system("cat $tmp_dir/[1-6]; rm -fr $tmp_dir");

      printf STDERR "Took: %0.3f seconds [%ld]\n", time() - $start, $c_shared->get();

      36 workers: use-case for a 64-way box (32 real-cores + 32 hyper-threads)

      This involves nested parallel loops, which are possible using MCE. The shared counter variable increments fine no matter how many levels deep; locking is handled automatically via the OO interface.

      use strict;
      use warnings;

      die "Not UNIX OS\n" if $^O eq 'MSWin32';

      # usage: script.pl > out

      use Inline 'C' => Config => CCFLAGSEX => '-O2';
      use Inline 'C' => <<'END_C';
      #include <stdio.h>

      unsigned long c_repetition (int fd, float p0, float p1)
      {
          unsigned long c = 0;
          FILE *stream = fdopen(fd, "wb");
          float p2,p3,p4,p5,p6,p7,p8,p9,p10;

          for (p2=0; p2<=1; p2+=0.2){
          for (p3=0; p3<=1; p3+=0.2){
          for (p4=0; p4<=1; p4+=0.2){
          for (p5=0; p5<=1; p5+=0.2){
          for (p6=0; p6<=1; p6+=0.2){
          for (p7=0; p7<=1; p7+=0.2){
          for (p8=0; p8<=1; p8+=0.2){
          for (p9=0; p9<=1; p9+=0.2){
          for (p10=0; p10<=1; p10+=0.2){
              //-------------
              fprintf(stream, "%0.1f\t%0.1f\t%0.1f\t%0.1f\t%0.1f\t%0.1f\t%0.1f\t%0.1f\t%0.1f\t%0.1f\t%0.1f\t1.0\t1.0\n",
                  p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10);
              c++;
              //-------------
          }}}}}}}}}

          fflush(stream);
          fclose(stream);

          return c;
      }
      END_C

      # Run on UNIX machines with 48+ GiB of RAM.
      # Otherwise, remove the -use_dev_shm argument.
      # Beware, consumes 18 GiB in temp dir.

      use MCE::Signal qw[ $tmp_dir -use_dev_shm ];
      use Time::HiRes qw[ time ];
      use MCE::Loop;
      use MCE::Shared;

      my $start    = time;
      my $c_shared = MCE::Shared->scalar(0);

      MCE::Loop::init { max_workers => 6, chunk_size => 1 };

      # loop through desired combinations
      mce_loop_s {
          my $p0 = $_;

          MCE::Loop::init { max_workers => 6, chunk_size => 1 };
          $tmp_dir .= "/".MCE->chunk_id();
          mkdir $tmp_dir;

          mce_loop_s {
              my $p1 = $_;
              open my $fh, ">", "$tmp_dir/".MCE->chunk_id();
              my $c = c_repetition(fileno($fh), $p0, $p1);
              close $fh;
              $c_shared->incrby($c);
          } 0.0, 1.0, 0.2, '%0.1f';   # p1: seq_beg, seq_end, seq_step, format

          MCE::Loop::finish();
      } 0.0, 1.0, 0.2, '%0.1f';   # p0: seq_beg, seq_end, seq_step, format

      MCE::Loop::finish();
      system("cat $tmp_dir/$_/[1-6]; rm -fr $tmp_dir/$_") for 1..6;

      printf STDERR "Took: %0.3f seconds [%ld]\n", time() - $start, $c_shared->get();

      Results: taken from a 4.2 GHz machine with 8 real-cores, hyper-threads disabled

      Combinatorics : 459.752 seconds
      6 workers     : 74.394 seconds
      36 workers    : 58.021 seconds   <- my CPU has 8 cores

      Consuming 32 real-cores and a little more is possible on a 64-way box. Afterwards, one may use MCE or a parallel module of choice to process the output file in parallel.

      Disclaimer: My Linux box is tuned to 4.2 GHz on all 8 cores. This is not common. What to take from this is that nested parallel loops are possible with care. On Linux, /dev/shm is beneficial for temporary storage.

      Regards, Mario

        I tried again by removing the overhead associated with fprintf. To ensure Inline C does not clobber file handles with MCE's IPC handles, I open file handles in Perl and pass file descriptors to C.

        6 workers: 25x faster than Algorithm::Combinatorics on machines with 6 real-cores

        use strict;
        use warnings;

        die "Not UNIX OS\n" if $^O eq 'MSWin32';

        # usage: script.pl > out

        use Inline 'C' => Config => CCFLAGSEX => '-O2';
        use Inline 'C' => <<'END_C';
        #include <stdio.h>

        void c_fput_float(float value, char c, FILE *stream)
        {
            static char buf[] = "0.0\n";
            int whole = (int) value;
            int frac  = (int) ((value - whole) * 10);

            buf[0] = '0' + whole;
            buf[2] = '0' + frac;
            buf[3] = c;

            fputs(buf, stream);
        }

        unsigned long c_repetition (int fd, float p0)
        {
            unsigned long count = 0;
            FILE *stream = fdopen(fd, "wb");
            float p1,p2,p3,p4,p5,p6,p7,p8,p9,p10;

            for (p1=0; p1<=1; p1+=0.2) {
            for (p2=0; p2<=1; p2+=0.2) {
            for (p3=0; p3<=1; p3+=0.2) {
            for (p4=0; p4<=1; p4+=0.2) {
            for (p5=0; p5<=1; p5+=0.2) {
            for (p6=0; p6<=1; p6+=0.2) {
            for (p7=0; p7<=1; p7+=0.2) {
            for (p8=0; p8<=1; p8+=0.2) {
            for (p9=0; p9<=1; p9+=0.2) {
            for (p10=0; p10<=1; p10+=0.2) {
                c_fput_float(p0,  '\t', stream);
                c_fput_float(p1,  '\t', stream);
                c_fput_float(p2,  '\t', stream);
                c_fput_float(p3,  '\t', stream);
                c_fput_float(p4,  '\t', stream);
                c_fput_float(p5,  '\t', stream);
                c_fput_float(p6,  '\t', stream);
                c_fput_float(p7,  '\t', stream);
                c_fput_float(p8,  '\t', stream);
                c_fput_float(p9,  '\t', stream);
                c_fput_float(p10, '\t', stream);
                c_fput_float(1.0, '\t', stream);
                c_fput_float(1.0, '\n', stream);
                count++;
            }}}}}}}}}}

            fflush(stream);
            fclose(stream);

            return count;
        }
        END_C

        # Run on UNIX machines with 48+ GiB of RAM.
        # Otherwise, remove the -use_dev_shm argument.
        # Beware, consumes 18 GiB in temp dir.

        use MCE::Signal qw[ $tmp_dir -use_dev_shm ];
        use Time::HiRes qw[ time ];
        use MCE::Loop;
        use MCE::Shared;

        my $start    = time;
        my $c_shared = MCE::Shared->scalar(0);

        MCE::Loop::init { max_workers => 6, chunk_size => 1 };

        # loop through desired combinations
        mce_loop_s {
            my $p0 = $_;
            open my $fh, ">", "$tmp_dir/".MCE->chunk_id();
            my $c = c_repetition(fileno($fh), $p0);
            close $fh;
            $c_shared->incrby($c);
        } 0.0, 1.0, 0.2, '%0.1f';   # p0: seq_beg, seq_end, seq_step, format

        MCE::Loop::finish();
        system("cat $tmp_dir/[1-6]; rm -fr $tmp_dir");

        printf STDERR "Took: %0.3f seconds [%ld]\n", time() - $start, $c_shared->get();

        36 workers: use-case for a 64-way box (32 real-cores + 32 hyper-threads)

        This involves nested parallel loops, which are possible using MCE. The shared counter variable increments fine no matter how many levels deep; locking is handled automatically via the OO interface.

        use strict;
        use warnings;

        die "Not UNIX OS\n" if $^O eq 'MSWin32';

        # usage: script.pl > out

        use Inline 'C' => Config => CCFLAGSEX => '-O2';
        use Inline 'C' => <<'END_C';
        #include <stdio.h>

        void c_fput_float(float value, char c, FILE *stream)
        {
            static char buf[] = "0.0\n";
            int whole = (int) value;
            int frac  = (int) ((value - whole) * 10);

            buf[0] = '0' + whole;
            buf[2] = '0' + frac;
            buf[3] = c;

            fputs(buf, stream);
        }

        unsigned long c_repetition (int fd, float p0, float p1)
        {
            unsigned long count = 0;
            FILE *stream = fdopen(fd, "wb");
            float p2,p3,p4,p5,p6,p7,p8,p9,p10;

            for (p2=0; p2<=1; p2+=0.2) {
            for (p3=0; p3<=1; p3+=0.2) {
            for (p4=0; p4<=1; p4+=0.2) {
            for (p5=0; p5<=1; p5+=0.2) {
            for (p6=0; p6<=1; p6+=0.2) {
            for (p7=0; p7<=1; p7+=0.2) {
            for (p8=0; p8<=1; p8+=0.2) {
            for (p9=0; p9<=1; p9+=0.2) {
            for (p10=0; p10<=1; p10+=0.2) {
                c_fput_float(p0,  '\t', stream);
                c_fput_float(p1,  '\t', stream);
                c_fput_float(p2,  '\t', stream);
                c_fput_float(p3,  '\t', stream);
                c_fput_float(p4,  '\t', stream);
                c_fput_float(p5,  '\t', stream);
                c_fput_float(p6,  '\t', stream);
                c_fput_float(p7,  '\t', stream);
                c_fput_float(p8,  '\t', stream);
                c_fput_float(p9,  '\t', stream);
                c_fput_float(p10, '\t', stream);
                c_fput_float(1.0, '\t', stream);
                c_fput_float(1.0, '\n', stream);
                count++;
            }}}}}}}}}

            fflush(stream);
            fclose(stream);

            return count;
        }
        END_C

        # Run on UNIX machines with 48+ GiB of RAM.
        # Otherwise, remove the -use_dev_shm argument.
        # Beware, consumes 18 GiB in temp dir.

        use MCE::Signal qw[ $tmp_dir -use_dev_shm ];
        use Time::HiRes qw[ time ];
        use MCE::Loop;
        use MCE::Shared;

        my $start    = time;
        my $c_shared = MCE::Shared->scalar(0);

        MCE::Loop::init { max_workers => 6, chunk_size => 1 };

        # loop through desired combinations
        mce_loop_s {
            my $p0 = $_;

            MCE::Loop::init { max_workers => 6, chunk_size => 1 };
            $tmp_dir .= "/".MCE->chunk_id();
            mkdir $tmp_dir;

            mce_loop_s {
                my $p1 = $_;
                open my $fh, ">", "$tmp_dir/".MCE->chunk_id();
                my $c = c_repetition(fileno($fh), $p0, $p1);
                close $fh;
                $c_shared->incrby($c);
            } 0.0, 1.0, 0.2, '%0.1f';   # p1: seq_beg, seq_end, seq_step, format

            MCE::Loop::finish();
        } 0.0, 1.0, 0.2, '%0.1f';   # p0: seq_beg, seq_end, seq_step, format

        MCE::Loop::finish();
        system("cat $tmp_dir/$_/[1-6]; rm -fr $tmp_dir/$_") for 1..6;

        printf STDERR "Took: %0.3f seconds [%ld]\n", time() - $start, $c_shared->get();

        Results: taken from a 4.2 GHz machine with 8 real-cores, hyper-threads disabled

        Combinatorics : 459.752 seconds
        6 workers     : 18.420 seconds
        36 workers    : 15.593 seconds   <- my CPU has 8 cores

        Consuming 32 real-cores and a little more is possible on a 64-way box. Afterwards, one may use MCE or a parallel module of choice to process the output file in parallel.

        Disclaimer: My Linux box is tuned to 4.2 GHz on all 8 cores. This is not common. What to take from this is that nested parallel loops are possible with care. On Linux, /dev/shm is beneficial for temporary storage.

        Regards, Mario

Re: Parallelization of multiple nested loops
by marioroy (Prior) on Feb 08, 2018 at 12:49 UTC

    Please delete this node. Garbage collection in Perl is a mystery. I removed the demonstrations so that folks do not run them. Regards, Mario

Re: Parallelization of multiple nested loops
by cmk (Initiate) on Apr 18, 2019 at 21:28 UTC

    I was able to produce the output file in just over 3 minutes on a crappy laptop. It's not parallel, it is not as simple as BrowserUK's suggested module, and it is certainly not as screamingly fast as marioroy's. But I wanted to share it in case someone finds it interesting. It should be straightforward to use Getopt::Std to make the values and the number of levels command-line options.

      My understanding:
    • This is similar to a lottery with each ball labeled 0 through 5, divided by 5, and there are 11 balls.
    • The goal is to output all of the approx. 362 million combinations (6**11), values separated by tabs, with two 1s appended to each line. I suspect this is just a format that's parsed outside this code portion, but I'll stick with it. If the same combinations are output as integers 0..5 without the 1s at the end of each line, the file is about half the size.
    #!/usr/bin/perl
    use warnings;
    use strict;

    # values
    my @val = qw(0.0 0.2 0.4 0.6 0.8 1.0);
    my $tiers = 11;

    # map array indices to values
    my $m = {};
    {
        my $i = int 0;
        map { $m->{$i++} = $_ } @val;
    }

    # first tier
    my $p = \@val;

    # create each additional tier skipping first
    # that's already in $p
    for (my $i = 2; $i <= $tiers; $i++) {
        my $tmp;
        map { $tmp->[$_] = $p; } keys %{$m};
        $p = $tmp;
    }

    # output file
    open(my $outfile, '>', '/tmp/output.txt') or die $!;

    # use recursion to descend the huge matrix
    # build up the string at each tier
    my $fn;
    $fn = sub {
        my ($aref, $str) = @_;
        for (my $i = int 0; $i < @{$aref}; $i++) {
            if (ref($aref->[$i])) {
                $fn->($aref->[$i], $str."\t".$val[$i]);
                next;
            }
            # end of the line, print last tier of values
            print $outfile $str."\t".$_."\t1\t1\n" for @val;
            last;
        }
    };

    # kick off the recursion, could do these in parallel
    # at the top-most layer
    for (my $i = int 0; $i < @{$p}; $i++) {
        $fn->($p->[$i], $val[$i]);
    }

    Unsurprisingly the I/O seems to take up a good deal of the time. It's a 17.4GB file with 362797056 lines, but perl only seems to take about 5MB of resident memory (27MB virtual) while running. I certainly wouldn't want to keep the output in memory, but the array-refs would be just fine to pass around.

    • Outputting the full list to file: real 3m10.826s, user 2m34.365s
    • Using 'wc -l' to count the lines in the file took: real 2m51.092s, user 0m3.696s
    • If I comment out the print: real 1m0.890s, user 1m0.775s
    • And finally, just initially generating the array-ref matrix, commenting out calling the recursion: real 0m0.010s, user 0m0.007s, using ddx to print the array-refs pushes it up to 0.034s.

    So depending on what else is being done, and how many times the parameters are changed, it might make sense to just hold the initial huge matrix of array-refs in memory and pull combinations off for further processing in batches.

      Hi, cmk

      Very cool! Seeing my old parallel code made me want to try again, aiming for a 100% pure-Perl solution not involving Inline::C. Serial code first, parallel code afterwards. Workers consume about 16 MB each. The parallel code runs 3 times faster compared to serial. This is made possible by having workers write to STDOUT directly.

      $ time perl serial.pl >/dev/null
      $ time perl parallel.pl >/dev/null

      Serial Demonstration

      use strict;
      use warnings;

      my @vals = qw( 0.0 0.2 0.4 0.6 0.8 1.0 );

      sub proc {
          for my $a ( @vals ) {
          for my $b ( @vals ) {
          for my $c ( @vals ) {
          for my $d ( @vals ) {
          for my $e ( @vals ) {
          for my $f ( @vals ) {
          for my $g ( @vals ) {
          for my $h ( @vals ) {
          for my $i ( @vals ) {
          for my $j ( @vals ) {
          for my $k ( @vals ) {
              print "$a\t$b\t$c\t$d\t$e\t$f\t$g\t$h\t$i\t$j\t$k\t1\t1\n";
          }}}}}}}}}}}
      }

      proc();

      Parallel Demonstration

      use strict;
      use warnings;

      use MCE;

      my @vals = qw( 0.0 0.2 0.4 0.6 0.8 1.0 );

      # Must autoflush because workers write to STDOUT directly.
      STDOUT->autoflush(1);

      sub proc {
          my $mce = MCE->new(
              max_workers => scalar(@vals),
              chunk_size  => 1,
              init_relay  => 1,
              user_func   => sub {
                  my ($a, $b, $c) = @{ MCE->user_args };
                  my ($buf, $d  ) = ( '', $_ );   # $d is the input
                  for my $e ( @vals ) {
                  for my $f ( @vals ) {
                  for my $g ( @vals ) {
                  for my $h ( @vals ) {
                  for my $i ( @vals ) {
                  for my $j ( @vals ) {
                  for my $k ( @vals ) {
                      $buf .= "$a\t$b\t$c\t$d\t$e\t$f\t$g\t$h\t$i\t$j\t$k\t1\t1\n";
                  }}}}}}}
                  # Relay is driven by the chunk_id value behind the scene.
                  # The benefit is orderly output, one worker at a time.
                  MCE::relay { print $buf };
              }
          )->spawn;

          for my $a ( @vals ) {
          for my $b ( @vals ) {
          for my $c ( @vals ) {
              # MCE workers persist between each run. The user_args option
              # is how to pass parameters to them.
              $mce->process({
                  input_data => \@vals,
                  user_args  => [ $a, $b, $c ],
              });
          }}}

          $mce->shutdown;
      }

      proc();

      Parallel happens at the 4th level to minimize memory consumption.

      Regards, Mario