in reply to Re: Parallel::ForkManager and multiple datasets
in thread Parallel::ForkManager and multiple datasets

Hi tybalt89,

That looks like a fun demonstration, and I gave it a spin. I commented out sleep and changed the input to 'job0001' .. 'job9999', then tried MCE implementations, mainly to compare overhead. MCE::Loop and MCE::Flow are wantarray-aware and construct the gather option, with a corresponding gather function, automatically.
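
For context, here is a minimal sketch of what wantarray-aware means here (my own illustration, not code from this thread): when mce_loop is called in list context, MCE sets up the gather option behind the scenes, so anything a worker passes to MCE->gather is collected into the returned list.

use strict;
use warnings;
use MCE::Loop;

MCE::Loop->init( max_workers => 4, chunk_size => 1 );

# List context: MCE constructs the gather option for us, so values
# passed to MCE->gather end up in @squares (order may vary across
# workers).
my @squares = mce_loop { MCE->gather( $_ * $_ ) } 1 .. 10;

MCE::Loop->finish;
print "@squares\n";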

chunk_size => 1

#!/usr/bin/perl
# https://perlmonks.org/?node_id=1217948
use strict;
use warnings;

use Data::Dump 'dd';
use Time::HiRes qw( time sleep );
use MCE::Loop;

my $maxforks = 20;
my @ids = 'job0001' .. 'job9999';   # it is cool that this works
my $start = time;

MCE::Loop->init( max_workers => $maxforks, chunk_size => 1 );

my @answers = mce_loop {
    my $id = $_;
    # sleep rand;
    my $ret = { id => $id, pid => $$, time => time - $start };
    MCE->gather($ret);
} \@ids;

MCE::Loop->finish;

dd \@answers;

chunk_size => 'auto'

#!/usr/bin/perl
# https://perlmonks.org/?node_id=1217948
use strict;
use warnings;

use Data::Dump 'dd';
use Time::HiRes qw( time sleep );
use MCE::Loop;

my $maxforks = 20;
my @ids = 'job0001' .. 'job9999';   # it is cool that this works
my $start = time;

MCE::Loop->init( max_workers => $maxforks, chunk_size => 'auto' );

my @answers = mce_loop {
    my @ret;
    for my $id ( @{ $_ } ) {
        # sleep rand;
        push @ret, { id => $id, pid => $$, time => time - $start };
    }
    MCE->gather(@ret);
} \@ids;

MCE::Loop->finish;

dd \@answers;

Disclaimer: this is comparing apples to oranges, because tybalt89's demonstration spawns a worker per element, whereas in the MCE demonstrations the workers request the next input element(s) from the manager process.
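
For readers unfamiliar with the per-element pattern, here is a rough Parallel::ForkManager sketch of the general idea (my illustration only; tybalt89's actual code forks via a piped open, as shown in the reply below):

use strict;
use warnings;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(20);   # at most 20 concurrent children

for my $id ( 'job0001' .. 'job9999' ) {
    $pm->start and next;   # parent continues the loop; the child falls through
    # ... per-id work would happen here, in the child ...
    $pm->finish;           # child exits; one fork is paid per element
}
$pm->wait_all_children;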

On my laptop (macOS, 9,999 iterations), tybalt89's example takes 16 seconds versus 0.6 and 0.5 seconds, respectively, for the two MCE demonstrations.

Kind regards, Mario

Re^3: Parallel::ForkManager and multiple datasets
by tybalt89 (Monsignor) on Jul 12, 2018 at 20:52 UTC

    Hi marioroy,

    As you said, it's apples to oranges, so let's try to be more appleish (or is it orangeish?).

    My first attempt was to group job ids so that a child does more than one during its lifetime. It turns out to be fairly simple.

    Right after I posted Re: Parallel::ForkManager and multiple datasets, I realized I had written roughly the same forking code several times, so it was time to move it to a module.

    Here's the module. It uses callbacks for the child code and for the parent code that processes the child's returned value.

    package Forking::Amazing;

    sub run ($&&@)
      {
      my ( $maxforks, $childcallback, $resultcallback, @ids ) = @_;

      use Storable qw( freeze thaw );
      use IO::Select;

      my %fh2id;
      my $sel = IO::Select->new;

      while( @ids or $sel->count )                  # unstarted or active
        {
        while( @ids and $sel->count < $maxforks )   # start all forks allowed
          {
          my $id = shift @ids;
          if( open my $fh, '-|' )                   # forking open
            {
            $sel->add( $fh );                       # parent
            $fh2id{$fh} = $id;
            }
          else                                      # child code goes here
            {
            print freeze $childcallback->($id);
            exit;
            }
          }
        for my $fh ( $sel->can_read )               # collecting child data
          {
          $sel->remove( $fh );
          $resultcallback->($fh2id{$fh}, thaw do { local $/; <$fh> });
          }
        }
      }

    1;

    __END__

    =head1 EXAMPLE program

      use Forking::Amazing;   # small example program
      use Data::Dump 'dd';

      Forking::Amazing::run(
        5,                                 # max forks
        sub { +{id => pop, pid => $$} },   # runs in child
        sub { dd pop },                    # process result of child in parent
        'a' .. 'z' );                      # ids (one fork for each id)

    =cut
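
    A note on the freeze/thaw round trip: Storable is what lets the child hand back an arbitrary Perl structure through the pipe rather than plain text. A minimal standalone sketch of that round trip (not part of the module):

    use strict;
    use warnings;
    use Storable qw( freeze thaw );

    # Child side: serialize any Perl data structure to a byte string.
    my $frozen = freeze( { id => 'job0001', pid => $$ } );

    # Parent side: reconstruct the structure after reading it back.
    my $data = thaw $frozen;
    print "$data->{id}\n";   # prints "job0001"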

    The module name may change in the future. :)

    Here's code using that module that does grouping of job ids.
    The id passed to the child is now an anon array of job ids, and a child now returns an anon array of results.

    #!/usr/bin/perl
    use strict;
    use warnings;

    use Forking::Amazing;
    use Data::Dump 'dd';
    use Time::HiRes qw(time);

    my $groupsize = 1000;

    my @rawids = 'job0001' .. 'job9999';
    my @ids;
    push @ids, [ splice @rawids, 0, $groupsize ] while @rawids;

    my @answers;
    my $start = time;

    Forking::Amazing::run 20,
      sub { [ map +{id => $_, pid => $$, time => time - $start}, @{+shift} ] },
      sub { push @answers, @{+pop} },
      @ids;

    my $end = time - $start;

    dd \@answers;
    print "forking time $end\n";

    It's a significant speed-up :)

    Note that I moved the dd out of the timed section, since dd takes over 1.5 seconds to run on my machine and swamps the forking time.