
Re^3: Parallel::ForkManager and multiple datasets

by tybalt89 (Parson)
on Jul 12, 2018 at 20:52 UTC


in reply to Re^2: Parallel::ForkManager and multiple datasets
in thread Parallel::ForkManager and multiple datasets

Hi marioroy,

As you said, it's apples to oranges, so let's try to be more appleish (or is it orangeish?).

My first attempt was to group job ids so that a child does more than one job during its lifetime. That turns out to be fairly simple.
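For instance, chunking the id list is just a splice loop (this is the same idiom used in the driver further down):

    my $groupsize = 1000;
    my @rawids    = 'job0001' .. 'job9999';
    my @ids;
    push @ids, [ splice @rawids, 0, $groupsize ] while @rawids;
    # @ids now holds 10 array refs: nine with 1000 ids each, one with 999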

Right after I posted Re: Parallel::ForkManager and multiple datasets, I realized I had written roughly the same forking code several times, so it was time to move it to a module.

Here's the module. It uses callbacks for the child code and for the parent code that processes the child's returned value.

package Forking::Amazing;

sub run ($&&@)
  {
  my ( $maxforks, $childcallback, $resultcallback, @ids ) = @_;
  use Storable qw( freeze thaw );
  use IO::Select;
  my %fh2id;
  my $sel = IO::Select->new;
  while( @ids or $sel->count )                 # unstarted or active
    {
    while( @ids and $sel->count < $maxforks )  # start all forks allowed
      {
      my $id = shift @ids;
      if( open my $fh, '-|' )                  # forking open
        {
        $sel->add( $fh );                      # parent
        $fh2id{$fh} = $id;
        }
      else                                     # child code goes here
        {
        print freeze $childcallback->($id);
        exit;
        }
      }
    for my $fh ( $sel->can_read )              # collecting child data
      {
      $sel->remove( $fh );
      $resultcallback->( $fh2id{$fh}, thaw do { local $/; <$fh> } );
      }
    }
  }

1;

__END__

=head1 EXAMPLE program

 use Forking::Amazing;                # small example program
 use Data::Dump 'dd';

 Forking::Amazing::run(
   5,                                 # max forks
   sub { +{id => pop, pid => $$} },   # runs in child
   sub { dd pop },                    # process result of child in parent
   'a' .. 'z');                       # ids (one fork for each id)

=cut
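If the two tricks the module leans on are unfamiliar, here's a minimal standalone sketch of the piped forking open plus the Storable round trip (the hash contents are made up for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;

    use Storable qw( freeze thaw );

    if( open my $fh, '-|' )                        # fork; child's STDOUT becomes $fh
      {
      my $result = thaw do { local $/; <$fh> };    # slurp and deserialize
      close $fh;                                   # reaps the child
      print "child $result->{pid} says $result->{msg}\n";
      }
    else                                           # child
      {
      print freeze +{ pid => $$, msg => 'hello' }; # serialize a hashref to the pipe
      exit;
      }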

The module name may change in the future. :)

Here's code that uses the module and groups the job ids.
The id passed to the child is now an anon array of job ids, and a child now returns an anon array of results.

#!/usr/bin/perl
use strict;
use warnings;

use Forking::Amazing;
use Data::Dump 'dd';
use Time::HiRes qw(time);

my $groupsize = 1000;

my @rawids = 'job0001' .. 'job9999';

my @ids;
push @ids, [ splice @rawids, 0, $groupsize ] while @rawids;

my @answers;
my $start = time;
Forking::Amazing::run 20,
  sub { [ map +{id => $_, pid => $$, time => time - $start}, @{+shift} ] },
  sub { push @answers, @{+pop} },
  @ids;
my $end = time - $start;

dd \@answers;
print "forking time $end\n";

It's a significant speed up :)

Note that I moved the dd outside the timed section, since dd takes over 1.5 seconds to run on my machine and would swamp the forking time.
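If you want to check the effect of grouping on your own machine, a quick harness (a sketch, with group sizes chosen arbitrarily) is to rerun the same workload at a few group sizes and compare wall-clock times; a groupsize of 1 reproduces the original one-fork-per-id behaviour:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use Forking::Amazing;
    use Time::HiRes qw(time);

    for my $groupsize ( 1, 100, 1000 )
      {
      my @rawids = 'job0001' .. 'job9999';
      my @ids;
      push @ids, [ splice @rawids, 0, $groupsize ] while @rawids;

      my $count = 0;
      my $start = time;
      Forking::Amazing::run 20,
        sub { [ map +{id => $_, pid => $$}, @{+shift} ] },  # trivial child work
        sub { $count += @{+pop} },                          # just count results
        @ids;
      printf "groupsize %4d: %d results in %.3f sec\n",
        $groupsize, $count, time - $start;
      }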

Replies are listed 'Best First'.
Re^4: Parallel::ForkManager and multiple datasets
by marioroy (Vicar) on Jul 13, 2018 at 03:24 UTC
