Help with multiple forks

by Anonymous Monk
on May 30, 2012 at 14:47 UTC
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

This is sort of a follow-up on a previous post asking for help with forks. I've been trying to learn, but I'm still very confused. I *think* I understand the basic idea, but the implementation still escapes me.

RichardK suggested I might be better off describing what I want to achieve, so that someone can suggest an approach or a module to use. So here it is, in pseudo-kind-of-code:

    my @array1 = (1..5); # N=5
    my @array2 = qw/a b/;

    for my $val1 (@array1) {
        # Start N processes here that can run in parallel.
        # Each process outputs data to its own separate file.
        # I will need it in the future.

        for my $val2 (@array2) {
            # For each of the N processes, wait until it is done,
            # then start 2 parallel processes which use
            # the output data as input.
            # Save the output of each process separately to 2N files
            # (two files with N elements would be better, but as I
            # couldn't figure that out, I just postprocess the data ;-)
        }
    }
    # "waitallchildren" or equivalent

    # (Postprocess step to reduce the final output to 2 files
    # - that I can do)

And here is my second attempt, using another module, but which still doesn't work. The outer part does, but as soon as I uncomment the inner part, my prompt doesn't return anymore. No idea what's going on.

    use Proc::Fork;

    for my $val1 (@array1) {
        run_fork {
            child {
                open FILE, ">$val1.txt";
                print FILE "Output of step 1\n";
                close FILE;
            }
            parent {
                my $child_pid_outer = shift;
                waitpid $child_pid_outer, 0;

                # for my $val2 (@array2) {
                #     run_fork {
                #         child {
                #             open FILE1, "$val1.txt";
                #             open FILE2, ">$val2$val1.txt";
                #             while (my $line = <FILE1>) {
                #                 $line =~ s/1/2/;
                #                 print FILE2 $line . "\n";
                #             }
                #             close FILE1;
                #             close FILE2;
                #         }
                #         parent {
                #             my $child_pid_inner = shift;
                #             waitpid $child_pid_inner, 0;
                #         } # Parent inner loop
                #     }; # Inner fork
                # } # For loop
            } # Parent outer fork
        }; # Outer fork
    }

I'm currently trying to read the documentation for other modules, but it's not easy going... I really don't understand much, so I'd be very happy if someone could put me on the right track.

Replies are listed 'Best First'.
Re: Help with multiple forks
by Eliya (Vicar) on May 30, 2012 at 16:37 UTC

    Getting back to Parallel::ForkManager, which you attempted to use in your first post, you might want to try

        use Parallel::ForkManager;

        my $fm1 = new Parallel::ForkManager(5);

        for my $val1 (1..10) {
            $fm1->start and next;

            # $0 = "processing step 1: $val1";
            sleep 10;

            open FILE, ">", "$val1.txt" or die $!;
            print FILE "Output of step 1\n";
            close FILE;

            my $fm2 = new Parallel::ForkManager(2);

            for my $val2 (qw/a b/) {
                $fm2->start and next;

                # $0 = "processing step 2: $val1/$val2";
                sleep 5;

                open FILE1, "<", "$val1.txt" or die $!;
                open FILE2, ">", "$val2$val1.txt" or die $!;
                while (my $line = <FILE1>) {
                    $line =~ s/1/2/;
                    print FILE2 $line . "\n";
                }
                close FILE1;
                close FILE2;

                $fm2->finish;
            }
            $fm1->finish;
        }

    This would run max 5 processes at the outer level, plus max 2*5 at the inner level, i.e. max 15 processes total (10 inner from the last round + 5 outer for the next round).  I'm not entirely sure that's what you want, but it's at least something to play with...

    P.S. If you uncomment the $0 = ... lines, you can grep for "processing" in the ps output to observe what's going on.

      Thanks a lot! That's exactly what I was trying to write. I'm a little ashamed because it seems I should have been able to come up with it myself.

      There is one funny thing, though: If I comment out the "for my $val2 (@array2) {..." and the corresponding closing curly bracket (and change $val2 to a constant like "step2"), it breaks with "Cannot start another process while you are in the child process at .../perl/lib/perl5/Parallel/ line 463." Not that I would want that - in my case I need the two nested loops. It's just that I don't know what is happening there. I mean, in the nested loop, I would have assumed I was in the 1st child process, which is itself the parent of the 2nd child process, right?
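
A possible explanation for that error, offered as a reading of the code rather than something confirmed in this thread: once the inner for loop is gone, the "next" in "$fm2->start and next" has only the outer loop left to act on. In the step-1 child (where $fm2->start returns a true pid), "next" therefore jumps to the next outer iteration, and that child then calls $fm1->start, which Parallel::ForkManager refuses to do inside one of its own children. The sketch below shows only the "next" behaviour in plain Perl; fake_start is a hypothetical stand-in for $fm2->start, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $fm2->start as seen by the step-1 child:
# in the forking process, start returns a true value (the new pid).
sub fake_start { return 12345 }

my @visited;
for my $val1 (1 .. 3) {
    push @visited, "top of outer loop, val1=$val1";

    # With the inner "for my $val2" loop removed, this 'next' has only
    # one enclosing loop to act on: the outer one.
    fake_start() and next;

    # Not reached in the forking process, because 'next' already jumped.
    push @visited, "rest of body, val1=$val1";
}

print "$_\n" for @visited;
```

With the inner loop in place, the same "next" stays inside that loop, so the step-1 child never returns to the top of the outer loop.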
Re: Help with multiple forks
by ikegami (Pope) on May 30, 2012 at 16:18 UTC
    You do "For each of the N processes, wait until it is done" N times, which means you wait for N*N processes when there are only N. Your pseudo code is flawed. As a result, I'm not sure what you are trying to do, but I think those loops shouldn't be nested.
Re: Help with multiple forks
by kennethk (Abbot) on May 30, 2012 at 16:34 UTC
    I don't have Proc::Fork installed on my box, but using just ordinary fork (and swapping to Indirect Filehandles to avoid possible global collision issues) the following code generates 4 files that contain "Output of step 1" and 16 files that contain "Output of step 2". Assuming this is the intended result, this should hopefully be a helpful guide toward your real use case.
        use strict;
        use warnings;

        my @array1 = (1..4);
        my @array2 = (1..4);

        for my $val1 (@array1) {
            my $outer_pid = fork();
            die "fork failed ($val1)" unless defined $outer_pid;
            if ($outer_pid == 0) {
                open my $fh, ">", "$val1.txt" or die "Open fail: $!";
                print $fh "Output of step 1\n";
                exit 0;
            } else {
                waitpid $outer_pid, 0;
                for my $val2 (@array2) {
                    my $inner_pid = fork();
                    die "fork failed ($val1,$val2)" unless defined $inner_pid;
                    if ($inner_pid == 0) {
                        open my $fh1, '<', "$val1.txt" or die "Open fail: $!";
                        open my $fh2, ">", "$val2$val1.txt" or die "Open fail: $!";
                        while (my $line = <$fh1>) {
                            $line =~ s/1/2/;
                            print $fh2 $line . "\n";
                        }
                        exit 0;
                    } else {
                        waitpid $inner_pid, 0;
                    }
                }
            }
        }

    Update: It occurs to me that you likely don't want a blocking wait on your children, since this is specifically not generating simultaneous workers. You have a potential race condition in that scenario, since generating the first set of files may take more time than is required to get to generating the second set. You can resolve this in classic style using flock. You can also just ignore the problem of reaping kids using local $SIG{CHLD} = 'IGNORE';. The following code includes a 1 second sleep in the first loop to demonstrate blocking, and generates 21 simultaneous processes at max:

        use strict;
        use warnings;
        use Fcntl ":flock";

        my @array1 = (1..4);
        my @array2 = (1..4);

        local $SIG{CHLD} = 'IGNORE';

        for my $val1 (@array1) {
            my $pid = fork();
            die "fork failed ($val1)" unless defined $pid;
            if ($pid == 0) {
                open my $fh, ">", "$val1.txt" or die "Open fail: $!";
                flock($fh, LOCK_EX) or die "Cannot lock $val1.txt - $!\n";
                print $fh "Output of step 1\n";
                sleep 1;
                flock($fh, LOCK_UN) or die "Cannot unlock $val1.txt - $!\n";
                exit 0;
            }
        }

        for my $val1 (@array1) {
            for my $val2 (@array2) {
                my $pid = fork();
                die "fork failed ($val1, $val2)" unless defined $pid;
                if ($pid == 0) {
                    open my $fh1, '<', "$val1.txt" or die "Open fail: $!";
                    flock($fh1, LOCK_SH) or die "Cannot lock $val1.txt ($val2) - $!\n";
                    open my $fh2, ">", "$val2$val1.txt" or die "Open fail: $!";
                    while (my $line = <$fh1>) {
                        $line =~ s/1/2/;
                        print $fh2 "$line\n";
                    }
                    flock($fh1, LOCK_UN) or die "Cannot unlock $val1.txt ($val2) - $!\n";
                    exit 0;
                }
            }
        }

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: Help with multiple forks
by mbethke (Hermit) on May 30, 2012 at 17:40 UTC

    To avoid race conditions I'd suggest delegating the starting of the two postprocessing children to the child itself, i.e. make them grandchildren. Fork off the function that writes the file, and as the last step fork again twice to produce the two children that will consume the output.

    This would also facilitate things if you wanted to eliminate the temporary files. If it's possible to stream data to the two grandchildren, you could just "open my $kid, '|-'" a pipe to each of them and write to that instead of the file that you're going to read again later.
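
As a minimal sketch of the forking open mentioned above: open with '|-' and no command forks, and the parent's handle is connected to the child's STDIN. Only one consumer is shown here (the suggestion above uses two), and the output file name is a made-up placeholder.

```perl
use strict;
use warnings;

my $out_file = "pipe_demo_$$.txt";    # hypothetical output name

# A forking open returns the child's pid in the parent, 0 in the
# child, and undef if the fork failed.
my $pid = open(my $kid, '|-');
die "Cannot fork: $!" unless defined $pid;

if ($pid == 0) {
    # Child: consume step 1's output from STDIN as it arrives.
    open my $out, '>', $out_file or die "Open fail: $!";
    while (my $line = <STDIN>) {
        $line =~ s/1/2/;
        print $out $line;
    }
    close $out or die "Close fail: $!";
    exit 0;
}

# Parent: write to the pipe instead of to a temporary file.
print $kid "Output of step 1\n";

# Closing a piped handle waits for the child and reports its status.
close $kid or die "Child failed, status $?";
```

For two consumers, the writer would open two such handles and print each line to both.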

      Just to make sure I got it: This is what Eliya does using Parallel::ForkManager, right?

      My next step is to learn about pipes :-)

        Yes, “learn about pipes” because this would make your job a helluva lot simpler.   Start a pool of child-processes that read from a pipe and do the work that has been given to them by means of that pipe.   (When the pipe is closed by the writer, the children’s read-requests fail and when this happens they terminate themselves.)

        Likewise, instead of “starting” the second-stage processes when the first stage has finished, have the first-stage processes write messages to a second pipe that is listened-to by the second-stage processes which are built using the same design.   After the first-stage processes consume their work and die-off, the second-stage processes in turn consume their work and die, and so on, until the parent finally realizes that all of its children have died (as expected) and it then terminates.

        Now, all of the processes (regardless of their role) do their initialization and termination only once, and perform their jobs as quickly as they are able, and the pipes take up the slack.   You tune the behavior of the system for maximum throughput by tweaking the number of processes that you create, and they perform work at that constant rate no matter how full or how empty the pipes may be.

        Think:   production line.
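
A rough sketch of such a worker pool in core Perl, under a few assumptions of mine: the pool size, the job list and the worker output file names are placeholders, and each worker gets its own pipe (rather than one shared pipe) so that buffered reads cannot split a line between two readers. Workers exit when their pipe reaches end-of-file.

```perl
use strict;
use warnings;
use IO::Handle;

my $workers = 2;           # pool size, the tuning knob (assumed value)
my @jobs    = (1 .. 5);    # placeholder units of work
my (@writers, @pids);

for my $n (1 .. $workers) {
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Worker: close every inherited write end, otherwise the
        # pipes would never report end-of-file.
        close $_ for @writers, $writer;
        open my $out, '>', "worker$n.txt" or die "Open fail: $!";
        while (my $job = <$reader>) {
            chomp $job;
            print $out "Output of step 1 for job $job\n";
        }
        close $out or die "Close fail: $!";
        exit 0;    # pipe closed by the writer: nothing left to do
    }

    # Parent: keep only the write end of this worker's pipe.
    close $reader;
    $writer->autoflush(1);
    push @writers, $writer;
    push @pids,    $pid;
}

# Hand out the jobs round-robin, then close the pipes to signal "done".
my $i = 0;
print { $writers[ $i++ % $workers ] } "$_\n" for @jobs;
close $_ for @writers;

waitpid $_, 0 for @pids;
```

A second stage would follow the same pattern, with the first-stage workers writing to the second stage's pipes instead of to files.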

        Edit:   Responding if I may to BrowserUK’s not-so-Anonymous reply (and his exceedingly discourteous but not-unexpected downvote) to the above ... kindly notice that most multiprogrammed systems are and always have been built around the notion of a limited (but variable) number of persistent worker processes that produce and consume work from a flexible queue of some kind.   Even in the earliest days of computing, when hulking IBM mainframe computers barely had enough horsepower to get out of their own way, their batch-job processing engines and interactive systems (e.g. CICS) had and still do have this essential architecture.   The reason why is quite simple:   you can tune it readily (just by adjusting the number of workers and/or their handling of the queues), and it performs at a predictable sustained rate without over-committing itself.   The queues absorb the slack.   Such an arrangement naturally conforms itself to, for example, computing clusters, and it gracefully supports the adding and removing and re-deployment of computing resources.

        “Over-committing” a system produces performance degradation that becomes exponential after a period of time in which it is linear, a harrowing phenomenon called (politely) “hitting the wall.”   The curve has an elbow-shaped bend which goes straight up (to hell).   For instance, I once worked at a school which needed to run computationally-expensive engineering packages on a too-small machine.   If one instance was running, it took about 2 minutes; with five, about 4. But with seven, each one took about 18 minutes and it went downhill from there ... fast.   A little math will tell you that the right way to get seven jobs done in 6 minutes (on average) is to allow no more than five to run at one time.   It worked, much to the disappointment of the IBM hardware salesman.   The rest sit in a queue, costing nothing for the entire time they sit there not-yet-started.   Likewise, a queue-based architecture will consistently deliver x results-per-minute at a sustained rate even if there are larger-y pieces of work to be performed.   A thread (or process) is not a unit-of-work.

        Just to make sure I got it: This is what Eliya does using Parallel::ForkManager, right?
        Exactly, there's your implementation already---sorry, I had only read kennethk's response when I replied.
Re: Help with multiple forks
by Neighbour (Friar) on May 31, 2012 at 08:46 UTC
    Looking at your pseudocode, perhaps Forks::Super is something you could use here:
        sub OnTaskEnd {
            my ($forksuperjob, $jobid) = @_;
            if (ref($forksuperjob) ne 'Forks::Super::Job') {
                die("OnTaskEnd called with argument of type [" . ref($forksuperjob) . "] instead of expected Forks::Super::Job");
            }
            print("TaskEnd: JobID [$jobid] Job pid [" . $forksuperjob->{real_pid}
                . "] starttime [" . int($forksuperjob->{start})
                . "] name [" . $forksuperjob->{name}
                . "] status [" . $forksuperjob->{status} . "]\n");
            # Do other stuff here, like starting new jobs to process files
            $forksuperjob->dispose;
        }

        foreach my $val1 (@array1) {
            my $forkresult = Forks::Super::fork {
                dir => 'dir_here',
                cmd => '',
                name => 'unique_name_of_task',
                callback => { start => \&OnTaskStart, finish => \&OnTaskEnd },
            };
        }
        print("Waiting for child jobs to finish\n");
        my $jobs_waited = waitall();
        print("[$jobs_waited] jobs were scheduled/running and have now finished\n");
    Alternatively, instead of executing external programs with the cmd-option, you could use sub to fork subroutines instead.

      Thanks. I hadn't noticed the package in CPAN because it's not common to find the package you want so far down the list. However in this case I don't need to limit the number of processes, so I will forgo both Parallel::ForkManager and Forks::Super

      So in the end, I am opting for plain standard fork as in the 1st example from kennethk, but, as suggested by mbethke with the inner loop and fork moved to the child of the outer fork (which is what happens in Eliya's example, I think)
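
For reference, a sketch of how that final arrangement might look with plain fork. This is my reading of the plan described above, not code posted in the thread: each step-1 child writes its file, forks the two step-2 grandchildren itself, and waits for them, while the top-level parent waits for all step-1 children at the end. The array contents are shortened placeholders.

```perl
use strict;
use warnings;

my @array1 = (1 .. 3);
my @array2 = qw/a b/;

my @outer_pids;
for my $val1 (@array1) {
    my $pid = fork();
    die "fork failed ($val1): $!" unless defined $pid;
    if ($pid) {    # parent: remember the child and keep looping
        push @outer_pids, $pid;
        next;
    }

    # Step-1 child: write its own file and close it before forking.
    open my $fh, '>', "$val1.txt" or die "Open fail: $!";
    print $fh "Output of step 1\n";
    close $fh or die "Close fail: $!";

    # Step 2: the child forks the consumers (grandchildren) itself,
    # so they cannot start before the input file is complete.
    my @inner_pids;
    for my $val2 (@array2) {
        my $inner = fork();
        die "fork failed ($val1,$val2): $!" unless defined $inner;
        if ($inner) {
            push @inner_pids, $inner;
            next;
        }
        open my $in,  '<', "$val1.txt"      or die "Open fail: $!";
        open my $out, '>', "$val2$val1.txt" or die "Open fail: $!";
        while (my $line = <$in>) {
            $line =~ s/1/2/;
            print $out $line;
        }
        close $out or die "Close fail: $!";
        exit 0;
    }
    waitpid $_, 0 for @inner_pids;    # reap both grandchildren
    exit 0;
}

# The "waitallchildren" step from the pseudocode.
waitpid $_, 0 for @outer_pids;
```

Because each child exits only after its grandchildren have finished, the parent's final waitpid loop also guarantees that all step-2 files are complete.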
