PerlMonks  

Introduction to Parallel::ForkManager

by biosysadmin (Deacon)
on Sep 14, 2003 at 21:37 UTC (#291446, perltutorial)

Introduction

The goal of this tutorial is to demonstrate how to use Parallel::ForkManager, a simple and powerful Perl module available from CPAN that can be used to perform a series of operations in parallel within a single Perl script. It is especially well suited to performing a number of repetitive operations on a relatively powerful machine, particularly a multiprocessor machine. This module uses object-oriented syntax; if that frightens you, then you should read some of the Object Oriented Perl tutorials.

Usage

One caveat to using Parallel::ForkManager is that you must instantiate the Parallel::ForkManager object with a number representing the maximum number of processes to fork. Here is an example of the syntax:

use Parallel::ForkManager;
my $manager = Parallel::ForkManager->new( 20 );

In many cases, this maximum number of processes to fork will also be the actual number of processes forked by your program. It is therefore very important to choose this number carefully, as forking too many processes can bring even the mightiest of machines to its knees. You can also change this number later in your program as needed with the following method:

$manager->set_max_procs( $newMaximumProcs );

After instantiating a Parallel::ForkManager object, you can start forking processes using the start method. It is important to also define the point at which the child processes will finish. This is usually performed within a for or while loop, so the syntax will look like this:

foreach my $command (@commands) {
    $manager->start and next;
    system( $command );
    $manager->finish;
}

The statement "$manager->start and next;" is a common idiom with Parallel::ForkManager: it forks a child process to run the command, while the parent advances to the next command in the @commands array. The start method takes an optional parameter named $process_identifier, which can be used in callbacks (see the Callbacks section).

Another useful method in the Parallel::ForkManager class is the wait_all_children method. It blocks the parent process until all forked processes have finished.
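Putting the pieces above together, here is a minimal sketch of a complete script. The three echo commands and the limit of 2 are placeholders chosen for illustration, not values from the tutorial.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

# Hypothetical workload: each entry is a shell command to run in a child.
my @commands = ( 'echo one', 'echo two', 'echo three' );

# Allow at most 2 children to run at the same time (illustrative value).
my $manager = Parallel::ForkManager->new(2);

foreach my $command (@commands) {
    # Parent: start returns the child's pid (true), so go to the next command.
    # Child: start returns 0 (false), so fall through and do the work.
    $manager->start and next;

    system($command);

    # The child exits here and never goes around the loop again.
    $manager->finish;
}

# The parent blocks here until every child has exited.
$manager->wait_all_children;
print "All commands finished\n";
```

Because the limit is 2, the third command is only forked once one of the first two children has exited.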

Callbacks

It is possible to define callbacks, which are blocks of code that are called at various points in the execution of your child processes. The callbacks themselves run in the parent process. There are three forms of callbacks:

  • run_on_start - run when each process is started
  • run_on_finish - run when each process is finished
  • run_on_wait - run when a process needs to wait for startup
Callbacks are defined using the run_on_start, run_on_finish, and run_on_wait methods, which take subroutines (or references to subroutines) as arguments. The arguments provided to the subroutine differ depending on which form of callback you are defining.

Here's an example of the run_on_start method:

$manager->run_on_start(
    sub {
        my ( $pid, $ident ) = @_;
        print "Starting process $ident under process id $pid\n";
    }
);

The arguments passed to the run_on_start sub are the process id of the forked process (provided by the operating system) and an identifier for the process that can be supplied in the call to the start method. Remember that if you don't provide an identifier in the call to start, $ident will be undefined and the Perl interpreter will complain when it is printed (if you are using warnings).
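As a rough sketch of how the identifier reaches the callback, the script below passes a name to start and records what run_on_start receives. The job names and the %started hash are made up for illustration.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

# Unbuffer STDOUT so text printed by the parent is not copied into
# children forked later and printed a second time.
$| = 1;

my $manager = Parallel::ForkManager->new(2);

# Filled in by the callback, which runs in the parent process.
my %started;

$manager->run_on_start(
    sub {
        my ( $pid, $ident ) = @_;
        $started{$ident} = $pid;
        print "Starting process $ident under process id $pid\n";
    }
);

# Hypothetical job names, used only as identifiers.
foreach my $name (qw(alpha beta gamma)) {
    # The argument to start becomes $ident in the callbacks.
    $manager->start($name) and next;

    # ... the child's real work would go here ...

    $manager->finish;
}

$manager->wait_all_children;
```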

Here's an example of the run_on_finish method:

$manager->run_on_finish(
    sub {
        my ( $pid, $exit_code, $ident, $signal, $core ) = @_;
        if ( $core ) {
            print "Process $ident (pid: $pid) core dumped.\n";
        } else {
            print "Process $ident (pid: $pid) exited ",
                  "with code $exit_code and signal $signal.\n";
        }
    }
);

This callback prints useful messages upon completion of the process. One caveat is that $ident must be defined in the start method of each process for this to work, otherwise this code needs to be modified.
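One possible use of run_on_finish is to collect each child's exit code in the parent. The sketch below assumes two made-up jobs, "ok" and "failed", whose children exit with codes 0 and 3 by passing the code to finish.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

my $manager = Parallel::ForkManager->new(2);

# Exit codes keyed by identifier, filled in by the parent-side callback.
my %exit_code_for;

$manager->run_on_finish(
    sub {
        my ( $pid, $exit_code, $ident, $signal, $core ) = @_;
        $exit_code_for{$ident} = $exit_code;
    }
);

# Hypothetical jobs: an identifier plus the exit code the child should use.
foreach my $job ( [ ok => 0 ], [ failed => 3 ] ) {
    my ( $ident, $code ) = @$job;

    $manager->start($ident) and next;

    # The child exits with the given code; finish accepts it as an argument.
    $manager->finish($code);
}

$manager->wait_all_children;

foreach my $ident ( sort keys %exit_code_for ) {
    print "$ident exited with code $exit_code_for{$ident}\n";
}
```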

The run_on_wait callback is a bit different. It is called when the Parallel::ForkManager object has to wait, either in start (because the maximum number of processes is already running) or in wait_all_children (while waiting for processes to exit). It takes both a subroutine (or subroutine reference) and an optional argument $period, which defines the number of seconds to wait before calling the subroutine again. Here's an example of its usage:

$manager->run_on_wait( sub { print "Waiting ... \n"; }, 3 );

This example prints its message about every 3 seconds. The documentation for Parallel::ForkManager notes that the exact period is not guaranteed and can vary slightly according to system load. If the second argument is not provided, then the subroutine is called once for each child that has to be waited for during the start and wait_all_children methods.
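To see run_on_wait fire, a small sketch can limit the manager to a single child so that the parent is forced to wait. The half-second period and the one-second sleep are arbitrary values picked to keep the run short.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

# Unbuffer STDOUT so text printed by the parent is not copied into
# children forked later and printed a second time.
$| = 1;

# Only one child at a time, so the parent must wait before each new fork.
my $manager = Parallel::ForkManager->new(1);

# Counts how often the parent-side callback is invoked.
my $waits = 0;

$manager->run_on_wait(
    sub {
        $waits++;
        print "Waiting ...\n";
    },
    0.5    # period in seconds; fractional values are allowed
);

foreach my $n ( 1 .. 2 ) {
    $manager->start and next;

    sleep 1;    # stand-in for real work in the child

    $manager->finish;
}

$manager->wait_all_children;
print "Callback ran $waits times\n";
```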

Bugs and Limitations

These are taken straight from the Parallel::ForkManager perldoc, which provides three caveats:
  • "Do not use Parallel::ForkManager in an environment, where other child processes can affect the run of the main program, so using this module is not recommended in an environment where fork() / wait() is already used."
  • "If you want to use more than one copies of the Parallel::ForkManager, then you have to make sure that all children processes are terminated, before you use the second object in the main program."
  • "You are free to use a new copy of Parallel::ForkManager in the child processes, although I don't think it makes sense."

Other Resources

One of the most valuable sources of information on this module is its perldoc-formatted documentation, which is available on systems that have Parallel::ForkManager installed and from CPAN.

Re: Introduction to Parallel::ForkManager
by mildside (Friar) on Sep 14, 2003 at 23:14 UTC
      You are correct, I have used the module just as described in the referenced node. I'm planning on writing up a synopsis of what I had to do, how I did it, and how it all works and stick it under craft.

      However, since you asked, here's a short answer. I did use Parallel::ForkManager to solve my problem, and I followed these suggestions from the thread:
      • I uncompressed each file to a temporary location (not /tmp) and then deleted it after the reformatting was through.
      • I made sure that the temporary location was on a different physical drive from the permanent location for the formatted files.
      • I rewrote my program to divide the work into 4 distinct sets, and then forked a process for each set.
      • I switched to the GNU version of gzip because of this reply (the GNU version is supposedly faster).
      • I staggered the start of the forked processes by sleeping for 60 seconds to try to avoid competition for disk I/O.
      There's a bit more about the optimization that I've done, but look for it under Craft in the next few weeks.
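The "divide into 4 sets, fork one process per set, stagger the starts" approach described above might look roughly like this. It is only a sketch: the file names are invented, the real uncompress and reformat work is replaced by a print, and the stagger is cut from 60 seconds to 1 so the example finishes quickly.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

# Hypothetical input: ten compressed files to reformat.
my @files = map { "data$_.gz" } 1 .. 10;

my $set_count       = 4;    # one forked process per set
my $stagger_seconds = 1;    # the post used 60; kept short here

# Deal the files round-robin into $set_count sets.
my @sets;
push @{ $sets[ $_ % $set_count ] }, $files[$_] for 0 .. $#files;

my $manager = Parallel::ForkManager->new($set_count);

for my $i ( 0 .. $#sets ) {
    # The parent pauses before every fork after the first to stagger start-up.
    sleep $stagger_seconds if $i > 0;

    $manager->start($i) and next;

    # Child: stand-in for the real uncompress, reformat, and delete steps.
    print "set $i (pid $$): @{ $sets[$i] }\n";

    $manager->finish;
}

$manager->wait_all_children;
```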
Re: Introduction to Parallel::ForkManager
by Molt (Chaplain) on Sep 15, 2003 at 10:35 UTC

    Very nice description of one of my favourite and most-used modules there.

    The only thing I'd add to your description is just how good this approach is when handling a lot of batch-file work on multiprocessor machines. When I had a program doing the same rather intensive thing to over 20,000 large files, I found that the ForkManager approach really screamed compared to the single-process method, and without the nastiness of poking fork() myself.

Re: Introduction to Parallel::ForkManager
by neilwatson (Curate) on Sep 15, 2003 at 17:12 UTC
    I would follow up with this warning. When determining the maximum number of forked processes, please test carefully. The ideal situation is to test on a non-critical machine. Start with a small number and work up. Monitor performance closely.

    Using the Parallel ForkManager can greatly increase the speed of your scripts. However, without proper testing it can bring a machine to its knees. I speak from experience.

    Neil Watson
    watson-wilson.ca

Re: Introduction to Parallel::ForkManager
by carric (Beadle) on Nov 10, 2005 at 06:36 UTC
    I have been trying to figure out how to multi-thread some portscans for my job, and everything I tried failed miserably. This worked like a CHAMP!! Thanks a lot.
Re: Introduction to Parallel::ForkManager
by metaperl (Curate) on Nov 10, 2005 at 21:33 UTC
    A very timely post. I would like to know how this module compares with Event, Event::Lib, and POE.
Re: Introduction to Parallel::ForkManager
by listanand (Sexton) on Aug 07, 2009 at 17:33 UTC
    Thanks for the post.

    I am new to Perl, and want to make sure I understand this right. When we say "$manager->start and next;", doesn't what is below this statement (the rest of the for loop, that is) get skipped completely? I am having a hard time understanding at what point (and precisely how) the child process is spawned, where it ends, and how the code actually executes. Can someone please explain this a bit more? The way I see it, it seems like all these child processes are spawned but nothing happens after that since we are out of the for loop already!

    Thanks in advance.

      When we say "$manager->start and next;" , doesn't what is below this statement (rest of the for loop that is) get skipped completely?

      The and isn't just a "do this then this", it's a short-circuit operator. If $manager->start evaluates to something true, it does the next, but otherwise it doesn't.

      In the particular case of Parallel::ForkManager, the ->start method returns values just like fork does; in the parent, it returns the pid of the child (which is a positive integer, and thus true), and in the child, it returns 0 (which is false).

      So, the result is that in the parent process, the next happens, and it goes around and spawns off the next one (which is what you want the parent to do). In the child, since the ->start returns a false value, the and isn't followed, and it goes ahead and does the bits of actual work. The child does its thing (with system in this case), and then calls the ->finish method, which is equivalent to exit, so the child doesn't go back to the top of the loop and try spawning off more children (that's the parent's job).
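To make the two return values visible, the idiom can be written out long-hand. This is a sketch for illustration only; the limit of 3, the loop over 1 .. 3, and the @child_pids array are made up.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

my $manager = Parallel::ForkManager->new(3);

# Pids seen by the parent, one per successful fork.
my @child_pids;

for my $n ( 1 .. 3 ) {
    # Same as "$manager->start and next;", but spelled out.
    my $pid = $manager->start;

    if ($pid) {
        # Parent: start returned the child's pid, so move on and fork again.
        push @child_pids, $pid;
        next;
    }

    # Child: start returned 0, so this is where the actual work happens.
    print "child $n (pid $$) doing its work\n";

    # finish exits the child, so it never reaches the top of the loop.
    $manager->finish;
}

$manager->wait_all_children;
print "parent $$ spawned: @child_pids\n";
```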

        Very clear and precise. Thank you very much !
Re: Introduction to Parallel::ForkManager
by sierpinski (Hermit) on May 14, 2010 at 13:49 UTC
    One caveat that I encountered when using this module was when I was using it along with Net::SSH::Expect to connect to a list of servers and run monitoring commands (checking for failed disks, full filesystems, etc). We use LDAP for authentication, and found that when running my script during the work day when everyone was here, it would have multiple failures connecting to servers, and I could never figure out why. Finally I realized when I ran it at night, it would always have 100% success rate. It turns out (and I had this verified with our LDAP team) that the forked processes would saturate the LDAP servers for requests, and some would end up failing.

    I thought it interesting that even though my maxprocs was set to something like 20 or 25, it just couldn't process them fast enough to make ssh happy.

Re: Introduction to Parallel::ForkManager
by Anonymous Monk on Nov 04, 2011 at 12:41 UTC

    Hi, I am using the Parallel::ForkManager module to start 4 processes in parallel. I want the output of those 4 processes for doing some operations like writing into an Excel/text file. I assigned the output to variables, but I am unable to access those variables after the $pm->finish command (outside the for loop). Please explain the simple way to retrieve data from child processes so that I can access that output even after the child process exits. I have gone through the CPAN module documentation but didn't get exactly what to do.
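Variables assigned in a child live in a separate process, so the parent never sees them. One way to get data back, assuming a version of Parallel::ForkManager recent enough to support it, is to pass a reference as the second argument to finish and read it as the sixth argument of the run_on_finish callback. The sketch below uses a made-up string as each child's output.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

my $manager = Parallel::ForkManager->new(4);

# Results collected in the parent, keyed by job identifier.
my %results;

$manager->run_on_finish(
    sub {
        my ( $pid, $exit_code, $ident, $signal, $core, $data_ref ) = @_;

        # $data_ref is whatever reference the child passed to finish.
        $results{$ident} = $$data_ref if defined $data_ref;
    }
);

for my $n ( 1 .. 4 ) {
    $manager->start($n) and next;

    # Child: hypothetical output that the parent needs later.
    my $output = "result of job $n";

    # Exit code 0, plus a reference to the data to hand back.
    $manager->finish( 0, \$output );
}

$manager->wait_all_children;

# Back in the parent, after the loop, the data is available.
print "$_: $results{$_}\n" for sort keys %results;
```

The same mechanism accepts array and hash references, so a child can return a larger structure instead of a single string.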
