fixed set of forked processes

by anonymized user 468275 (Curate)
on Dec 02, 2010 at 17:26 UTC

anonymized user 468275 has asked for the wisdom of the Perl Monks concerning the following question:

I followed a previous discussion and as a result read perlipc. One part of my requirement doesn't seem to be covered there. I have 150000 jobs, defined in a language called jil, in a single file, and I want to load them individually through a utility. Some of the jobs will have syntax errors, and some will fail but could succeed on retry, so I wrote a parser in OO Perl that gets an individual job from the file.

Now I am designing a method called spawn_jil, which should create a subprocess (e.g. with fork or open) to load the jobs in parallel. If a certain number of subprocesses have not yet exited, it has to wait until fewer than that number are still alive.

My thinking went something like this: suppose the parent creates a filehandle for the child to communicate back (use FileHandle) before spawning. It puts the filehandle in a hash, and after iterating through the parser it has to destroy the filehandles of exited subprocesses. I don't see a method for that in the FileHandle package. The child can't create the filehandle, because the parent then won't have access to it. I can't use the database, because the overhead would defeat the whole purpose. Ideally I'd also like to avoid communicating via Storable. Any ideas? Thanks
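For reference, a minimal sketch of the throttling described above, assuming the parent only needs to know how many children are alive. It keeps the child PIDs in a hash and calls wait() whenever the limit is reached, so no per-child filehandle is needed. The names run_throttled and the worker callback are hypothetical and not part of the original code.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run each job in a forked child, never letting more
# than $max children be alive at once. Returns a hashref of pid => exit code.
sub run_throttled {
    my ($max, $worker, @jobs) = @_;
    my (%alive, %status);

    for my $job (@jobs) {
        # At the limit: block until any one child exits, then reap it.
        while (keys(%alive) >= $max) {
            my $done = wait();
            last if $done == -1;              # no children left to wait for
            $status{$done} = $? >> 8;
            delete $alive{$done};
        }

        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: do the work and leave without running END blocks.
            POSIX::_exit($worker->($job));
        }
        $alive{$pid} = $job;                  # parent remembers the child
    }

    # Drain whatever is still running.
    while ((my $done = wait()) != -1) {
        $status{$done} = $? >> 8;
        delete $alive{$done};
    }
    return \%status;
}

# Usage: eight pretend jobs, at most three children at a time.
# The callback's return value becomes the child's exit code.
my $status = run_throttled(3, sub { my ($job) = @_; return $job % 2 }, 1 .. 8);
printf "%d children reaped, %d failed\n",
    scalar(keys %$status), scalar(grep { $_ } values %$status);
```

The hash of PIDs plays the role of the filehandle pool: entries are removed as wait() reports each exit, so there is nothing to destroy afterwards.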

One world, one people

Replies are listed 'Best First'.
Re: fixed set of forked processes
by derby (Abbot) on Dec 02, 2010 at 18:33 UTC

      One world, one people

Re: fixed set of forked processes
by sundialsvc4 (Abbot) on Dec 02, 2010 at 18:25 UTC

    I would approach this sort of problem by defining a fixed, configurable (small) number of threads, all of which are built to do the same thing: read a work-request from a single queue (e.g. Thread::Queue::Duplex), perform the unit of work (in an eval{} block), and write a response-record to the same or to a different queue.

    All of the threads, no matter how many there are, read from and write to the same queues. So, when a record is written to “the request queue,” no one really cares which thread winds up picking up the request and running it.

    The threads, in turn, are built to survive. Any runtime error that occurs during processing is absorbed, and a record of that event is merely added to the response-record for someone else down the line to deal with.

    To avoid too much competition for the “single file,” you might dedicate one thread to the task of reading a block of records from the file and shoving them into the request queue. By some appropriate means, let the thread snooze until the number of enqueued items drops below some threshold, at which time it reads a few more records from the file to recharge the queue.

    In this way, the jobs are indeed “processed in parallel,” but you maintain control over the attempted multiprogramming level at all times. Such a system could perform work at a predictable and steady rate no matter how many jobs ultimately needed to be run. The size of that file would not affect the rate at which work was carried out, only the amount of wall-time required to do it.

      Please don't suggest the use of Thread::Queue::Duplex until you've used it, and therefore encountered its limitations.

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
      Hmm, although I prefer fork to threads so as to survive later versions of Perl, this does give me an idea of how to implement my own threads using fork in a way that overcomes my filehandle problem with the standard drop-in solution:

      Since I know in advance I am going to use the max configured subprocesses given that there are 150000 jobs in the queue being rapidly thrown at my scheduling architecture, I could start by forking precisely that number of subprocesses using open |- and let the children live to the very end, sending the code they have to manage over the pipe.
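A rough sketch of that idea, under some loudly stated assumptions: it uses pipe and fork explicitly rather than open "|-" so that each child can close the write ends it inherits from earlier siblings (otherwise those workers would never see EOF), and it sends one job per line, whereas a real jil batch would need some delimiter. The names start_workers and dispatch_all are hypothetical.

```perl
use strict;
use warnings;
use IO::Handle;
use POSIX ();

# Fork $n long-lived workers; each reads jobs (one per line) from its own pipe.
sub start_workers {
    my ($n, $handler) = @_;
    my @workers;
    for (1 .. $n) {
        pipe(my $reader, my $writer) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            close($writer);
            close($_->{fh}) for @workers;   # drop siblings' inherited write ends
            while (my $job = <$reader>) {
                chomp $job;
                eval { $handler->($job) };  # one bad job must not kill the worker
            }
            POSIX::_exit(0);                # EOF on the pipe: parent is finished
        }
        close($reader);
        $writer->autoflush(1);
        push @workers, { pid => $pid, fh => $writer };
    }
    return @workers;
}

# Hand jobs out round-robin, then close the pipes and collect exit codes.
sub dispatch_all {
    my ($workers, @jobs) = @_;
    my $next = 0;
    for my $job (@jobs) {
        my $w = $workers->[ $next++ % @$workers ];
        print { $w->{fh} } "$job\n";
    }
    my %status;
    for my $w (@$workers) {
        close($w->{fh});                    # EOF tells the worker to exit
        waitpid($w->{pid}, 0);
        $status{ $w->{pid} } = $? >> 8;
    }
    return \%status;
}

# Usage: three workers sharing seven pretend jobs.
STDOUT->autoflush(1);
my @workers = start_workers(3, sub { print "worker $$ loaded $_[0]\n" });
my $status  = dispatch_all(\@workers, map { "job$_" } 1 .. 7);
print scalar(keys %$status), " workers exited\n";
```

Round-robin dispatch is the simplest choice here; anything resembling load balancing would need the workers to report back, for example over a second pipe or a socketpair.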

      update: but then whether I do that or use your queued thread approach, I also need to read back from the child in order to perform complicated load-balancing. If the subprocesses are allowed to die per iteration (i.e. per job parsed and submitted to a child or thread), I wouldn't have that problem.

      One world, one people

Re: fixed set of forked processes
by Illuminatus (Curate) on Dec 02, 2010 at 17:57 UTC
    I think a few more details are in order
    1. How will the jobs actually be run? You mention fork, but not exec or system
    2. If you want to communicate between parent and child, you probably don't want file-based IO. You would probably want IO::Socket instead. Use socketpair to create 2-way communication
    3. If you aren't going to exec the child into the jil job, you are probably better off using threads. Then you can simply use shared data to communicate and manage the children
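As an illustration of point 2, here is a small sketch of two-way parent/child communication over socketpair, along the lines of the bidirectional example in perlipc. The function name ask_child and the "done: ..." reply format are made up for the example.

```perl
use strict;
use warnings;
use Socket;
use IO::Handle;
use POSIX ();

# Send one job name to a forked child and return the child's one-line reply.
sub ask_child {
    my ($job) = @_;

    socketpair(my $to_child, my $to_parent, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
        or die "socketpair failed: $!";
    $_->autoflush(1) for $to_child, $to_parent;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: keeps only its own end, reads a request, writes a reply.
        close($to_child);
        chomp(my $request = <$to_parent>);
        print {$to_parent} "done: $request\n";
        close($to_parent);
        POSIX::_exit(0);
    }

    # Parent: keeps only its own end, writes a request, reads the reply.
    close($to_parent);
    print {$to_child} "$job\n";
    chomp(my $reply = <$to_child>);
    close($to_child);
    waitpid($pid, 0);
    return $reply;
}

print "child said '", ask_child("job-1"), "'\n";
```

Both handles are lexical, so they are released when they go out of scope; nothing has to be destroyed explicitly beyond the close calls.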


      It would be IPC::Open3 or IPC::Run3; the child will then handle the results of its own spawning. The parent only needs to know whether the child is still alive. So the communication is a bit fake: the parent only wants to know whether closing the IPC filehandle to the child fails, signifying that the child has exited. All this assumes there isn't another way to know if the child lives.

      update: or rather closing the filehandle is what perlipc suggests but what if I need to poll repeatedly for exited?

      update: socketpair is just a wrapper to what I already discussed. It complicates the question of how to create the multiple filehandles and still leaves the question of how to destroy them.

      One world, one people

        Have you looked at threads? You can create the threads, then use is_running to check if they are alive. If you need to pass actual data, you can use the aforementioned shared data.


Re: fixed set of forked processes
by salva (Canon) on Dec 02, 2010 at 17:58 UTC
    What kind of data do you want to pass back from the children to the parent?

    Is it just some boolean indicating failure/success, a line of text or some complex data structure?

      No data, not even failure or success; the child is capable of handling its own results. But the parent does need to know how many children are still running. Update to OP: I would also prefer to avoid spawning a unix grep of unix ps to count subprocesses, since that is also an unwanted overhead.

      One world, one people

        Your parent process will know the PID of each child process, so you can kill 0, $pid to see if it's running. Not sure what the performance implications are.

        Since you were mentioning ps & grep, I thought I'd mention this simple alternative.

        If all you want is to know if children are alive or dead, you might want to look into $SIG{CHLD} which tells you when kids die. That's in perlipc too.
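A sketch of how the two suggestions above might be combined, assuming the parent keeps its child PIDs in a hash. Non-blocking waitpid with WNOHANG reaps whatever has already exited, which also answers the "poll repeatedly" question, and the same loop could be called from a $SIG{CHLD} handler. One caveat: kill 0, $pid stays true for a child that has exited but not yet been reaped (a zombie), so the count is taken from the hash rather than from kill. The names %alive, reap_exited and is_alive are hypothetical.

```perl
use strict;
use warnings;
use POSIX qw(:sys_wait_h);

my %alive;    # pid => 1 for every child not yet reaped

# Reap every child that has already exited, without blocking.
# Returns the number of children still running.
sub reap_exited {
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        delete $alive{$pid};
    }
    return scalar keys %alive;
}

# True while the process exists, including an unreaped zombie.
sub is_alive {
    my ($pid) = @_;
    return kill(0, $pid) ? 1 : 0;
}

# Usage: start three children that finish at different times, then poll.
for my $n (1 .. 3) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        select(undef, undef, undef, 0.2 * $n);    # pretend to work
        POSIX::_exit(0);
    }
    $alive{$pid} = 1;
}

while ((my $left = reap_exited()) > 0) {
    print "$left children still running\n";
    select(undef, undef, undef, 0.1);             # poll interval
}
print "all children reaped\n";
```

Either way there is no need to shell out to ps and grep: the parent already has everything it needs in its own process table entries.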


Re: fixed set of forked processes
by anonymized user 468275 (Curate) on Dec 08, 2010 at 18:12 UTC
    My thanks to all who offered advice. Here is the solution now put into practice (most of the irrelevant code omitted). Note that I had to use a homegrown alternative to FileHandle because its documentation didn't mention support for IPC::Open3. The alternative is just an array of self-numbering filehandles for easy deletion.
    package jiloader;
    use Parallel::ForkManager;
    use IPC::Open3;
    use Fcntl qw(:flock SEEK_END);

    sub new {
        ...
        $opt{MAXFORKS} ||= 31;
        $opt{PM}   = Parallel::ForkManager->new( $opt{MAXFORKS} );
        $opt{PMFH} = [];    # filehandle pool
        ...
        # example of one of four files that track things
        unlink $opt{REJFILE};
        open my $rejh, '>>', $opt{REJFILE} or die "$!: $opt{REJFILE}\n";
        $opt{REJH} = $rejh;
        ...
        bless \%opt;
    }
    ...
    sub put {
        # batch changes by outer box
        # for submission to fork scheduler
        my $self = shift;
        if ( $self->{REST} ) {
            if ( $self->{TOPBOX} ) {
                $self->{BATCH} and $self->sched;
            }
            $self->{BATCH} .= $self->{CHG};
        }
        else {
            $self->{BATCH} .= $self->{CHG};
            $self->sched;
        }
    }

    sub sched {
        my $self = shift;
        my $pm   = $self->{PM};
        my $rh   = $self->getfh;
        my $wh   = $self->getfh;
        my $eh   = $self->getfh;

        # parent has allocated fh's so has to
        # free them when child exits
        # child cannot do this whatever the fh pooling solution
        $pm->run_on_finish( sub { $self->killfh( $rh, $wh, $eh ); } );
        unless ( $pm->start ) {
            # child: feed the batch to the jil command and parse its output
            open3 $wh, $rh, $eh, $self->{JILCOMMAND};
            print $wh $self->{BATCH};
            close $wh;
            unless ( $self->jiloutparse( $rh, $eh ) ) {
                my $errh = $self->{ERRH};
                # flock needs a filehandle; ERRFILE appears to be the file
                # name (cf. REJFILE/REJH above), so lock ERRH instead
                flock $errh, LOCK_EX;
                print $errh $self->{LASTOUT};
                print $errh $self->{LASTERR};
                flock $errh, LOCK_UN;
            }
            close $rh;
            close $eh;
            $pm->finish;
        }
        $self->{BATCH} = '';
    }
    ...

    One world, one people

Node Type: perlquestion [id://874956]
Approved by salva
Front-paged by pileofrogs