parallel processes on remote machines, read results and handle timeout of those processes

x12345 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl experts,

I would like to learn to do the following things with Perl:
--start parallel processes on many remote machines,
--get back the results,
--and kill the processes that do not finish.

I already have scripts from another person that do this, but I don't understand them well, so I am here.

The scenario is:
--There are 1000 Linux machines. On each machine there is already a shell script that checks the machine's memory, disk, etc.
--A Perl script on a server uses a piped open to open 1000 filehandles, one per machine; each one runs the shell script and brings back the results. The shell script normally needs just 5-15 s to finish, so I get the results back quickly. But sometimes the shell script gets stuck because of a problem on the machine (e.g. with a disk problem, the df command can hang for ages). In that case I should close the filehandle after waiting for some time, e.g. 300 s.

My two main questions are:

1. In the script I have, all the filehandles are set to non-blocking and sysread is used to read the output. Why use non-blocking? What about using while (<FH>) {push @results, $_;}? What is the difference?

2. How do I implement the timeout for a filehandle?

Thanks in advance!


Re: parallel processes on remote machines, read results and handle timeout of those processes by BrowserUk (Patriarch) on Oct 31, 2014 at 02:01 UTC
"In the script I have, it sets all the filehandles to non-blocking and uses sysread to read the output. Why use non-blocking? What about using while (<FH>) {push @results, $_;}? What is the difference?"

If you go the blocking route, and the first handle you attempt to read from fails to respond, you'll block forever. Even if it eventually responds, you have wasted a lot of time waiting for that machine when you could have been reading the responses from other machines that respond more quickly.

By going the non-blocking route, you get the data from whichever machine responds quickest, as soon as it is available, and thus minimise the overall time required to gather all the data. The downside is that you have to read in small, fixed-size chunks and reassemble the output yourself.

"How to do the timeout for the filehandle?"

Basically, you only need one timeout. With non-blocking handles, you can fire off the commands to all the machines without waiting for any of them. You then start your timer (record the current time). Then each time the select loop fires because data is available, you read that data and append it to the buffer for the appropriate handle. Then you check how long the loop has been running, and if it has exceeded your timeout, quit the select loop and close all your handles. (Give select itself a timeout too, so that the check still happens when none of the handles has anything to read.)

This means that the first machine you send the command to will have had very slightly longer to respond than the last, but as you didn't wait for the responses until after recording your start time, the difference will be minimal; it just means that most of the machines get a few milliseconds longer than required. This should not be a problem.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
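A minimal sketch of the approach described above: launch everything first, start one timer, then loop on select and read whatever is ready. It assumes each remote check can be started as a local command (for example ssh plus the script path); run_all, the %cmds layout and its return values are hypothetical names chosen for illustration, and IO::Select stands in for a hand-rolled four-argument select.

```perl
use strict;
use warnings;
use IO::Handle;                       # provides $fh->blocking
use IO::Select;
use POSIX qw(EAGAIN EWOULDBLOCK);
use Time::HiRes qw(time);             # sub-second time() for the deadline

# run_all(\%cmds, $timeout)  -- hypothetical helper
#   %cmds maps a name (e.g. a hostname) to an array ref holding the command,
#   e.g. [ 'ssh', $host, '/path/to/check.sh' ] (the path is an assumption).
#   Returns (\%output, \@timed_out).
sub run_all {
    my ($cmds, $timeout) = @_;
    my $sel = IO::Select->new;
    my (%name_of, %pid_of, %out);

    # 1. Fire off every command without waiting for any of them.
    for my $name (sort keys %$cmds) {
        my $pid = open(my $fh, '-|', @{ $cmds->{$name} })
            or die "cannot start command for $name: $!";
        $fh->blocking(0);             # sysread now fails with EAGAIN instead of blocking
        $sel->add($fh);
        $name_of{ fileno $fh } = $name;
        $pid_of{$name} = $pid;
        $out{$name}    = '';
    }

    # 2. Start the single timer only after everything has been launched.
    my $deadline = time + $timeout;

    # 3. Select loop: read whatever is ready, from whichever handle has it.
    while ($sel->count) {
        my $left = $deadline - time;
        last if $left <= 0;
        # can_read also times out, so the deadline is checked even when idle
        for my $fh ($sel->can_read($left)) {
            my $name = $name_of{ fileno $fh };
            my $n    = sysread($fh, my $buf, 4096);
            if (!defined $n) {
                next if $! == EAGAIN() || $! == EWOULDBLOCK();   # nothing after all
                $n = 0;                                          # real error: treat as EOF
            }
            if ($n == 0) {                                       # EOF: command finished
                $sel->remove($fh);
                close $fh;
                delete $pid_of{$name};
            }
            else {
                $out{$name} .= $buf;                             # reassemble the chunks
            }
        }
    }

    # 4. Whatever is still running has exceeded the timeout: kill it.
    my @timed_out = sort keys %pid_of;
    kill 'TERM', values %pid_of;
    for my $fh ($sel->handles) {
        $sel->remove($fh);
        close $fh;                    # also reaps the killed child
    }
    return (\%out, \@timed_out);
}
```

Usage might look like my ($out, $late) = run_all({ map { $_ => ['ssh', $_, '/path/to/check.sh'] } @hosts }, 300); with @hosts and the script path being your own. Reading 4096 bytes at a time is an arbitrary choice; any chunk size works because the pieces are appended to a per-host buffer.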
Re^2: parallel processes on remote machines, read results and handle timeout of those processes by x12345 (Novice) on Oct 31, 2014 at 10:50 UTC
Thanks for your explanation. It helps me understand better. That is why, in the script, it reads 1024 bytes at a time in a loop for each filehandle: for a non-blocking filehandle, it has to be done with this kind of chunked reading. Using "while (<FH>)" is more for a blocking filehandle: I mean, if the open did not fail, you can be sure to read all the lines from the filehandle.

My third question is about EAGAIN() and retry. I didn't understand this part of the code:

    $hl->{$_}->{retry}   = 0;
    $hl->{$_}->{retries} = 0;
    my $start     = time;
    my $blocksize = 1024;
    while (scalar keys %hltodo) {
      machine:
        for (keys %hltodo) {
            my $out = $hl->{$_}->{chld_out};
            # begin to read
            my $bytes_read = -1;
            while ($bytes_read) {
                my $buf;
                my $bytes_read = sysread($out, $buf, $blocksize);
                if (defined($bytes_read)) {
                    if ($bytes_read == 0) {
                        # eof
                        close($out);
                        last;
                    }
                    else {
                        $hl->{$_}->{data} .= $buf;
                    }
                }
                else {
                    if ($! == EAGAIN()) {
                        # retry
                        $hl->{$_}->{retry}++;
                        $hl->{$_}->{retries}++;
                        if ($hl->{$_}->{retry}) {
                            $hl->{$_}->{retry} = 0;
                            next machine;
                        }
                        usleep 10;
                    }
                    else {
                        last;
                    }
                }
            }
            delete $hl->{$_}->{"chld_out"};
            delete $hltodo{$_};
        }
        # kill remaining pids if timeout reached
        if ($opt{timeout} && time > $start + $opt{timeout}) {
            print STDERR "Timeout for: ", join(" ", keys %hltodo),
                " killing ", join(" ", values %hltodo);
            kill 1, values %hltodo;
            %hltodo = ();
        }
    }

When the non-blocking filehandle would block for whatever reason, sysread returns undef and sets $! to EAGAIN. Then $hl->{$_}->{retry}++ makes it 1, so it goes to "$hl->{$_}->{retry} = 0" and "next machine". So it will never do the 10-microsecond usleep? Am I missing something in this part?
Re^3: parallel processes on remote machines, read results and handle timeout of those processes by BrowserUk (Patriarch) on Oct 31, 2014 at 12:55 UTC
EAGAIN means that, at the exact moment you tried to read, nothing could be read without blocking. Rather than block, sysread returns undef with $! set to EAGAIN and lets you do something else in the meantime before trying again. (In your snippet, which polls every handle in turn without select, that usually just means the remote command has not produced any more output yet.)

I agree with you that the retry logic in your code snippet is borked: on the very first EAGAIN it resets the counter and jumps to the next machine, so it never reaches the usleep. What you choose to do about that is up to you. Personally, I'd probably omit the retry logic completely and just do the microsleep and loop back to the select; but you should probably consult someone with more *nix experience than me if that is your platform.
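A sketch of the polling loop from the snippet above with the retry counters dropped, as suggested: EAGAIN simply means "nothing yet, move on to the next machine", and the loop pauses once per pass rather than after each failed read. The %todo layout ({ fh, pid } per name), the name poll_all and the 10 ms pause are assumptions for illustration; the snippet's usleep 10 is only 10 microseconds, which barely throttles a busy loop.

```perl
use strict;
use warnings;
use IO::Handle;
use POSIX qw(EAGAIN EWOULDBLOCK);
use Time::HiRes qw(time usleep);

# poll_all(\%todo, $timeout)  -- hypothetical helper
#   %todo maps a name to { fh => $non_blocking_pipe, pid => $child_pid }
#   (a simplified stand-in for the $hl / %hltodo pair in the thread).
#   Returns (\%data, \@timed_out).
sub poll_all {
    my ($todo, $timeout) = @_;
    my %data  = map { $_ => '' } keys %$todo;
    my $start = time;

    while (%$todo) {
      MACHINE:
        for my $name (keys %$todo) {
            my $fh = $todo->{$name}{fh};
            while (1) {
                my $n = sysread($fh, my $buf, 1024);
                if (!defined $n) {
                    # EAGAIN: nothing to read yet. No retry counter; come
                    # back to this handle on the next pass.
                    next MACHINE if $! == EAGAIN() || $! == EWOULDBLOCK();
                    $n = 0;                   # real error: give up on this handle
                }
                if ($n == 0) {                # EOF: the remote script finished
                    close $fh;
                    delete $todo->{$name};
                    next MACHINE;
                }
                $data{$name} .= $buf;
            }
        }
        last unless %$todo;

        if (time - $start > $timeout) {       # kill whatever is still running
            my @late = sort keys %$todo;
            kill 'HUP', map { $_->{pid} } values %$todo;   # signal 1, as in the snippet
            close($_->{fh}) for values %$todo;
            %$todo = ();
            return (\%data, \@late);
        }
        usleep(10_000);                       # pause 10 ms between passes instead of spinning
    }
    return (\%data, []);
}
```

Compared with a select loop, this version wakes up every 10 ms even when no machine has sent anything, which is the price of not using select.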
Re^4: parallel processes on remote machines, read results and handle timeout of those processes by x12345 (Novice) on Oct 31, 2014 at 15:36 UTC
Re^5: parallel processes on remote machines, read results and handle timeout of those processes by BrowserUk (Patriarch) on Oct 31, 2014 at 15:56 UTC
Re: parallel processes on remote machines, read results and handle timeout of those processes by Anonymous Monk on Oct 31, 2014 at 00:27 UTC
I am categorically uncomfortable with the idea of polling one thousand machines "simultaneously"... Do 'em, say, a hundred at a time. Stay away from The Edge, no matter what language you are using.
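A minimal sketch of the "a hundred at a time" idea, assuming a hypothetical run_batch callback that launches and collects one group of hosts (for example with a select loop and its own timeout) and returns a hash ref of host => output. in_batches only handles the splitting and merging.

```perl
use strict;
use warnings;

# in_batches($size, \&run_batch, @hosts)  -- hypothetical helper
#   Calls run_batch on at most $size hosts at a time, waiting for each
#   batch to finish before starting the next, and merges the results.
sub in_batches {
    my ($size, $run_batch, @hosts) = @_;
    my %results;
    while (my @batch = splice(@hosts, 0, $size)) {
        my $got = $run_batch->(@batch);          # hash ref: host => output
        @results{ keys %$got } = values %$got;   # merge this batch's results
    }
    return \%results;
}
```

The trade-off is total run time: with a per-batch timeout T, the worst case is ceil(N / size) × T, so 1000 hosts in batches of 100 with a 300 s timeout could take up to 10 × 300 s = 3000 s if every batch contains one stuck machine. A sliding window (start a new host whenever one finishes) avoids that, at the cost of more bookkeeping.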

