PerlMonks
ssh output is partial when using fork manager

by Anonymous Monk
on Jan 23, 2018 at 22:49 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have created a script that SSHes to multiple nodes (~150), issues commands, and then processes the output. I have used this script successfully in the past, but the CLI commands then produced short output; the CLI command I am using now produces ~70MB of output. Initially I tested with 3 nodes and it was successful. However, when I run it on all nodes, the output (as seen in the logs) is cut short on most nodes. You can see that I have set the SSH output timeout to 240s, but it doesn't take that long in reality. I am not sure what issue I am running into, but it seems to be performance related, since it doesn't happen when the number of nodes (and child processes) is low.
foreach my $node (@nodes) {
    #sleep(1);
    my $pid = $pm->start and next;
    my $nodeip = $Configuration::node_list_nmnet{$node};
    $datestamp = strftime("%Y%m%d%H%M", localtime);
    my $ssh = Net::SSH::Expect->new(
        host       => $nodeip,
        password   => "*****",
        user       => "******",
        timeout    => 5,
        raw_pty    => 1,
        log_stdout => 0,
        exp_debug  => 0,
        log_file   => "/home/logs/$node/diam.$node.$datestamp.log"
    );
    my $login_output = $ssh->login();
    $ssh->waitfor("#", 30);
    $ssh->send("show diameter peers full debug");
    $ssh->waitfor("#", 240);
    my $output = $ssh->before;
    if ($output =~ /Peers in CLOSED state\S*\s*(\d*)/) {
        $peersvalue = $1;
        if ($peersvalue > 0) {
            print "$node\n";
            push @emaillog, "$node\n\n";
            $sendemail = 1;
        }
    }
    $datestamp = strftime("%Y%m%d%H%M", localtime);
    $pm->finish(0);
}
print "Waiting for Children...\n";
$pm->wait_all_children;

Replies are listed 'Best First'.
Re: ssh output is partial when using fork manager
by salva (Canon) on Jan 24, 2018 at 07:46 UTC
    I usually advise against using Net::SSH::Expect, which is not reliable. But here you are using it just like Expect...

    The issue is probably caused by the timeouts: 240s may not be enough to transfer ~70MB. Maybe some network links are not fast enough, or you are launching too many processes in parallel and overloading the CPU, the network, or the disk.

Re: ssh output is partial when using fork manager
by QM (Parson) on Jan 24, 2018 at 11:00 UTC
    Twenty Questions

    Further to salva's reply, try experimenting separately with the number of nodes and the length of the timeout. You may discover there is a relationship. If there is any variation, try plotting the number of nodes against the timeout needed to get a successful run.

    Try looking for problem nodes, by splitting the list into halves, or removing 5 or 10 different nodes each time. You may discover that there are one or two specific nodes that get hung up, but only with a large number of nodes (so it could be network traffic congestion, and poor recovery to/from certain nodes).

    What happens to process memory when node count goes up? (Perhaps there's a memory leak/retention you aren't expecting.)

    What happens if you run this from different host nodes? Especially, hosts not on the same end router as the original host?

    Do you have a different large pool of target nodes, other than the original? How does it perform compared to the original?

    Is there anything else you can vary?

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: ssh output is partial when using fork manager
by Anonymous Monk on Jan 24, 2018 at 16:48 UTC
    Can you help me understand how Fork Manager works? I have about 180 servers, and the command output for each of those servers takes a few seconds (let's say 20) to complete on my screen. If I am not mistaken, the script will attempt immediate "parallel" execution, but it is not really parallel. So we SSH to 180 nodes (that is fast) and then start sending the SSH command. This starts the 240s output timer. However, the processing has to jump from one child to the other and check until we get the character that indicates the end of the output, which stops the timer. Does the SSH timer stop in between checks? I mean, it might take 20s to reach the end of the output, but much longer until processing returns to a specific child.

    To give you some extra info, I have been printing the time when each child finishes: it is more than 240s since the start, and most children finish at the same time. I will go ahead and experiment with increasing the timer in the meantime. Thanks for your feedback so far.
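    For what it's worth, each $pm->start in Parallel::ForkManager is essentially a fork(): the child is a separate OS process with its own clock, so one child's waitfor() timeout runs independently of the parent and of its siblings; the parent does not have to "jump" between children. A bare-bones sketch of the same pattern with core fork() only:

```perl
use strict;
use warnings;

# Each child is its own OS process; all three sleep concurrently,
# so the whole loop finishes in about 1 second, not 3.
my @pids;
for my $n (1 .. 3) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {          # child: does its own work on its own schedule
        sleep 1;              # stands in for the per-node SSH session
        exit 0;
    }
    push @pids, $pid;         # parent: note the child and keep looping
}
waitpid $_, 0 for @pids;      # like $pm->wait_all_children
print "all children done\n";
```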
      With that many parallel processes all trying for network access, I have encountered some limitation. I think it's not in the host memory or number of processes, but somewhere deeper in the network drivers on the host. You can get a similar result by, for instance, trying to ping multiple hosts in parallel -- above a certain number of hosts, the network response goes horribly sluggish.

      For ethernet, complete congestion results in many retries, with each retry picking a random wait time from an ever larger window (see the Wikipedia article on exponential backoff). So 200 parallel processes would have many ethernet collisions, and some small fraction would end up with the maximum backoff time. At some point normal ssh connections time out due to lack of activity, and drop.
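      To illustrate, a hedged sketch of classic binary exponential backoff (the slot-count cap of 10 is taken from 802.3; the function name is made up for the example):

```perl
use strict;
use warnings;

# After the Nth collision a sender waits a random number of slot times
# drawn from 0 .. 2**min(N, 10) - 1, so the window doubles on each retry.
sub backoff_slots {
    my ($attempt) = @_;
    my $exp = $attempt < 10 ? $attempt : 10;   # 802.3 caps the exponent at 10
    return int rand(2 ** $exp);                # random slot within the window
}

# Window sizes after 1, 4, and 16 collisions: 2, 16, and 1024 slots.
printf "attempt %2d -> window 0..%d\n", $_, 2 ** ($_ < 10 ? $_ : 10) - 1
    for 1, 4, 16;
```

      The point is the tail: with 200 contenders, a few unlucky processes keep landing in the largest window, which is when an otherwise-healthy ssh session can sit idle long enough to drop.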

      I had exactly this problem with a little script I wrote years ago, before I knew about Parallel::ForkManager and the like. At the time it didn't matter that I didn't get all of the responses, and it wasn't for any automated system, just my own whims on finding a remote host with certain conditions. (See the doc page for how to limit the number of parallel processes.)

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

      OK, so I have changed the timer from 240 to 340. Now the script succeeds on many more nodes. However, I now get a lot of these errors: SSHProcessError The ssh process was terminated. at diameter_Status_Script.pl line 123. That line is: $ssh->waitfor("#", 240);
        Have you tried reducing the number of parallel processes?

        Run top or your favorite OS monitoring tool to see what is going on in your system.

      Hello all, thanks for your feedback. I tried to minimize the number of processes running at the same time by introducing a delay in the loop before spawning a new process. This way the total number of parallel processes is lower, since some finish before others start. It made my script a bit slower, but I had zero failures.
        But that is precisely what Parallel::ForkManager is for!

        You tell it how many processes to run concurrently when you create the object and then it takes care of never running more than so many processes, delaying the start calls as necessary.
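        A minimal sketch of that pattern (the cap of 10 and the host list are placeholders; the per-node SSH work is elided):

```perl
use strict;
use warnings;
use Parallel::ForkManager;

my @nodes = ('node1', 'node2');     # placeholder host list (assumption)

# Never more than 10 children at once: start() blocks once the cap is
# reached and returns again as earlier children call finish().
my $pm = Parallel::ForkManager->new(10);

foreach my $node (@nodes) {
    my $pid = $pm->start and next;  # parent gets the pid and moves on
    # ... open the SSH session and collect output for $node here ...
    $pm->finish(0);                 # child exits, freeing a slot
}
$pm->wait_all_children;             # block until every child is done
```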

        You can also use Net::OpenSSH::Parallel, which knows how to handle most of the issues you are facing by itself.

Node Type: perlquestion [id://1207787]
Approved by stevieb