http://www.perlmonks.org?node_id=431043

K_M_McMahon has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks (some of whom are esteemed),

I came across a problem at work with some scripts that were written (and are supposed to be maintained) by our software group. I may sound kind of long-winded in this post, so please bear with me.

Background

I work in satellite operations (nothing cool like satellite TV, just NASA). One of the satellites I work on is not manned 24/7. We have software that is designed to manage the satellite when we are not staffed and to notify us whenever something is wrong, either with the spacecraft or the ground system. The spacecraft was launched in 1995 and built well before that, so much of the hardware we have is outdated.

To make a long story short, we have two strings of independent computers (HP-UX systems) that are basically duplicates of each other. We also have one machine, not part of either string, that acts as a Failover Monitor: it keeps an eye on the other machines and, if a problem is detected, moves operations from the prime string to the backup string.

Problem

Saturday night, a hard drive mounted on string 1 failed. This had the effect of locking up all the machines to which it was mounted: they were still up and appeared to be okay, but processes on them were frozen. This is exactly the type of thing the Failover Monitor was designed to detect and correct by forcing a failover to string 2, yet it neither notified the Flight Operations Team (FOT) that anything was wrong nor took any action.


Luckily for me, most of the code is written in Perl, so I could look through it and figure out what happened. I have determined the cause of the problem, and I have a few ideas about how to go about fixing it, but there are several restrictions.

1) The code is controlled by the software group, so I cannot *directly* modify it myself. Fortunately, I work closely with the developer, so I do actually get to write some of the code, or at the least tell her how I want it done.

2) As stated above, this is an outdated system. We are currently running Perl version 5.004_01. I have attempted in the past to get them to upgrade, but to no avail.

3) Pure-Perl modules that do not need to be installed I can add by putting them in my own directory and referencing them. Non-pure-Perl modules that are not included in the standard 5.004_01 release will not be available for use, and the system admins will not install them for me.

The actual offending code is one of the two following commands (either of which would create the same problem; I just don't know exactly where in the code the script was when it froze):
chomp($pse = `remsh $status[0] ps -ef | grep 'pse' | grep -v grep | wc -l`);
or
`rcp $status[0]:alive.log alive.log.p`;

One of these two commands was issued while string 1 was locked. The remsh or rcp connection was opened (or at least did not fail), but the process never completed. It sat there holding the calling script hostage, so the script was never able to notify the FOT that anything was wrong.

I have come up with a few ideas about how to get around this problem:
1) Eliminate the system calls and use Net::FTP and Net::Telnet, where I can set a timeout period so that if a command does not complete in X time, the failover monitor will realize something is wrong and can contact the FOT (see the Net::Telnet sketch after this list).

2) Eliminate the system calls and use a socket connection, with which I am not familiar, so some direction towards a good tutorial would be helpful (I can't find one on here).

3) If the developers are insistent on keeping their system calls, I can at least get them to modify the code so this problem does not recur. A sloppy/inelegant example that still has problems:

# Start the remote shell in the background, piping its output into a
# file, and capture the PID of the backgrounded pipeline via echo $!.
chomp($pse_id = `remsh $status[0] ps -ef | grep 'pse' | grep -v grep | wc -l > pse_status & echo \$!`);
my $not_done    = 1;
my $error_count = 0;
while ($not_done == 1) {
    my $running = `ps -ef | grep '$pse_id' | grep -v grep | wc -l`;
    if ($running != 0) {
        # Process is still running: increment the error counter,
        # sleep, then try again.
        if ($error_count > 10) {
            &notify_FOT;
        }
        else {
            $error_count++;
        }
        sleep(5);
    }
    else {
        open(TEMP, "<pse_status") or &notify_FOT;
        my @temp = <TEMP>;
        close(TEMP);
        chomp($temp[0]);
        $pse = $temp[0];
        last;    # or $not_done = 0;
    }
}
4) Some other method that I am not thinking of.
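
For option 1, a minimal sketch of what the remsh call might look like through Net::Telnet, which is pure Perl and so could live in a private lib directory under 5.004_01. The host, credentials, and the notify_FOT stub are placeholders for whatever the real failover scripts use:

use Net::Telnet ();

sub notify_FOT { die "paging the FOT\n" }    # stand-in for the real pager

my $host = 'string1-host';                   # stands in for $status[0]
my $t = Net::Telnet->new(
    Host    => $host,
    Timeout => 30,                     # give up after 30 seconds
    Errmode => sub { &notify_FOT },    # timeouts and errors page the FOT
);
$t->login('fot_user', 'password');     # placeholder credentials
my @out = $t->cmd("ps -ef | grep 'pse' | grep -v grep | wc -l");
my $pse = $out[0];
chomp($pse);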

Questions:

If you read this far, thanx!
1) Which method do you think is the best for preventing this sort of problem?

2) Have you found any simple problems in someone else's code that caused BIG problems where you work?


-Kevin
my $a='62696c6c77667269656e6440676d61696c2e636f6d'; while ($a=~m/(^.{2})/s) {print unpack('A',pack('H*',"$1"));$a=~s/^.{2}//s;}

Re: Executing Systems Calls with Care
by Anonymous Monk on Feb 15, 2005 at 06:50 UTC
    Isn't this exactly what alarm is designed for? perldoc -f alarm worked pretty well for me in the past.

      alarm() will not interrupt system calls in perl starting with 5.8. Since you are using a version older than that, you may be able to use it.

        But it does.

        $ perl5.8.6 -wle 'alarm 3; `cat /dev/zero`'

        Stops after 3 seconds.
Re: Executing Systems Calls with Care
by matija (Priest) on Feb 15, 2005 at 08:24 UTC
    Rewriting all the remote calls to use Net::Ftp and Net::Telnet would be a lot of work. Using sockets directly would be even more (and unnecessary) work.

    To fix the deadlock problem, the easiest fix would IMHO be to go through the code line by line, determine which calls might hang due to problems on the remote system, and wrap such calls in:

    eval {
        local $SIG{ALRM} = sub { die "alarm\n" };    # NB: \n required
        alarm($timeout);
        #
        # do stuff that might time out
        #
        alarm(0);
    };
    if ($@) {
        die unless $@ eq "alarm\n";    # propagate unexpected errors
        # handle the timed-out operation
    }
    else {
        # operation didn't time out; handle its result
    }
    Once you've done that, try to figure out what kind of problems on the monitoring machine you need to report to the team, and how you're going to notice if the monitoring machine silently fails.
      UPDATE: I modified this post because I thought no one had answered it, and I thought better of my question after testing. In the interest of keeping the following posts on topic, this sub-question was initially:
      "Will it work if you substitute &some_subroutine; instead of die?" in the section local $SIG{ALRM} = sub { die "alarm\n" };
      CURRENT QUESTION:
      local $SIG{ALRM} = sub { die "alarm\n" };
      I can replace the die "alarm\n" section here with a subroutine with no problems (actually tested it ;->)

      but this brings up something else that I don't understand: what causes the above example to enter the
      if ($@) {
          die unless $@ eq "alarm\n";    # propagate unexpected errors
          # handle the timed-out operation
      }
      section?
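
      (For reference, a minimal self-contained demonstration of that mechanism: it is the die inside the handler that sets $@, so whatever replaces die must still die, or the if ($@) branch is never entered.)

      eval {
          local $SIG{ALRM} = sub { die "alarm\n" };
          alarm(2);
          sleep(10);    # stands in for the call that hangs
          alarm(0);
      };
      print "entered the timeout branch\n" if $@ eq "alarm\n";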

      -Kevin
      my $a='62696c6c77667269656e6440676d61696c2e636f6d'; while ($a=~m/(^.{2})/s) {print unpack('A',pack('H*',"$1"));$a=~s/^.{2}//s;}
        Will it work if you substitute &some_subroutine; instead of die?

        Well, in principle, sure. But make sure that some_subroutine realizes that there may be error conditions it can't handle, and dies if it encounters a situation it isn't explicitly designed to handle.

        You need to be defensive when you're designing this subroutine. Defensive and paranoid.
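
        A sketch of that shape (the handler name and its cleanup are hypothetical):

        # Replacement for the inline sub: handle only the conditions this
        # routine was explicitly designed for, and re-die for everything
        # else so the surrounding eval still sees $@ set.
        local $SIG{ALRM} = \&on_alarm;

        sub on_alarm {
            # ... log the timeout, close half-open connections, etc. ...
            die "alarm\n";    # propagate anything we can't handle
        }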

        As 'perldoc -f eval' says:

        If the code to be executed doesn't vary, you may use the eval-BLOCK form to trap run-time errors without incurring the penalty of recompiling each time. The error, if any, is still returned in $@.

        Ordinary morality is for ordinary people. -- Aleister Crowley
      One problem with using alarms to time out system() calls is that the alarm doesn't guarantee that the child process (in this case the entire pipeline) will die from the alarm itself. If the alarm triggers, you need to make sure you kill off the child pipeline.
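
      A sketch of one way to do that: start the pipeline with open() so the script holds the child's PID, then kill it by hand when the alarm fires (the host name here is a placeholder for $status[0]):

      # open(FH, "cmd |") forks and returns the child's PID.
      my $pid = open(REMOTE, "remsh string1-host ps -ef |")
          or die "can't start remsh: $!";
      my @output;
      eval {
          local $SIG{ALRM} = sub { die "alarm\n" };
          alarm(30);
          @output = <REMOTE>;
          alarm(0);
      };
      if ($@ eq "alarm\n") {
          kill 'TERM', $pid;    # don't leave the stuck pipeline stranded
      }
      close(REMOTE);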
Re: Executing Systems Calls with Care
by bluto (Curate) on Feb 15, 2005 at 18:47 UTC
    my $running = `ps -ef | grep '$pse_id' | grep -v grep | wc -l`;

    FWIW using a shell pipeline is nice and concise, but there are problems with it if you really care about detecting if it is working properly.

    First, you aren't checking for errors at all from this command. Second, even if you were, some shells do not return errors from the earlier commands in a pipeline (e.g., if ps returned an error, it would appear that the process was gone when in fact it could still be running). Third, you aren't being precise in what you want grep to find. For example, what happens if $pse_id is something like 100? Then this command would match other processes with '100' anywhere in the 'ps -ef' output (i.e., different PIDs like 10037, part of the command name, the user name, the parent PID, etc.). In this case the best way to find out whether a process is still running is to remember its PID and then either use waitpid() or use something like 'kill 0, $pid;'.

    YMMV, but if I really want to check 'ps' output like this, I tend to use a form of ps where I know the exact format of the line the PID appears in (e.g. "ps -o <fmt> ...") and parse the output myself within Perl.
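
    For reference, a minimal sketch of both checks mentioned above (the sleep pipeline just stands in for a long-running child):

    use POSIX ':sys_wait_h';    # for WNOHANG

    my $pid = open(CHILD, "sleep 60 |") or die "fork failed: $!";

    # Non-blocking check on a child this script started itself:
    if (waitpid($pid, WNOHANG) > 0) {
        # child has exited and been reaped
    }

    # Probe any PID without actually sending a signal:
    if (kill(0, $pid)) {
        # a process with that PID still exists
    }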

Re: Executing Systems Calls with Care
by zentara (Archbishop) on Feb 15, 2005 at 13:50 UTC
    I think your idea of using a direct socket connection is the best. It is not that hard to find examples of creating a socket connection between two machines. It is just a client-server pair: two programs, one sends and one receives, and the socket is just like a "filehandle" which you "read from" or "write to". Just have the machine being monitored write a timestamped line to a file on the monitoring machine, and have a daemon sitting on the monitoring machine watching those timestamped lines come in. If a line is late, notify the technician.

    I think that would be the most reliable solution.
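
    A rough shape of that heartbeat pair (two separate scripts, shown back to back) using IO::Socket::INET, which ships with the core distribution; the host name, port, and two-minute staleness window are invented for the example:

    use IO::Socket::INET;

    # --- sender: run periodically on the monitored machine ---
    my $sock = IO::Socket::INET->new(
        PeerAddr => 'monitor-host',    # placeholder host
        PeerPort => 7777,              # placeholder port
        Proto    => 'tcp',
        Timeout  => 10,                # don't hang if the monitor is down
    ) or exit(1);
    print $sock time(), " string1 alive\n";
    close($sock);

    # --- receiver: daemon on the monitoring machine ---
    my $server = IO::Socket::INET->new(
        LocalPort => 7777,
        Proto     => 'tcp',
        Listen    => 5,
        Reuse     => 1,
        Timeout   => 120,              # accept() gives up after 2 minutes
    ) or die "can't listen: $!";
    while (1) {
        if (my $client = $server->accept()) {
            my $line = <$client>;      # timestamped heartbeat line
            close($client);
            # ... record $line for the watcher to check ...
        }
        else {
            warn "heartbeat is late\n";    # here: notify the technician
        }
    }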


    I'm not really a human, but I play one on earth.