
Detect a hung process

by gri6507 (Deacon)
on Apr 12, 2005 at 20:03 UTC ( #447153=perlquestion )
gri6507 has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks,

I have a UNIX program that, depending on the options passed in, can take anywhere from a couple of minutes to a couple of days to execute. This program talks to hardware through blocking calls that normally return within a few minutes at most. However, when 'hardware problems' occur, the program simply hangs (because of the blocking calls, as expected). Since these 'hardware problems' can occur at any time, I would like to be sent an indication (probably an email) whenever such a hang condition is detected.

I know how to send an email once the hung state is detected. My question is: how do I find out that this program is hung on a blocked call? In other words, I need a Perl script that would look at the Program Counter (PC) of my program and see whether the PC has not changed for some predefined time interval. Can that even be done? Any suggestions?

Re: Detect a hung process
by bluto (Curate) on Apr 12, 2005 at 20:42 UTC
    If you can't modify the program you want to monitor, you may have to resort to checking how long the process has been in a continuous blocked state, possibly by looking at the output from some form of the 'ps' command on your system.
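
    A minimal sketch of the 'ps' idea, under some loud assumptions: that your ps accepts "-o state=" (procps and BSD ps do, but the state letters differ between systems), and that "not in state R for several samples in a row" is a usable stand-in for "blocked". The names ps_state and looks_blocked are made up for illustration. Sampling is coarse, so a process that merely sleeps between samples looks the same as a hung one.

```perl
use strict;
use warnings;

# Hypothetical helper: one-letter process state as printed by ps
# (for example R = running, S = sleeping, D = uninterruptible wait),
# or undef if the process no longer exists.
sub ps_state {
    my ($pid) = @_;
    my $out = `ps -o state= -p $pid 2>/dev/null`;
    return unless defined $out && $out =~ /(\S)/;
    return $1;
}

# Poll $sampler (a code ref returning a state letter) every $interval
# seconds. Returns 1 once the process has been seen in a non-running
# state for $limit consecutive samples, or 0 if the process exits first.
sub looks_blocked {
    my ($sampler, $limit, $interval) = @_;
    my $streak = 0;
    while (1) {
        my $state = $sampler->();
        return 0 unless defined $state;               # process is gone
        $streak = ($state eq 'R') ? 0 : $streak + 1;  # reset on activity
        return 1 if $streak >= $limit;
        sleep $interval if $interval;
    }
}
```

    Usage would look something like looks_blocked(sub { ps_state($pid) }, 10, 60), which reports a hang after ten one-minute samples without a running state; both numbers are arbitrary.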

    Otherwise, if you want to interrupt the calls, look at timing them out with alarm(). This doesn't always work (especially for calls blocked on I/O).
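
    For reference, the usual alarm() idiom looks roughly like this. It is a sketch rather than a guaranteed fix: with_timeout is an invented name, and as noted above the signal may not break a call that is stuck in uninterruptible I/O.

```perl
use strict;
use warnings;

# Run $code, giving up after $seconds.
# Returns (1, $result) on success and (0, undef) on timeout;
# any other exception raised by $code is re-thrown.
sub with_timeout {
    my ($seconds, $code) = @_;
    my $result;
    local $SIG{ALRM} = sub { die "timeout\n" };   # newline suppresses "at line N"
    my $ok = eval {
        alarm $seconds;
        $result = $code->();
        alarm 0;                                  # cancel the pending alarm
        1;
    };
    my $err = $@;
    alarm 0;                                      # also cancel after a failure
    return (1, $result) if $ok;
    return (0, undef)   if $err eq "timeout\n";
    die $err;
}
```

    The blocking call goes inside the code ref, for example with_timeout(300, sub { read_from_hardware() }), where read_from_hardware stands in for whatever your program actually calls.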

    One low tech solution might be to have your program touch a file just before it begins a blockable call. Another process could watch for this file and complain loudly if it existed and its timestamp was old. Once your first process finished the call, it could just remove the touched file. You can use lockfiles, logfiles, semaphores, shared memory, signals, parent/child pipes, etc in the same kind of way if you are creative, but of course you have to implement this every place in your code where you make this kind of call.
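
    The touch-file scheme could be sketched as below. The three function names and the marker path are invented for illustration; the worker calls the first two around each blockable call, and a separate watcher process calls the third on a schedule.

```perl
use strict;
use warnings;

# Worker side: create (or refresh) the marker just before a blockable call.
sub mark_blocking {
    my ($marker) = @_;
    open my $fh, '>', $marker or die "cannot write $marker: $!";
    close $fh;
    return;
}

# Worker side: remove the marker once the call has returned.
sub clear_blocking {
    my ($marker) = @_;
    unlink $marker;
    return;
}

# Watcher side: true if the marker exists and is older than $max_age seconds.
sub marker_is_stale {
    my ($marker, $max_age) = @_;
    my $mtime = (stat($marker))[9];
    return 0 unless defined $mtime;        # no marker, so no call in progress
    return (time - $mtime) > $max_age ? 1 : 0;
}
```

    A watcher run from cron might do no more than: send the email if marker_is_stale('/tmp/myprog.blocking', 600) is true. The path and the ten-minute limit are placeholders.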

Re: Detect a hung process
by fauria (Deacon) on Apr 12, 2005 at 21:10 UTC

    If I understood your post correctly, you need to look into a program's internals (stack, PC, etc.) to ensure it has not hung. I also take it that this is proprietary software; otherwise it would be much easier to modify the program directly.

    If this is right, I would run a process or system call debugger on top of your program, such as ptrace or strace, to see what is happening during execution, and use its output to kill the process if something goes wrong.

    There is a Perl interface to ptrace on CPAN. Hope this helps.
Re: Detect a hung process
by 5mi11er (Deacon) on Apr 12, 2005 at 23:30 UTC
    And just to ensure that this point is realized, you need to create an arbitrary rule that states how long the blocking can happen before you decide that it is "Hung". There's never going to be a magic "it's definitely hung at this point" line in the sand, but for practical purposes, you should know approximately where to create that line.

    I would imagine it's better to kill and restart a process that is suspected of being hung but actually isn't than to let a process that really is hung stay that way.


Re: Detect a hung process
by moot (Chaplain) on Apr 13, 2005 at 03:30 UTC
    As other posters have mentioned, it is next to impossible to detect a hung process from outside unless you have clearly decided what can be measured to determine hung-ness. Simply examining the PC won't necessarily help, since it will change inside a system call even though that call never returns (for whatever reason)(*). You could rely on a timer (if no data is received or sent for X minutes, restart the process no matter what) and would get varying degrees of success, but again you would need some way to measure this data from outside the app; perhaps it writes to a log file or similar.

    From a quick hack point of view you will have some luck with the above approach, but for true robustness you'll be better off modifying the app itself, if possible.

    (*) - Actually you *might* get some results from running the app under an strace-like utility, and watching to see if any system call loops for "too long" - but you're left with again defining what "too long" means, and of course parsing strace output.
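
    A rough sketch of that strace idea which sidesteps the parsing: instead of reading individual system calls, just notice when the tracer's output goes quiet for "too long". watch_for_silence is an invented name, and the sketch assumes the tracer writes each line as it happens rather than buffering it. A process that is legitimately idle will look exactly like a hung one here.

```perl
use strict;
use warnings;
use IO::Select;

# Run @cmd and watch its combined stdout/stderr. Returns 'silent' if no
# output arrives for $timeout seconds (the command is then sent SIGTERM),
# or 'exited' if the command finishes first.
sub watch_for_silence {
    my ($timeout, @cmd) = @_;
    my $pid = open(my $out, '-|');         # fork with a pipe from the child
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                       # child: send stderr down the pipe too
        open STDERR, '>&', \*STDOUT or exit 126;
        exec { $cmd[0] } @cmd;
        exit 127;                          # only reached if exec failed
    }
    my $sel = IO::Select->new($out);
    while (1) {
        my @ready = $sel->can_read($timeout);
        unless (@ready) {                  # nothing written within $timeout
            kill 'TERM', $pid;
            close $out;
            return 'silent';
        }
        my $n = sysread($out, my $buf, 4096);
        unless ($n) {                      # EOF or read error: command ended
            close $out;
            return 'exited';
        }
    }
}
```

    Something like watch_for_silence(300, 'strace', '-p', $target_pid) would then return 'silent' after five minutes without traced activity; $target_pid and the 300 seconds are placeholders, and attaching with strace -p normally needs suitable permissions.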
