http://www.perlmonks.org?node_id=747486

kp2a has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that polls thousands of devices with Net::SNMP.
If a device needs attention, I connect via SOCKET in myownmodule.pm and diddle some parameters in the device.
This works AOK, but after a random number of successful diddles (anywhere from one to a few dozen), perl stops: no error message, no crash, nothing unique about the device it stops on, and echo $? = 0.
Successful diddles are reported via STDOUT to a log file. The log file records the last diddle up to the point where the parameter update in the device is executed, but perl quits talking before the confirming log message.
Upon polling the device afterwards, the update was successful!
This is very annoying because I have thousands of devices to tend to!
I need a clue of where to start looking and/or how I might capture some debug information. It would be rather tedious to run in debug mode.
Thanks!

Re: unexpected quit - no error message
by almut (Canon) on Mar 02, 2009 at 16:44 UTC

    Just a wild guess... (anything else is hard without seeing the code :): your script might be trying to write to a closed socket (closed by the device). This generates a SIGPIPE signal, whose default handler action is to abort the program (assuming you're on Unix).

    Consider this example

    #!/usr/bin/perl
    use strict;
    use warnings;
    use IPC::Open3;

    # uncomment to handle SIGPIPE yourself
    # $SIG{PIPE} = sub { warn "Broken pipe\n" };

    open3(my $wh, my $rh, undef, qw'echo foo');

    # tiny delay, so the command _has_ closed its stdin in the meantime
    select undef,undef,undef, 0.2;

    print $wh "foo\n";  # expected to fail in this case
    close $wh;

    print "done.\n";  # not reached, except when handling SIGPIPE yourself

    When you run this as is, it quits without printing "done.", because it is terminated before it reaches that line. When you set up your own SIGPIPE handler, OTOH, you'd get your (self-generated) "Broken pipe" warning message, followed by "done."
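The same effect can be sketched for a socket rather than a pipe. This is only an illustration, not the original poster's code: a local socketpair stands in for the real device connection, and the helper name write_to_closed_peer is made up for the example. With SIGPIPE ignored, the failed write returns undef and sets $! to EPIPE, so the script can log the problem and carry on.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Socket;

# Hypothetical helper: writes to a socket whose peer has already closed.
# Returns a short status string instead of letting SIGPIPE kill the process.
sub write_to_closed_peer {
    # With SIGPIPE ignored, the write fails with EPIPE instead of aborting.
    local $SIG{PIPE} = 'IGNORE';

    # A socketpair stands in for the connection to the device (assumption).
    socketpair(my $near, my $far, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
        or die "socketpair: $!";
    close $far;    # simulate the device closing its end

    my $written = syswrite($near, "set param\n");
    return "wrote $written bytes" if defined $written;
    return $!{EPIPE} ? "peer closed connection" : "write failed: $!";
}

print write_to_closed_peer(), "\n";
print "done.\n";    # reached, because SIGPIPE no longer terminates the script
```

Checking the return value of every write is what makes this useful: ignoring the signal without checking would just hide the failure.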

    (Note that this sample snippet would not exit with $? being zero (so this simple theory doesn't quite fit your case), but there are other circumstances conceivable where it might...)
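To see how death by signal differs from a clean exit, the wait status can be decoded from a small wrapper. A sketch, assuming a Unix-like system; describe_status is a name invented here, and the child commands are throwaway one-liners rather than the real polling script. Shells such as bash report a signal death as 128 plus the signal number, not as 0.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a wait status (as left in $? by system)
# into a human-readable description.
sub describe_status {
    my ($status) = @_;
    return "failed to start" if $status == -1;
    my $signal = $status & 127;          # low 7 bits hold the signal number
    return "killed by signal $signal" if $signal;
    return "exited with code " . ($status >> 8);
}

# The child resets SIGPIPE to its default action and signals itself.
system($^X, '-e', '$SIG{PIPE} = "DEFAULT"; kill PIPE => $$; sleep 5');
print describe_status($?), "\n";

# A child that exits normally, for comparison.
system($^X, '-e', 'exit 0');
print describe_status($?), "\n";
```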

    How to debug?  You could run your script under strace (a tool I tend to recommend at least twice a week here :) in order to figure out what your script is doing last... This would likely prove helpful anyway, even if my above theory is wrong.

    E.g. in the above sample case, you'd see something like this at the end of the trace

    $ strace ./747486.pl
    (...)
    select(0, NULL, NULL, NULL, {0, 200000}) = ? ERESTARTNOHAND (To be restarted)
    --- SIGCHLD (Child exited) @ 0 (0) ---
    select(0, NULL, NULL, NULL, {0, 200000}) = 0 (Timeout)
    write(8, "foo\n", 4) = -1 EPIPE (Broken pipe)
    --- SIGPIPE (Broken pipe) @ 0 (0) ---
    +++ killed by SIGPIPE +++

    (notice the write(8, "foo\n",... line which returns with EPIPE)

      AH! Thanks - I am learning - did not know about

      "This generates a SIGPIPE signal, whose default handler action is to abort the program (assuming you're on Unix)"

      Yes, UNIX, is there any other OS? I will check that out and try to catch it.

      two questions:
      why $? = 0 = success on abort?
      why an abort after a successful write to the socket? - the device was updated AOK.

      Thanks for the other suggestions - I do have debug statements around the spot where it seems to quit - again, they must have been executed, but the output does not appear! I assume that if it is an abort, the output buffer is not written???

      correction: using IO::Socket

        two questions:
        why $? = 0 = success on abort?
        why an abort after a successful write to the socket? - the device was updated AOK.

        If we had something to look at, we'd need to guess less :)

        Could you show the significant parts of your code?  Preferably also an strace of a run that did abort (the last 20, or so, lines should be sufficient, typically).

        If you don't see the debug messages, you may be suffering from buffering. This may be especially the case when the program terminates unexpectedly and the script is not able to flush its buffers.
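One way to rule buffering out is to turn on autoflush for the handles that carry the debug output. A minimal sketch, assuming the messages go to STDOUT and STDERR; debug_line is a hypothetical helper, not something from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;

# Flush after every print so messages survive an abrupt termination.
STDOUT->autoflush(1);
STDERR->autoflush(1);

# Hypothetical helper: build a timestamped, PID-tagged debug line.
sub debug_line {
    my ($msg) = @_;
    return sprintf "%s [%d] %s\n", scalar(localtime), $$, $msg;
}

print STDERR debug_line("about to update device");
print STDOUT debug_line("update sent, waiting for confirmation");
```

With autoflush on, the last line in the log should be much closer to the statement that actually ran last.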

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: unexpected quit - no error message
by zentara (Archbishop) on Mar 02, 2009 at 16:10 UTC
    If I was troubleshooting this, I would sprinkle print statements liberally in the code and log them. Even start with print "1\n"; print "2\n";, etc. Then isolate what part of the code gets executed, and where it hangs.

    I know with "devices", code can hang so badly it will lock up the software, sort of in a "wait and retry" state.

    You may have to isolate the line, wrap it in an eval, with a timeout, and check what the eval error is. Google for "perl eval error" and you will find many examples.
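The eval-with-timeout idea can be sketched with alarm, along the lines of the pattern shown in the perlfunc entry for alarm. The name run_with_timeout and the sleeping code block are assumptions for illustration; the real device call would go in its place.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run $code, giving up after $seconds.
# Returns (1, result) on success or (0, error message) on failure.
sub run_with_timeout {
    my ($seconds, $code) = @_;
    my $result;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "timed out after ${seconds}s\n" };
        alarm $seconds;
        $result = $code->();
        alarm 0;
        1;
    };
    my $err = $@;
    alarm 0;    # make sure no alarm is left pending if $code died
    return $ok ? (1, $result) : (0, $err || "unknown error\n");
}

# Stand-in for a device call that hangs longer than the timeout.
my ($ok, $info) = run_with_timeout(1, sub { sleep 5; "finished" });
print $ok ? "ok: $info\n" : "failed: $info";
```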


    I'm not really a human, but I play one on earth My Petition to the Great Cosmic Conciousness
Re: unexpected quit - no error message
by shmem (Chancellor) on Mar 02, 2009 at 16:11 UTC
    I need a clue of where to start looking and/or how I might capture some debug information.

    One starting point is - get wireshark and capture the SNMP traffic.

    Next thing, wrap your "diddling" code into a block eval and examine $@ and the state of the "diddled" device afterwards.
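A rough sketch of the block-eval approach, applied per device so that one failure is recorded and the loop moves on. The diddle routine here is a made-up stand-in that fails for one device on purpose, and the device names are placeholders. Note that a block eval only catches die; it will not stop a signal such as SIGPIPE from terminating the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the real update routine; it fails for one
# device so that the error path is exercised.
sub diddle {
    my ($device) = @_;
    die "connection reset by $device\n" if $device eq 'dev2';
    return "updated $device";
}

# Wrap each update in a block eval and record $@ instead of stopping.
sub diddle_all {
    my @devices = @_;
    my %errors;
    for my $device (@devices) {
        my $ok = eval { diddle($device); 1 };
        $errors{$device} = $@ || "unknown error\n" unless $ok;
    }
    return \%errors;
}

my $errors = diddle_all(qw(dev1 dev2 dev3));
print "$_: $errors->{$_}" for sort keys %$errors;
```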