
Flaky Server (IO::Socket and IO::Select Question)

by ginseng (Pilgrim)
on Jun 12, 2001 at 09:41 UTC ( #87729=perlquestion )
ginseng has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm working with perl in an industrial environment, running a machine control interface on OpenBSD. The interface communicates with a piece of industrial control equipment, a motion controller, that has an ethernet port. Almost everything is fine and dandy.

However, the motion controller's interface flakes out every now and then, and the link hangs. In essence, I need to break down and rebuild the connection. I know how to do this, but what I don't know is how to tell when I need to do this.

I'm using IO::Socket::INET to set up the link, and IO::Select to find out who can_read(). I thought of using a timeout on the can_read(timeout) call but the STDIN (actually redirected through TCPSERVER from the operator interface device) will reset that value when it wants more action. I'm using the Timeout value on the INET call:

    my $mcserver = new IO::Socket::INET(PeerAddr => $mchost, PeerPort => $mcport, Timeout => 6);

but it seems to have little effect. Finally, I've got an 'if' statement on my sysread:

    if ($handle == $mcserver) {
        if (sysread $handle, $result, 1024) {
            # ... deal with it ...
        }
        # if the program gets here, the controller went away.
        else {
            logger("Motion controller died unexpectedly.");
            $mcserver->shutdown(2);    # 2 = stop both reading and writing
            $links->remove($mcserver);
            $mcserver = new IO::Socket::INET(PeerAddr => $mchost, PeerPort => $mcport, Timeout => 6);
            $links->add($mcserver);
        }
    }

which doesn't seem to cut the mustard either. In fairness I should note that this entry does make it into the log, about 8 minutes after the link fails. (In testing, I'm yanking the ethernet cable.)

Obviously, there's something I'm missing, a lack of understanding. I'm hoping you can help.

Ready for enlightenment,
ginseng

Re: Flaky Server (IO::Socket and IO::Select Question)
by Arguile (Hermit) on Jun 12, 2001 at 12:59 UTC

    I'm not even close to the level where I can answer this yet but one statement caught my interest.

    "In fairness I should note that this entry does make it into the log, about 8 minutes after the link fails"

    I was recently browsing Brother Dominus's tome on "Suffering from Buffering" and that statement seems like it might be a similar problem. On the off chance it does correlate, I thought I'd mention it.

    An excerpt from the tome:

    In Perl, you can't turn the buffering off, but you can get the same benefits by making the filehandle hot. Whenever you print to a hot filehandle, Perl flushes the buffer immediately. In our log file example, it will flush the buffer every time you write another line to the log file, so the log file will always be up-to-date.

    [snip]

    If you happen to be using the FileHandle or IO modules, there's a nicer way to write this:

        use FileHandle;      # Or `IO::Handle' or `IO::'-anything-else
        ...
        LOG->autoflush(1);   # Make LOG hot.
        ...

      Thank you Arguile.

      I did have some trouble with buffering, but not in these areas. STDIN and STDOUT are autoflushed when connected to an interactive terminal, but not when redirected. Because of this, my code worked fine when run directly, but hung miserably when run under djb's tcpserver. I finally added:

          # turn on autoflush for STDIN and STDOUT
          select STDIN;  $|=1;
          select STDOUT; $|=1;
          select STDERR;

      to deal with that.

      Likewise in the log file, I'm doing much the same thing, except I'm not keeping the filehandle when not in use.

          sub logger {
              # set up a file handle for a log file.
              open LOG, ">>$log_file" or die("Could not open log file for appending.");
              select LOG; $|=1;
              select STDERR;
              print LOG "weldd $$: ", scalar(localtime), " @_\n";
              close LOG;
          }

      Alas, the IO::Socket stuff shouldn't require autoflushing, per perldoc:

      perldoc IO::Socket
      ...
      NOTE NOTE NOTE NOTE NOTE NOTE NOTE NOTE NOTE NOTE NOTE NOTE
      
      As of VERSION 1.18 all IO::Socket objects have autoflush
      turned on by default. This was not the case with earlier
      releases.
      
      NOTE NOTE NOTE NOTE NOTE NOTE NOTE NOTE NOTE NOTE NOTE NOTE
      ...
      

      The upshot of this is that, while I did have problems with buffering, I don't think it's the current problem :(

      ginseng

Re: Flaky Server (IO::Socket and IO::Select Question)
by bikeNomad (Priest) on Jun 12, 2001 at 19:35 UTC
    First, congratulations on using Perl as it was meant to be used: in embedded systems, not in an unpleasant CGI script <g>

    Second, it's not clear to me what you mean about your difficulty using select:

    I thought of using a timeout on the can_read(timeout) call but the STDIN (actually redirected through TCPSERVER from the operator interface device) will reset that value when it wants more action.

    You should be able to use select on both your TCP connection and STDIN; when you come out of select, zero or more file handles will be ready to read (or write, or will have exceptions on them, depending on what you were looking for). If zero, you had a timeout.
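To illustrate the timeout behaviour described above, here is a minimal sketch, not code from the thread. A socketpair stands in for the TCP link, and the names $near and $far are made up for the example. It shows can_read() returning an empty list when the timeout expires and returning the handle once data is waiting.

```perl
use strict;
use warnings;
use IO::Select;
use Socket;

# A connected socket pair stands in for the real TCP link (illustration only).
socketpair(my $near, my $far, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair: $!";

my $links = IO::Select->new($near);

# Nothing has been written yet, so can_read() returns an empty list
# once the 0.2 second timeout expires.
my @ready     = $links->can_read(0.2);
my $timed_out = @ready ? 0 : 1;

# After the peer writes, the same call returns the readable handle.
syswrite($far, "status\n");
@ready = $links->can_read(0.2);
my $ready_count = scalar @ready;

print "timed out first: $timed_out, ready afterwards: $ready_count\n";
```

An empty return here only says that nothing arrived in time. It does not say whether the other end is still there.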

    You may want to look into the Event or POE CPAN modules, both of which allow for simple creation of event-driven systems that can use select as well as timer events.

      Yes, yes, and I just got my new license plate today: CTL GEEK. I'm happy.

      My difficulty with the select->can_read() statement is with its appropriateness for my task, and not with the IO::Select code at all. It returns at the end of timeout to tell you nothing came from any handle, as designed. I need something that will tell me, "You told me to listen to this handle, but the handle went away and you'd better do something about that."

      For my main purpose, monitoring the operator interface on one socket (though treated as STDIN) and the motion controller on the other, it does exactly as specified.

      But again, what I really need is to know how to get IO::Socket or IO::Select to tell me that a socket went away.

      ginseng

(tye)Re: Flaky Server (IO::Socket and IO::Select Question)
by tye (Cardinal) on Jun 12, 2001 at 19:39 UTC

    Well, yanking your network cable could certainly cause TCP to take 8 minutes to decide that the link is down. But if select has said that you can read from the socket, then sysread shouldn't hang, even if you've yanked the network cable. I'm tempted to call that a bug in your TCP/IP stack at this point (though that seems unlikely with BSD).

    The "Timeout" parameter is probably only going to affect IO::Socket methods so sysread probably won't honor it. However, looking for a "read" operation in IO::Socket and IO::Socket::INET, I was surprised to only find recv which didn't appear to honor the timeout either. Now IO::Socket does some tricky things so the timeout might be handled but in a way that wasn't obvious. So you could try using $mcserver->recv(...) instead.

    Some timeouts in Socket are (or at least used to be) handled via alarm, so you could go that route. Though this could eventually cause corruption in Perl's internal state which would eventually kill a long-running process. (Though "safe Perl signals" will likely appear in the next major release of Perl, perhaps sooner.)
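For anyone who wants to try the alarm route, a minimal sketch might look like this. It is an illustration, not code from the thread: read_with_alarm is a hypothetical helper, and the socketpair stands in for the controller link. The caveat above about signals still applies on a Perl without safe signal handling.

```perl
use strict;
use warnings;
use Socket;

# Hypothetical helper: returns the data read, or undef if nothing
# arrived within $seconds (or the read failed).
sub read_with_alarm {
    my ($handle, $seconds) = @_;
    my $data = '';
    my $ok = eval {
        local $SIG{ALRM} = sub { die "read timed out\n" };
        alarm $seconds;
        my $n = sysread $handle, $data, 1024;
        alarm 0;
        die "read failed: $!\n" unless defined $n;
        1;
    };
    alarm 0;    # make sure no alarm is left pending
    return $ok ? $data : undef;
}

# Stand-in for the controller link (illustration only).
socketpair(my $near, my $far, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair: $!";

# Nothing to read: the alarm fires and the helper gives up.
my $silent = read_with_alarm($near, 1);

# With data waiting, the read returns at once.
syswrite($far, "OK\n");
my $reply = read_with_alarm($near, 1);

print defined $silent ? "unexpected data\n" : "timed out as expected\n";
print "reply: $reply";
```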

    In the face of sysread blocking after select said it wouldn't, I might resort to having a watchdog process. You could avoid having the watchdog depend on select by having the main process, for example, append one byte to an open file each time it reads or writes a packet (and truncate the file every M bytes). Then the watchdog could check the mtime or size of the file every N seconds and, if it doesn't change, kill the main process, start a new one, etc.
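The file-based watchdog idea could be sketched roughly as follows. The helper names tick and is_stale, the temporary file, and the 30-second limit are all assumptions made for the example. Killing and restarting the main process, and truncating the file every M bytes, are left out.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Main-process side: append one byte each time a packet is handled.
sub tick {
    my ($path) = @_;
    open my $fh, '>>', $path or die "Cannot append to $path: $!";
    print {$fh} '.';
    close $fh or die "Cannot close $path: $!";
}

# Watchdog side: the file is stale when it has not been modified
# within the last $max_age seconds. A missing file counts as stale.
sub is_stale {
    my ($path, $max_age) = @_;
    my $mtime = (stat $path)[9];
    return 1 unless defined $mtime;
    return (time - $mtime) > $max_age ? 1 : 0;
}

my ($tmp_fh, $beat_file) = tempfile(UNLINK => 1);
close $tmp_fh;

tick($beat_file);
my $fresh = is_stale($beat_file, 30);

# Simulate a silent main process by back-dating the file by 120 seconds.
my $old = time - 120;
utime $old, $old, $beat_file or die "utime: $!";
my $stale = is_stale($beat_file, 30);

print "stale right after a tick: $fresh, stale after silence: $stale\n";
```

A real watchdog would call is_stale from a separate process every N seconds and act when it returns true.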

            - tye (but my friends call me "Tye")

      tye/Tye, thanks

      My original post may have been unclear. I said "hang" but I didn't necessarily mean that the program hung, only that it kept on trucking despite the fact that the motion controller is no longer there.

      To restate my problem, this code is a middle-man. On opposite sides of its table are the client (the operator interface) and the server (the motion controller.) If the client goes away, the middle man will kick the server out too. If the server goes away, the middle man will try to bring it back.

      In actuality, the server is going away (every couple of days on its own; every time I yank the network cable in test) and the middle man just keeps on listening to an empty chair. I want it to get off its butt and get the server back to the table.

      Does that make sense? I keep feeling like there should be a good way of doing this, and fearing there is not...

      ginseng

Re: Flaky Server (IO::Socket and IO::Select Question)
by bikeNomad (Priest) on Jun 13, 2001 at 03:02 UTC
    Ah, I think I understand now. Perhaps rather than IO::Select::can_read on your TCP handle, you should instead use IO::Select::select, and look for exceptions as well as readable handles (and also get the timeout). I'm not sure, but it seems like you might get an exception on a downed IP connection. And being able to query the state of all your handles on a timeout is nice, too.

    Like you, I don't much like testing this occurrence using a failed sysread() call.

    Interesting voting on my earlier post, BTW. Wonder if it was that I was wrong somehow or just that I offended the CGI orthodoxy here? No matter...

      Ahah! printing perldoc IO::Select now...

      Researching IO::Select->select()

      Changing code...pretty major changes, getting an array of arrays, rather than a singular array in return...gotta handle both the array of handles ready to read, and the array of handles with problems...

      Testing code...

      ...fixing the stupid mistakes...

      ...forgot the "my" for this scope...

      ...it runs! but now it doesn't recognize errors on either link :(

      ...some more debugging...

      Okay. Kind of. The first thing I learned about IO::Select::select is that it has to be called differently. When I tried this: (code simplified slightly to show the pertinent parts)

          my $links = new IO::Select();
          $links->add($mcserver);
          $links->add(\*STDIN);
          ...
          while (my @allhandles = $links->select($links, undef, $links)) {

      the select call didn't block at all. Every time it hit, it returned an array of three arrays, all of which had no handles in them.

      Looking back at the perldoc IO::Select page, it says "Upon error an empty array is returned."

      TIP #1: The returned array is not empty. The arrays contained in the returned array are empty.

      So now I knew (rather, presumed) there was an error, so back to the docs I went. "'select' is a static method, that is you call it with the package name like 'new'." I finally figured out what that meant...my:

          while (my @allhandles = $links->select($links, undef, $links)) {

      should have been

          while (my @allhandles = IO::Select::select($links, undef, $links)) {

      and after I changed that, it blocked properly.

      TIP #2: 'Static' methods are called by package name, not by instance. (You probably knew that. I learned the hard way.) ;)

      The other thing I changed was how I handled a disappearing client. I figured the error array would report an EOF, telling me the client is no longer present. (It was a nice try...) I did have a sysread like this:

          if (sysread $handle, $command, 1024) {
              ... do stuff
          }
          else {
              ... client went away...handle it.
          }

      I found it is still necessary :)

      TIP #3: A closing socket (at least from telnet) is not an error, as far as IO::Select::select is concerned. (Probably a very reasonable thing.)
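A small self-contained sketch of TIP #3, using a socketpair in place of the real client connection (an assumption for the example, as are the names $near and $far). After a clean close by the peer, the handle shows up in the read set, and it is the sysread returning 0 that reveals the hang-up.

```perl
use strict;
use warnings;
use IO::Select;
use Socket;

# Stand-in for the client connection (illustration only).
socketpair(my $near, my $far, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair: $!";

my $links = IO::Select->new($near);

# The peer hangs up cleanly, as a telnet client would on exit.
close $far;

# Static call: read set, write set, exception set, timeout in seconds.
my @sets = IO::Select->select($links, undef, $links, 1);
die "select timed out or failed" unless @sets;
my ($readable, $writable, $errored) = @sets;

# The closed connection is reported as readable; reading it yields 0 bytes.
my $n_readable = scalar @$readable;
my $buffer     = '';
my $bytes      = sysread $near, $buffer, 1024;

print "readable handles: $n_readable\n";
print "peer closed the connection\n" if defined $bytes && $bytes == 0;
```

This only covers a clean close. A yanked cable produces no close at all, which is why nothing is reported in that case.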

      So now I've got the basics covered (i.e. I'm back to where I started with can_read()), and I've just yanked the ethernet cable to simulate a flaky server. Minutes pass... I take a smoke break. I have the client do things I know will generate traffic to the (disconnected) server. Still, I get no valid errors :(

      Bummer.

      TIP #4: Just because you learned a lot doesn't mean your code is right...

      Maybe there are flags I should be setting? Maybe I need to build a programmatic shrine to St. Larry in my code?

      ginseng

        Yes, as I hinted elsewhere, it can take several minutes for TCP to complain in the slightest in the face of a machine that is completely unresponsive. Things like "ICMP host unreachable" (in the case of a smart router) and "connection reset" (in the case of a rebooting server) can hurry this along.

        So you need to put your own maximum silence time into your code based on what makes sense for your situation. Usually this involves coming up with some harmless "heartbeat" packets that can be exchanged. In the face of an existing protocol, you hope to find some nondestructive "get status" request that you can send if there has been no other reason to talk to the server in the past N seconds. Then you can reset the connection whenever you have not gotten anything from the server for 2*N seconds.
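The N and 2*N rule above can be sketched as a small decision function. Everything here is hypothetical: link_action, its argument order, and the action names are made up for illustration, and the actual "get status" request depends on the controller's protocol.

```perl
use strict;
use warnings;

# Hypothetical helper: decide what to do with the link right now.
#   $now        current time in seconds
#   $last_sent  when we last sent anything to the server
#   $last_heard when we last received anything from the server
#   $n          the heartbeat interval N, in seconds
sub link_action {
    my ($now, $last_sent, $last_heard, $n) = @_;
    return 'reconnect'   if $now - $last_heard >= 2 * $n;
    return 'send_status' if $now - $last_sent  >= $n;
    return 'wait';
}

my $n = 5;
print link_action(100, 99, 99, $n), "\n";   # both directions recently active
print link_action(100, 94, 99, $n), "\n";   # we have been quiet for N seconds
print link_action(100, 99, 89, $n), "\n";   # server silent for 2*N seconds
```

The three calls print wait, send_status and reconnect in that order. In the main loop, one would call this after each can_read() return, updating the last-heard time on every successful sysread from the server and the last-sent time on every write to it.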

        BTW, the reason that TCP takes so long to notice a dead connection is that the protocol, by default, assumes that it can take up to 2 minutes for a packet to traverse the network. This means 4 minutes round trip and about 8 minutes to retry packets enough times that you decide to give up.

        In many (most) modern uses of TCP (at least those that don't involve dial-up users, non-terrestrial spacecraft, or carrier pigeons), this 2-minute max time is something like an order of magnitude longer than probably makes sense. You may check if your TCP stack supports configuring this value down to something more reasonable (but beware of changing this casually!).

                - tye (but my friends call me "Tye")
