Nodes in a pool of load-balanced web servers were taking turns going down. Each affected node was taking too long to complete TCP connections.
I didn't have any graphing or monitoring, so I quickly wrote something to tell me when a server was taking too long to respond. Then I could log in and debug.
This program prints a warning when a connection does not complete within 1 second. I also added a USR1 signal handler that dumps a summary of the connection statistics.
Update: The code is on GitHub.
Sample output:
$ ./sas-monitoring
./sas-monitoring is starting as pid 27925. From another window you may run:
kill -USR1 27925
.................................................................................................................................................Could not create socket to 10.3.7.107 : Connection timed out
-- 8< -- output removed here -- 8< --
10.3.7.62: failures: 103 successes: 1092
10.3.7.109: failures: 79 successes: 1116
10.3.7.78: failures: 78 successes: 1117
10.3.7.105: failures: 29 successes: 1166
10.3.7.108: failures: 28 successes: 1167
10.3.7.106: failures: 6 successes: 1189
10.3.7.79: failures: 3 successes: 1192
10.3.7.63: failures: 2 successes: 1193
10.3.7.60: failures: 1 successes: 1194
10.3.7.107: failures: 1 successes: 1194
10.3.7.61: failures: 0 successes: 1195
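The raw counts are easier to compare as a failure rate. Here is a minimal sketch that turns summary lines like the ones above into percentages. The helper names `parse_summary_line` and `failure_percent` are made up for this example and are not part of the monitoring script; the three sample lines are copied from the output above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: parse one summary line such as
# "10.3.7.62: failures: 103 successes: 1092".
# Returns (ip, failures, successes), or an empty list if the line does not match.
sub parse_summary_line {
    my ($line) = @_;
    return $line =~ /^\s*(\S+):\s+failures:\s+(\d+)\s+successes:\s+(\d+)\s*$/;
}

# Hypothetical helper: failures as a percentage of all attempts.
# Returns 0 when nothing was attempted, to avoid dividing by zero.
sub failure_percent {
    my ($failures, $successes) = @_;
    my $total = $failures + $successes;
    return $total ? 100 * $failures / $total : 0;
}

# Three lines taken from the sample output above.
my @sample = (
    '10.3.7.62: failures: 103 successes: 1092',
    '10.3.7.109: failures: 79 successes: 1116',
    '10.3.7.61: failures: 0 successes: 1195',
);

for my $line (@sample) {
    # Skip anything that is not a summary line.
    my ($ip, $failures, $successes) = parse_summary_line($line) or next;
    printf "%-12s %5.1f%% of %d attempts failed\n",
        $ip, failure_percent($failures, $successes), $failures + $successes;
}
```

For the worst server in the sample, 103 failures out of 1195 attempts works out to roughly 8.6%.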
The code:
#!/usr/bin/perl -w
#
# file: sas-monitoring
# purpose: monitor sas web servers
use strict;
use Data::Dumper;
use IO::Socket;
my @WEB_SERVERS = qw(
10.3.7.105
10.3.7.106
10.3.7.107
10.3.7.108
10.3.7.109
10.3.7.62
10.3.7.63
10.3.7.78
10.3.7.79
10.3.7.60
10.3.7.61
);
my $TIMEOUT_SECONDS = 1;
my $SERVER_PORT = 80;
# -- main --
my %connection_stats; # for tracking connection success/failure counts
for (@WEB_SERVERS) {
    $connection_stats{$_} = { success => 0, failure => 0 };
}
# GOAL : give the user a way to read connect stats via the USR1 signal.
# DOC : this signal will interrupt the socket connect call and count as a
# failure
$SIG{'USR1'} = sub {
    # list the servers with the most failures first
    my @ip_list = sort
        { $connection_stats{$b}{failure} <=> $connection_stats{$a}{failure} }
        keys %connection_stats;
    print "\n";
    for my $i (@ip_list) {
        printf "%16s: failures: %d successes: %d\n",
            $i, $connection_stats{$i}{failure}, $connection_stats{$i}{success};
    }
};
print $0 ," is starting as pid ", $$ ,
". From another window you may run:\n kill -USR1 ", $$, "\n";
while (1) {
    # GOAL : try to connect to the server
    SERVER: for my $server_ip (@WEB_SERVERS) {
        my $sock = IO::Socket::INET->new(
            PeerAddr => $server_ip,
            PeerPort => $SERVER_PORT,
            Proto    => 'tcp',
            Timeout  => $TIMEOUT_SECONDS,
        );
        unless ($sock) {
            warn "Could not create socket to $server_ip : $!\n";
            $connection_stats{$server_ip}{failure}++;
            next SERVER;
        }
        $connection_stats{$server_ip}{success}++;
        close($sock);
        syswrite(STDOUT, '.', 1);
    }
    # wait a second then check again
    #syswrite(STDOUT, "\n", 1);
    sleep(1);
}
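As the DOC comment in the script notes, sending USR1 interrupts the connect in progress and that attempt is counted as a failure. One possible way around this is sketched below. It assumes the interrupted connect reports EINTR in `$!`, which I have not verified across IO::Socket versions. The helper `connect_ignoring_signals` is hypothetical and is not in the script above, and the demo uses a fake connector so it runs without touching the network.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Errno qw(EINTR);

# Hypothetical helper, not part of the original script.
# $connect is a code ref that returns a socket on success, or undef with $!
# set on failure (ASSUMPTION: an interrupted connect leaves EINTR in $!).
# Returns ($sock, $interrupted). $interrupted is true only when every
# allowed attempt was cut short by a signal.
sub connect_ignoring_signals {
    my ($connect, $max_attempts) = @_;
    for (1 .. $max_attempts) {
        my $sock = $connect->();
        return ($sock, 0) if $sock;
        return (undef, 0) unless $! == EINTR;   # a genuine connect failure
    }
    return (undef, 1);                          # only ever interrupted
}

# Demo with a fake connector: the first call pretends a signal arrived,
# the second call "succeeds".
my $calls = 0;
my $fake_connect = sub {
    $calls++;
    if ($calls == 1) {
        $! = EINTR;
        return undef;
    }
    return 'socket';
};

my ($sock, $interrupted) = connect_ignoring_signals($fake_connect, 3);
print "got $sock after $calls calls\n";
```

In the monitoring loop, the code ref would wrap the `IO::Socket::INET` constructor call, and the failure counter would be incremented only when the helper returns no socket and `$interrupted` is false.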