jlongino has asked for the wisdom of the Perl Monks concerning the following question:
We have a Netscape Enterprise Server 3.6 running on a
Solaris 7 box. It becomes progressively more CPU intensive
the longer it runs. The Web Admin is planning on upgrading
to a more current version within the next couple of months
and does not want to expend any of Web Services' energy to
diagnose/solve the problem (hoping that the upgrade will
magically solve everything). So, they want me to provide a
short-term solution.
First, let me say that I'm not happy with the aesthetics of
the situation or solutions I'm considering. I feel that Web Services
should determine what the problem is and fix it. No
analysis has been done to determine what the problem is or
whether upgrading will really make a difference.
Second, I'm just looking for suggestions/feedback, not
code, nor Web Server tuning tips.
Currently, the Web Admin is notified by irate users or
determines by periodically monitoring the ns-httpd daemon
via top that the server has bogged down. He then
stops and restarts the daemon manually using stop and start
commands. I know this sucks but at least I got him to stop
power-cycling the Sun box whenever response times became
sluggish.
I did some preliminary searches on CPAN and PerlDoc for
perl modules that might give process statistics directly
with no success. My solution is to kick off a perl script
(via cron) every 10 minutes or so, scarf CPU utilization
percentages derived from top 4 or 5 times (pausing a second
or two between polls), average them, make a determination
as to whether or not to perform a stop/start on the daemon,
and then either exit or restart the server.
The Web Server fetches static pages for the most part and
does very little, if any, transaction processing.
Thanks in advance for any suggestions.
OT: re: Netscape Enterprise Server 3.6
by gregor42 (Parson) on Aug 09, 2001 at 02:49 UTC
OFF TOPIC
Netscape Enterprise Server 3.6 is no longer a supported product. It hasn't been for almost 2 years now.
The memory leak is a well documented problem. It first reared its ugly head in 3.4. An attempt was made to fix it in 3.5, which was much worse. 3.6 introduced dxwdog, which is supposed to be a watchdog script that looks for this very problem. I suggest looking into that & tweaking.
As a web software developer and engineer, my professional advice to you is to move to a supported platform. Even a move to Apache at this time would be far better in terms of performance and stability. The costs associated with that are labor. There's no GUI to configure Apache unless you use something like Tk/Apache a.k.a. Mohawk.
I suggest this since you mentioned that you are only serving static pages for the most part. If you were using LiveWire I'd suggest looking at Resin.
I know that you're trying to solve what appears to be a simple problem. You are not the first one to attempt this.
Migrate/Update!!
Wait! This isn't a Parachute, this is a Backpack!
I appreciate your input and I'll certainly look into
dxwdog like you suggested as it sounds promising.
However, as to being "Off Topic", I thought that it was
clearly stated that I was looking for a perl-based
solution (a module perhaps) that would facilitate
monitoring CPU utilization of a given process. Apparently
it wasn't as clearly stated as I thought, for which I
apologize.
As for upgrading or migrating to Apache, I've already made
both those recommendations but I'm not in a position to
demand them.
I don't think gregor42 meant that your question was off topic... As I read it, the answer was marked as off-topic. (i.e. not really perl related) In the same spirit my answer below is probably off topic as well....
In my experience, scripts that automatically diagnose and "fix" a problem (a la bouncing your webserver) are more trouble than they are worth. I'd recommend running a full-fledged monitoring program that alerts you whenever a problem occurs. I've had good success with Big Brother but am seriously considering switching over to netsaint. Both systems religiously monitor everything from memory usage to internet connectivity to database connectivity.
You can also write your own "plug-in" scripts to monitor anything you want. Some have even used this feature to send out stock market alerts, or keep an eye out for cheap airline tickets.
While the last two uses are rather esoteric, having a monitoring system that is easily customizable is crucial to running a high quality internet service.
-Blake
Re: HTTP Daemonology
by Agermain (Scribe) on Aug 09, 2001 at 00:51 UTC
Well, if you don't have access to process statistics, then maybe you could go in through the weblogs? Perhaps you could have a script, run by cron every ten minutes or so, to check up on the weblogs and restart it if it's accumulated too many (if the server gets bogged down by many rapid-fire requests) or too few (if you want to limit visible downtime to the end-user) requests since the last cron check. You wouldn't have to check the actual /data/ in the weblogs, just find out how many linefeeds there are, since there's one linefeed per server event.
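In the spirit of that suggestion, here is a minimal sketch of the log-counting idea. The paths in the trailing comment are hypothetical; the real access log location depends on your server root.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the lines in a file (one line per logged request).
sub count_lines {
    my ($file) = @_;
    open my $fh, '<', $file or return 0;   # missing log counts as empty
    my $n = 0;
    $n++ while <$fh>;
    close $fh;
    return $n;
}

# Compare the current line count against the one saved last run,
# persist the new count, and return the difference.
sub requests_since_last_run {
    my ($log, $state) = @_;
    my $now  = count_lines($log);
    my $last = 0;
    if (open my $fh, '<', $state) {
        chomp($last = <$fh> // 0);
        close $fh;
    }
    open my $out, '>', $state or die "can't write $state: $!";
    print $out "$now\n";
    close $out;
    return $now - $last;   # negative usually means the log was rotated
}

# cron would run something like (paths hypothetical):
# my $new = requests_since_last_run('/logs/access', '/var/tmp/ns-lastcount');
# then restart if $new is above (or below) whatever bounds you settle on
```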
Quick, dirty, and you don't need to figure out a new module, at least...?
andre germain
"Wherever you go, there you are."
Two thoughts here - not really related:
1) Checking the weblogs really doesn't do much to solve his problem, though. His problem is with CPU time being hogged by that one process. I don't think checking web logs is going to do much beyond telling him whether or not his machine has been hit with a high number of HTTP requests recently. Does that necessarily correlate with poor performance? In some cases, it seems so...but I'm not convinced that web hits alone are going to grind his sun box to a halt. It shouldn't - especially since it sounds like most of the pages are static!
I do like his solution of running top and scraping the output for process info, though.
2) One of the things I do to monitor one of my websites is run a simple perl script in cron using the LWP and HTTP::Request modules. This way, you can make your own request to the site, check the url for response time, and respond accordingly. Either restart the server automatically through that cron job, or, at the very least, fire off an email to you warning of the potential problems.
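A minimal sketch of that cron probe, with the timing logic factored into a function so it can be exercised without a live server. The LWP wiring in the trailing comment is an assumption about your setup, not tested code.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Time one fetch and decide whether the server looks bogged down.
# $fetch is a coderef that makes the request and returns true on success,
# which keeps the timing logic testable without a live web server.
sub probe {
    my ($fetch, $max_seconds) = @_;
    my $t0      = [gettimeofday];
    my $ok      = $fetch->() ? 1 : 0;
    my $elapsed = tv_interval($t0);
    return { ok => $ok, elapsed => $elapsed,
             slow => ( $elapsed > $max_seconds ? 1 : 0 ) };
}

# Live wiring would look something like this (LWP assumed installed):
#   my $ua = LWP::UserAgent->new( timeout => 30 );
#   my $r  = probe( sub { $ua->get('http://yourhost/')->is_success }, 10 );
# then restart the daemon, or just mail a warning, if !$r->{ok} or $r->{slow}
```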
A non-perl ( yes, I know ) possibility is Sun's SymbEL release 3 out on http://www.sunfreeware.com, which our webmaster runs on his Solaris boxen. He did this due to a custom bit of Java that leaks badly. He is quite taken with it, tho I admit to not playing w/ it. YMMV.
For point 2, that's a beauty idea, even if you only use it to give you a heads up of impending doom!
I'd also, on the point one above, check the firewall and IDS logs to see if anything less-than-tasty is coming from outside.
Lastly, check out Sun's Sun Performance and Tuning Techniques doc ( you may have to register w/ http://sunsolve.sun.com ). The techniques are pretty light ( I run them from time to time on my heavily utilized firewalls - no perl - w/ negligible additional load ).
As an aside, sysadmins who hang their hopes on patches and new releases exclusively w/o understanding what is *actually* wrong have, IME & IMHO, very short tenures in quality IT staffs. SysAdmins who can diagnose, provide evidence, and occasionally cruft a work-around, thrive - w/ little REM sleep, tho.
UPDATE: My sysadmin comments were directed toward jlongino's web admin, and not at jlongino. jlongino++ for taking this on.
HTH
--
idnopheq
Apply yourself to new problems without preparation, develop confidence in your ability to meet situations as they arise.
I agree that top is way useful, but be aware there's been a security problem reported (granted, nearly a year ago) on "systems that have top installed with set user or group permissions".
I saw it here--unfortunately, there weren't many details or any further references in this report, which I guess is a compliment to the reader's presumed research skills.
adamsj
They laughed at Joan of Arc, but she went right ahead and built it. --Gracie Allen
Re: HTTP Daemonology
by kschwab (Vicar) on Aug 09, 2001 at 05:58 UTC
A few ideas:
- Scraping top is tough; the Solaris-supplied /bin/ps might be easier. Try something like ps -e -o pid,pcpu,comm | grep ns-httpd
- If you really want deeper info on CPU utilization, you could look at Solaris::Procfs
- Have you picked up the last patch for Netscape Enterprise 3.6? If you can't upgrade to the newer iPlanet 4.x, you should at least be running 3.6SP3.
Thanks for the recommendation. I had already implemented
the program using top, but then rewrote it using
ps -efo pid,pcpu,fname as you suggested (and
then internally grepped out the ns-httpd).
It was a bit easier, but more important, it doesn't require
a non-OS application being installed.
Now that the short-term annoyance is out of the way, I can
concentrate on some of the other excellent suggestions that
require lengthier investigation.
Thanks everyone for the input.
#!/bin/perl -w
use strict;

my @toplines = `echo q | top`;
my $ok = 0;
for (@toplines) {
    next unless /\w/;
    chomp;
    if (/PID/) {    # header line: real process lines follow
        $ok++;
        next;
    }
    if ($ok) {
        my ($PID, $USERNAME, $THR, $PRI, $NICE,
            $SIZE, $RES, $STATE, $TIME, $CPU, $COMMAND) = split;
        print "$PID $CPU $COMMAND\n";
    }
}
I know, I know, your ps solution is great, too ;) I just wanted to show a relatively painless way to scrape top...
For those interested, here is a snippet of the ps variation
I ended up using:
#!/usr/local/bin/perl
use strict;

my ($pcpu, $pname, $i, @outrecs, @matches, $process);
my $MaxUtil = 70;
my $sttotal = 0;
my $ct      = 0;
foreach $i (1..5) {
    @outrecs = `ps -efo pid,pcpu,fname`;
    @matches = grep /ns-httpd/, @outrecs;
    ## Note: can have more than one ns-httpd running
    ## but we'll average it in as well
    foreach (@matches) {
        $ct++;
        chomp;
        ($process, $pcpu, $pname) = split(" ");
        $sttotal += $pcpu;
    }
    sleep(3);
}
my $pcpuavg = $ct ? $sttotal / $ct : 0;   # guard against no ns-httpd at all
The generated log file provides interesting data but I
haven't actually been able to make a correlation between
the restart time periods and the access or error logs:
PID TIME CPU% NAME
5257 Wed Aug 8 23:47:07 2001 2.88 ns-httpd
5257 Wed Aug 8 23:49:09 2001 2.84 ns-httpd
5257 Thu Aug 9 00:00:16 2001 4.10 ns-httpd
5257 Thu Aug 9 00:30:16 2001 1.98 ns-httpd
5257 Thu Aug 9 01:00:15 2001 4.16 ns-httpd
5257 Thu Aug 9 01:30:15 2001 2.46 ns-httpd
5257 Thu Aug 9 02:00:16 2001 1.00 ns-httpd
5257 Thu Aug 9 02:30:16 2001 0.70 ns-httpd
5257 Thu Aug 9 03:00:17 2001 1.28 ns-httpd
5257 Thu Aug 9 03:30:16 2001 0.86 ns-httpd
5257 Thu Aug 9 04:00:17 2001 1.52 ns-httpd
5257 Thu Aug 9 04:30:16 2001 0.90 ns-httpd
5257 Thu Aug 9 05:00:16 2001 87.10 ns-httpd
# Avg. Utilization: 87.10% higher than 70%.
# Thu Aug 9 05:00:28 2001 www server restarted.
PID TIME CPU% NAME
10158 Thu Aug 9 05:30:15 2001 0.34 ns-httpd
10158 Thu Aug 9 06:00:16 2001 2.66 ns-httpd
10158 Thu Aug 9 06:30:15 2001 0.52 ns-httpd
10158 Thu Aug 9 07:00:16 2001 86.20 ns-httpd
# Avg. Utilization: 86.20% higher than 70%.
# Thu Aug 9 07:00:25 2001 www server restarted.
PID TIME CPU% NAME
11405 Thu Aug 9 07:30:15 2001 4.92 ns-httpd
11405 Thu Aug 9 08:00:16 2001 6.60 ns-httpd
16991 Thu Aug 9 08:30:16 2001 14.48 ns-httpd
11405 Thu Aug 9 09:00:19 2001 12.51 ns-httpd
11405 Thu Aug 9 09:30:16 2001 14.30 ns-httpd
11405 Thu Aug 9 10:00:16 2001 17.70 ns-httpd
11405 Thu Aug 9 10:30:16 2001 18.10 ns-httpd
11405 Thu Aug 9 11:00:16 2001 15.18 ns-httpd
11405 Thu Aug 9 11:30:16 2001 22.18 ns-httpd
11658 Thu Aug 9 12:00:17 2001 17.62 ns-httpd
11405 Thu Aug 9 12:30:16 2001 13.72 ns-httpd
11405 Thu Aug 9 13:00:16 2001 18.22 ns-httpd
11405 Thu Aug 9 13:30:16 2001 14.30 ns-httpd
11405 Thu Aug 9 14:00:16 2001 15.08 ns-httpd
11405 Thu Aug 9 14:30:16 2001 14.62 ns-httpd
11405 Thu Aug 9 15:00:16 2001 14.18 ns-httpd
11405 Thu Aug 9 15:30:16 2001 10.04 ns-httpd
11405 Thu Aug 9 16:00:16 2001 13.34 ns-httpd
11405 Thu Aug 9 16:30:16 2001 16.72 ns-httpd
11405 Thu Aug 9 17:00:16 2001 11.86 ns-httpd
11405 Thu Aug 9 17:30:16 2001 8.84 ns-httpd
11405 Thu Aug 9 18:00:16 2001 5.76 ns-httpd
11405 Thu Aug 9 18:30:16 2001 8.26 ns-httpd
11405 Thu Aug 9 19:00:16 2001 7.68 ns-httpd
11405 Thu Aug 9 19:30:16 2001 4.64 ns-httpd
11405 Thu Aug 9 20:00:16 2001 3.44 ns-httpd
11405 Thu Aug 9 20:30:16 2001 10.34 ns-httpd
11405 Thu Aug 9 21:00:16 2001 4.92 ns-httpd
11405 Thu Aug 9 21:30:16 2001 8.28 ns-httpd
11405 Thu Aug 9 22:00:16 2001 4.92 ns-httpd
11405 Thu Aug 9 22:30:16 2001 3.32 ns-httpd
11405 Thu Aug 9 23:00:16 2001 3.06 ns-httpd
11405 Thu Aug 9 23:30:16 2001 3.12 ns-httpd
11405 Fri Aug 10 00:00:16 2001 4.20 ns-httpd
11405 Fri Aug 10 00:30:15 2001 3.20 ns-httpd
11405 Fri Aug 10 01:00:16 2001 2.34 ns-httpd
11405 Fri Aug 10 01:30:16 2001 4.72 ns-httpd
11405 Fri Aug 10 02:00:16 2001 0.60 ns-httpd
11405 Fri Aug 10 02:30:16 2001 1.26 ns-httpd
11405 Fri Aug 10 03:00:17 2001 85.22 ns-httpd
# Avg. Utilization: 85.22% higher than 70%.
# Fri Aug 10 03:00:29 2001 www server restarted.
PID TIME CPU% NAME
4859 Fri Aug 10 03:30:17 2001 1.42 ns-httpd
4859 Fri Aug 10 04:00:17 2001 0.38 ns-httpd
4859 Fri Aug 10 04:30:15 2001 0.16 ns-httpd
4859 Fri Aug 10 05:00:16 2001 0.34 ns-httpd
4859 Fri Aug 10 05:30:16 2001 0.22 ns-httpd
4859 Fri Aug 10 06:00:16 2001 0.44 ns-httpd
4859 Fri Aug 10 06:30:16 2001 0.54 ns-httpd
4859 Fri Aug 10 07:00:16 2001 3.52 ns-httpd
4859 Fri Aug 10 07:30:15 2001 2.14 ns-httpd
4859 Fri Aug 10 08:00:16 2001 4.74 ns-httpd
4859 Fri Aug 10 08:30:16 2001 14.14 ns-httpd
4859 Fri Aug 10 09:00:17 2001 7.65 ns-httpd
4859 Fri Aug 10 09:30:16 2001 14.46 ns-httpd
4859 Fri Aug 10 10:00:16 2001 14.40 ns-httpd
4859 Fri Aug 10 10:30:16 2001 7.67 ns-httpd
4859 Fri Aug 10 11:00:16 2001 11.72 ns-httpd
4859 Fri Aug 10 11:30:15 2001 14.84 ns-httpd
4859 Fri Aug 10 12:00:16 2001 12.70 ns-httpd
4859 Fri Aug 10 12:30:16 2001 10.70 ns-httpd
4859 Fri Aug 10 13:00:17 2001 75.44 ns-httpd
# Avg. Utilization: 75.44% higher than 70%.
# Fri Aug 10 13:00:28 2001 www server restarted.
PID TIME CPU% NAME
10115 Fri Aug 10 13:30:16 2001 13.78 ns-httpd
10115 Fri Aug 10 14:00:16 2001 10.00 ns-httpd
10115 Fri Aug 10 14:30:16 2001 6.76 ns-httpd
10115 Fri Aug 10 15:00:16 2001 9.42 ns-httpd
10115 Fri Aug 10 15:30:16 2001 82.92 ns-httpd
# Avg. Utilization: 82.92% higher than 70%.
# Fri Aug 10 15:30:27 2001 www server restarted.
PID TIME CPU% NAME
27219 Fri Aug 10 16:00:16 2001 14.63 ns-httpd
27219 Fri Aug 10 16:30:15 2001 11.72 ns-httpd
27219 Fri Aug 10 17:00:16 2001 7.20 ns-httpd
27219 Fri Aug 10 17:30:16 2001 5.38 ns-httpd
27219 Fri Aug 10 18:00:16 2001 2.98 ns-httpd
27219 Fri Aug 10 18:30:16 2001 6.62 ns-httpd
27219 Fri Aug 10 19:00:22 2001 74.63 ns-httpd
# Avg. Utilization: 74.63% higher than 70%.
# Fri Aug 10 19:00:36 2001 www server restarted.
PID TIME CPU% NAME
10103 Fri Aug 10 19:30:16 2001 4.70 ns-httpd
10103 Fri Aug 10 20:00:17 2001 3.77 ns-httpd
10103 Fri Aug 10 20:30:16 2001 4.94 ns-httpd
10103 Fri Aug 10 21:00:16 2001 2.78 ns-httpd
Note: I am still investigating the other suggestions.
Re: HTTP Daemonology
by Bucket (Beadle) on Aug 09, 2001 at 06:39 UTC
Here at work I've used Proc::ProcessTable to make a program that is similar to what you want, except ours looks for processes that have been running at high CPU usage for more than an hour with regular priority. It should be able to do what you want without a problem. It's at CPAN of course.
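For the curious, a hedged sketch of that Proc::ProcessTable approach. The module comes from CPAN (it is not core), so the live scan below is guarded, and the selection logic is a plain function over hashrefs rather than a claim about the real program described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the processes whose CPU percentage exceeds a threshold.
# Works on plain hashrefs so it can be tested without the module.
sub high_cpu {
    my ($procs, $threshold) = @_;
    return grep { defined $_->{pctcpu} && $_->{pctcpu} > $threshold } @$procs;
}

# Live scan, only if Proc::ProcessTable is actually installed.
if ( eval { require Proc::ProcessTable; 1 } ) {
    my $t = Proc::ProcessTable->new;
    my @procs = map { { pid => $_->pid, pctcpu => $_->pctcpu, fname => $_->fname } }
                @{ $t->table };
    for my $p ( high_cpu(\@procs, 70) ) {
        print "$p->{pid} $p->{pctcpu} $p->{fname}\n";
    }
}
else {
    warn "Proc::ProcessTable not installed; skipping live scan\n";
}
```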
Re: HTTP Daemonology
by dga (Hermit) on Aug 09, 2001 at 02:47 UTC
You may also check the memory utilization change over time, as it could be that the server has memory leakage problems, much like the browser of a similar name. If that's the case, an upgrade may address it.
That is of course if this has anything at all to do with the problem in the first place, but a running tally of memory use over time will point this out or clear it from consideration quickly.
Also, if it's memory-related, once you determine the amount of ram gulped to make it slow, you could restart based on that.
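A small sketch of that running memory tally, assuming `ps -e -o pid,rss,fname`-style columns (adjust the command and fields for your platform's ps):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sum resident memory (RSS, in KB) for every process with a given name.
# Assumes three whitespace-separated columns: pid, rss, fname.
sub rss_for {
    my ($ps_output, $name) = @_;
    my $total = 0;
    for my $line ( split /\n/, $ps_output ) {
        my ($pid, $rss, $fname) = split ' ', $line;
        next unless defined $fname && $fname eq $name;
        $total += $rss;
    }
    return $total;   # log this figure over time to spot a leak
}

# live use, mirroring the ps scrape elsewhere in the thread:
# my $kb = rss_for( scalar `ps -e -o pid,rss,fname`, 'ns-httpd' );
# then restart once $kb crosses whatever ceiling you've measured as "slow"
```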
Re: HTTP Daemonology
by clemburg (Curate) on Aug 09, 2001 at 17:43 UTC
Given the response of your Web Admin/Web Services, why not simply restart the daemon/services that get bogged down cyclically (e.g., once a day, once an hour) via some script? As you describe the situation, this should provide a workaround and has the advantage of needing only minimal effort.
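For example, the whole workaround could live in a single crontab entry. The paths below are placeholders; the real stop/start scripts sit wherever your server instance was installed.

```
# m h dom mon dow -- bounce the server quietly at 04:00, before the morning load
0 4 * * * /path/to/server-root/stop && /path/to/server-root/start >/dev/null 2>&1
```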
Disclaimer: yes, this is not the professional way to do it.
But if they don't want to diagnose ... how can you really fix the problem?
Christian Lemburg
Brainbench MVP for Perl
http://www.brainbench.com
Re: HTTP Daemonology
by elwarren (Priest) on Aug 09, 2001 at 17:37 UTC
You would be far better off using the sar, iostat, and vmstat commands to monitor your machine. They will give you much better information about what is happening than top will. They do exactly what you want, without the overhead of starting perl from a cronjob every ten minutes (which would probably throw your stats.)
These tools are not process specific, so I would use these in combination with the output of ps. Then you could monitor the amount of ram per process.
I'm very interested to see if there are any solutions that are more Perl specific. Some of the *::proc modules look promising, but I've never used them.
If the server bogs down after a semi-regular interval you could just stop and start the server every 4 hours from a cronjob. Mercy killing it before it has a chance to kill itself.
HTH
Re: HTTP Daemonology
by scottstef (Curate) on Aug 09, 2001 at 17:43 UTC
You may want to look at setting up spong. It is written in perl and distributed under the Perl Artistic License. From their FAQ:
1. What is Spong?
This is a simple system monitoring package called spong. It has the following features:
client based monitoring (CPU, disk, processes, logs, etc...)
monitoring of network services (smtp, http, ping, pop, dns, etc...)
grouping of hosts (routers, servers, workstations, PCs)
rules based messaging when problems occur
configurable on a host by host basis
results displayed via text or web based interface
history of problems
verbose information to help diagnose problems
It may be a little overkill, but it will do what you want it to do and is very easy to set up.
"The social dynamics of the net are a direct consequence of the fact that nobody has yet developed a Remote Strangulation Protocol." -- Larry Wall