We have multimillion-dollar applications running on Solaris servers, and one day one of the systems went haywire. Normally it has something like 16 CPUs with 128 cores and 96GB of RAM. Half of the system boards died (or so we thought) and the system just screeched to a halt. There was so much swapping, and so many involuntary context switches (icsw), that processing almost stopped completely. We got the parts replaced and the system back up, but nowhere near soon enough to avoid hefty (in the millions) fines from the government for not having this data available.
in reply to Once AGAIN perl saved my bacon
It turns out several of the CPUs had already gone offline earlier, and we had lost only one system board when the trouble started. If we had replaced the failed parts as they occurred, losing that board would never have brought the system down as badly as it did. We didn't have any monitoring in place to detect failed components, but now we do. I wrote a massive monitoring script (in Perl, of course) that uses the Expect module to connect to each server, run a battery of checks, and then email a report to our group twice a day. Now we can find and fix these minor problems before they escalate into major ones, and several of the upper-level executives have been briefed on my work. It's still a work in progress; they are always finding new things for me to check!
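As a rough illustration of what one such check might look like (this is not the actual script, just a minimal sketch), the code below scans `psrinfo`-style output for processors that are not on-line and turns the result into report lines. The function names `failed_cpus` and `cpu_report`, the host name, and the sample data are all made up for the example, and it assumes the usual `<id> <state> since <date> <time>` line shape. Fetching the output from each server (for instance through an Expect-driven session) and mailing the report are left out.

```perl
use strict;
use warnings;

# Return [id, state] pairs for every processor that is not "on-line".
# Assumption: each line looks like "<id> <state> since <date> <time>".
sub failed_cpus {
    my ($psrinfo_output) = @_;
    my @failed;
    for my $line (split /\n/, $psrinfo_output) {
        # Skip anything that does not look like a processor status line.
        next unless $line =~ /^\s*(\d+)\s+(\S+)\s+since\b/;
        my ($id, $state) = ($1, $2);
        push @failed, [$id, $state] if $state ne 'on-line';
    }
    return @failed;
}

# Build the plain-text report lines for one host.
sub cpu_report {
    my ($host, $psrinfo_output) = @_;
    my @failed = failed_cpus($psrinfo_output);
    return "$host: all CPUs on-line\n" unless @failed;
    return join '', map { "$host: CPU $_->[0] is $_->[1]\n" } @failed;
}

# Sample data invented for illustration; a real run would capture this
# from each server instead.
my $sample = <<'END';
0       on-line   since 01/15/2009 10:12:01
1       off-line  since 02/03/2009 04:40:17
2       on-line   since 01/15/2009 10:12:01
3       faulted   since 02/20/2009 22:05:49
END

print cpu_report('hypothetical-host', $sample);
```

Keeping the parsing separate from the transport means each check can be tried against canned output before it is pointed at a live server.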