Re: Once AGAIN perl saved my bacon

by sierpinski (Chaplain)
on Sep 08, 2009 at 14:25 UTC ( #794146=note )

in reply to Once AGAIN perl saved my bacon

We have multi-million-dollar applications running on Solaris servers, and one day one of the systems went haywire. Normally it has something like 16 CPUs with 128 cores and 96GB of RAM. Half of the system boards died (or so we thought) and the system screeched to a halt. Heavy swapping and a flood of involuntary context switches (icsw) almost stopped processing entirely. We got the parts replaced and the system back up, but nowhere near soon enough to avoid hefty (in the millions) fines from the government for not having the data available.

It turns out several of the CPUs had already gone offline earlier, and it took the loss of only one system board to cause our issue. If we had replaced the failed parts as the failures occurred, the system would never have gone down as badly as it did. We didn't have any monitoring in place to detect failed components, but now we do. I wrote a massive monitoring script (in Perl, of course) that uses the Expect module to connect to each server, runs a battery of checks, and emails a report to our group twice a day. Now we can find and fix these minor problems before they escalate into major ones, and several of the upper-level executives have been briefed on my work. It's still a work in progress; they are always finding new things for me to check!
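
For anyone wanting to try something similar, here is a minimal sketch of just one such check, not the actual script. It assumes the output of Solaris `psrinfo` has already been collected from each server (over Expect, ssh, or anything else) and that each line looks like `<id> <state> since <timestamp>`. The helper names `failed_cpus` and `build_report`, the host names, and the sample data are all made up for illustration; the connection and email steps are left out.

```perl
use strict;
use warnings;

# Hypothetical helper: given the text printed by `psrinfo`, return a list
# of [cpu_id, state] pairs for every processor that is not "on-line".
sub failed_cpus {
    my ($psrinfo_output) = @_;
    my @failed;
    for my $line (split /\n/, $psrinfo_output) {
        # Assumed line shape: "<id>  <state>  since <date> <time>"
        next unless $line =~ /^\s*(\d+)\s+(\S+)\s+since\b/;
        my ($id, $state) = ($1, $2);
        push @failed, [$id, $state] unless $state eq 'on-line';
    }
    return @failed;
}

# Hypothetical helper: turn per-host psrinfo output into a plain-text
# report body, one line per problem (or one "all clear" line per host).
sub build_report {
    my (%output_by_host) = @_;
    my @lines;
    for my $host (sort keys %output_by_host) {
        my @failed = failed_cpus($output_by_host{$host});
        if (@failed) {
            push @lines, "${host}: CPU $_->[0] is $_->[1]" for @failed;
        }
        else {
            push @lines, "${host}: all CPUs on-line";
        }
    }
    return join("\n", @lines) . "\n";
}

# Made-up sample data standing in for output gathered from each server.
my %sample = (
    hostA => "0\ton-line   since 09/01/2009 08:00:00\n"
           . "1\toff-line  since 09/07/2009 23:14:02\n",
    hostB => "0\ton-line   since 09/01/2009 08:05:00\n",
);

print build_report(%sample);
```

With the sample data above, this flags CPU 1 on hostA as off-line and reports hostB as healthy. Keeping the parsing separate from the collection step means each check can be tested against saved output without touching a live server.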
  /\/\ Sierpinski
