Monolith: A Clever Tool For Monitoring Regularly Scheduled Tasks
by bpoag (Monk) on Dec 19, 2013 at 17:06 UTC ( [id://1067838]=CUFP )
Where I work, we've recently had a big push to improve and modernize our approach to systems monitoring. I thought I'd take a little time to share some of the approaches we've come up with, and how they're benefiting us.

In most medium-to-large production environments, you generally find one or more systems that run regularly scheduled jobs. Cron does a nice job of this, but suffers from one fatal flaw: it's not human. :) Should one of these regularly scheduled jobs kick off and suddenly die, or otherwise fail to run at all, it's up to you or cron to funnel the results of stdout/stderr to someone, or up to the script itself to generate some sort of notification that it was unable to run successfully. But what happens when things stop working, and it's only after days, weeks, or months that anyone notices?

We've actually developed a third way, called Monolith, to detect when this happens. Suppose you gave each of your regularly scheduled jobs the ability to call home to a centralized database. After a while, maybe after 3 or 4 executions, a track record would begin to develop: one that could tell you when the next invocation of that command can be expected to show up.

Enter Monolith. Monolith is a two-part tool. The first part, a simple call-home script, takes just one argument -- an "entity name", which is usually the name of the script itself. When run, it makes a connection to a MySQL database and adds a row to a table saying, "Hi! I'm {entityName} on {host}, and it's currently {time} where I am".

The other part of the tool is a script that watches this database, looking for instances where an entity has stopped calling home. In other words, if you know that entity "foobar.pl" usually checks in every 200 seconds, and the last time it checked in was more than 200 seconds ago, you know that there's a problem with foobar.pl, and an alert can be generated to that effect.
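To make the watcher idea concrete, here is a minimal sketch of the overdue check, not the actual Monolith code. It assumes the check-in times for one entity have already been pulled from the database as epoch seconds in ascending order, and that the "usual" interval is simply the average gap between them. The names expected_interval and is_overdue, the tolerance parameter, and the sample timestamps are all made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Average gap, in seconds, between consecutive check-ins.
# Takes epoch timestamps in ascending order. Returns undef when there
# are fewer than two, since no interval can be measured yet.
sub expected_interval {
    my @times = @_;
    return if @times < 2;
    my @gaps = map { $times[$_] - $times[ $_ - 1 ] } 1 .. $#times;
    return sum(@gaps) / scalar(@gaps);
}

# True when the entity has been silent for longer than its usual
# interval plus some slack. $tolerance is a fraction: 0 means no slack.
sub is_overdue {
    my ($times, $now, $tolerance) = @_;
    my $interval = expected_interval(@$times);
    return 0 unless defined $interval;    # no track record yet
    my $silence = $now - $times->[-1];
    return $silence > $interval * (1 + $tolerance) ? 1 : 0;
}

# Hypothetical history: foobar.pl checking in every 200 seconds.
my @checkins = (1000, 1200, 1400, 1600);
for my $now (1750, 1900) {
    printf "t=%d: %s\n", $now,
        is_overdue(\@checkins, $now, 0)
        ? 'ALERT: foobar.pl is overdue'
        : 'ok';
}
```

Averaging the gaps is only one way to build the track record; a median, or the gaps from just the last few runs, would cope better with a job whose schedule changes.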
Incidentally, we set our detection threshold at 20%, meaning that if something known to check in every 100 seconds hasn't checked in for the past 120 seconds, an alert is generated.

Here's the call-home script:
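The script itself isn't included in this copy of the post. As a stand-in, here is a minimal sketch of what such a call-home routine might look like. It is not the original code: the table name checkins, its column names, the call_home function, and the connection details in the comments are all assumptions. It only relies on DBI's do method with bind values, and on Sys::Hostname from the core distribution.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Sys::Hostname qw(hostname);

# Record one check-in for $entity through an already-connected DBI handle.
# The table "checkins" and its columns are assumptions for this sketch.
sub call_home {
    my ($dbh, $entity) = @_;
    my @row = ($entity, hostname(), time());    # who, where, when (epoch)
    $dbh->do(
        'INSERT INTO checkins (entity, host, checked_in_at) VALUES (?, ?, ?)',
        undef, @row,
    );
    return @row;
}

# Typical use at the end of a cron job (connection details are placeholders):
#   use DBI;
#   my $dbh = DBI->connect('DBI:mysql:database=monolith;host=dbhost',
#                          'user', 'password', { RaiseError => 1 });
#   call_home($dbh, 'foobar.pl');
```

Passing the handle in, rather than connecting inside the function, keeps the sketch free of credentials and makes it easy to exercise with a stub object.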
Here's what our front-end to Monolith looks like, in-house: http://i.imgur.com/kf7nYA6.png

Our organization now has 200+ more pairs of automated eyes carefully ensuring that everything we have is working as expected, and alerting us when it's not. It has already borne substantial fruit: in cases where something systemic had broken, it affected the ability of several scripts on several different hosts to run. It helped greatly to have a visual map of what was broken, so that we could be 100% confident that we'd fixed the problem in every place.

tl;dr - We have a tool that tracks regularly scheduled tasks to ensure they're calling home at regular intervals. When they deviate from the expected drum pattern they've created for themselves over time, or stop phoning home altogether, we know about it immediately, versus being caught off-guard and finding out at some point down the road.

Cheers,
Bowie