PerlMonks  

Monolith: A Clever Tool For Monitoring Regularly Scheduled Tasks

by bpoag (Monk)
on Dec 19, 2013 at 17:06 UTC (#1067838, CUFP)

Where I work, we've recently had a big push to improve and modernize our approach to systems monitoring. I thought I'd take a little time to share some of the approaches we've come up with, and how they're benefiting us.

In most medium-to-large production environments, you generally find one or more systems that run regularly scheduled jobs. Cron does a nice job of this, but suffers from one fatal flaw: it's not human. :) Should one of these regularly scheduled jobs kick off and suddenly die, or fail to run at all, it's up to you or cron to funnel stdout/stderr to someone, or up to the script itself to generate some sort of notification that it was unable to run successfully. But what happens when things stop working, and nobody notices for days, weeks, or months? We've developed a third way, called Monolith, to detect when this happens.

Suppose you gave each of your regularly scheduled jobs the ability to call home to a centralized database. After a while, maybe after 3 or 4 executions, a track record would begin to develop: one that could tell you when the next invocation of that command can be expected to show up.

Enter Monolith. Monolith is a two-part tool. The first part, a simple call-home script, takes just one argument: an "entity name", which is usually the name of the script itself. When run, it connects to a MySQL database and adds a row to a table saying, "Hi! I'm {entityName} on {host}, and it's currently {time} where I am". The other part of the tool is a script that watches this database, looking for instances when an entity has stopped calling home. In other words, if you know that entity "foobar.pl" usually checks in every 200 seconds, and the last time it checked in was more than 200 seconds ago, you know that there's a problem with foobar.pl, and an alert can be generated to that effect. Incidentally, we set our detection threshold at 20%, meaning that if something known to check in every 100 seconds hasn't checked in for the past 120 seconds, an alert is generated.
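
A minimal sketch of that overdue check, in Perl. This is not the actual watcher script, which isn't shown in this post; the name is_overdue and its parameters are made up for illustration, and only the 20% tolerance comes from the description above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of the real Monolith watcher.
# $interval  - expected seconds between check-ins
# $lastSeen  - epoch time of the most recent check-in
# $now       - current epoch time
# $tolerance - allowed slack as a fraction of $interval (0.20 = 20%)
sub is_overdue
{
    my ($interval, $lastSeen, $now, $tolerance) = @_;
    $tolerance = 0.20 unless defined $tolerance;

    my $silence = $now - $lastSeen;
    return $silence > $interval * (1 + $tolerance) ? 1 : 0;
}

# An entity that normally checks in every 100 seconds:
my $now = time;
print is_overdue(100, $now - 110, $now), "\n";   # still inside the 20% slack
print is_overdue(100, $now - 121, $now), "\n";   # past the 120-second mark
```

Passing $now in, rather than calling time inside the function, keeps the check easy to test against fixed timestamps.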

Here's the call-home script:

#!/usr/bin/perl
##
## Monolith written 082213 by Bowie J. Poag
##
## Monolith is a mechanism that allows regularly-scheduled scripts to be
## monitored remotely. For every entity (script, command, whatever) you
## want monitored, call this command.
##
## Usage: ./monolith.pl <scriptname>
##
## Example: /usr/local/bin/monolith.pl tripwire
##
## English: Make an entry in the Monolith database saying that Tripwire just ran.
##

use Mysql;

$entityName=$ARGV[0];

if (!defined($entityName)) ## Nothing to report without an entity name.
{
    print "Usage: $0 <scriptname>\n";
    exit 1;
}

$monolithDBHandle=Mysql->connect('tmcpmonitordb','Monolith','xxxxx','xxxxxxxxxxx');

if (!$monolithDBHandle) ## Bail out here; every query below needs a live handle.
{
    print "Monolith: Unable to connect to DB.\n";
    exit 1;
}

$timeStamp=time;
$hostName=`hostname`;
chomp($hostName);

$x=0;        ## Number of matching rows found in Entities.
$lastSeen=0; ## Stays 0 on a first check-in, so the first delta recorded below is the raw timestamp.

## NOTE: values are interpolated straight into the SQL, so entity names must not contain quotes.
$checkExistQuery=$monolithDBHandle->query("SELECT * FROM Entities WHERE entityName='$entityName' AND entityHostName='$hostName';");

while (@checkExist=$checkExistQuery->fetchrow_array)
{
    $lastSeen=$checkExist[3];
    $x++;
}

if ($x==0) ## No rows returned. Hmm. This means we're checking in for the first time, so, let's create a new entry for ourselves in the DB.
{
    $updateMonolith=$monolithDBHandle->query("INSERT INTO Entities (entityName,entityHostName,entityLastSeen,entityFrozen) VALUES ('$entityName','$hostName','$timeStamp','0');");
}
else ## We've been seen before, so just refresh our last-seen time.
{
    $updateMonolith=$monolithDBHandle->query("UPDATE Entities SET entityLastSeen='$timeStamp' WHERE entityName='$entityName' AND entityHostName='$hostName';");
}

## Log this check-in, along with the time elapsed since the previous one.
$entityDelta=$timeStamp-$lastSeen;

$updateMonolith=$monolithDBHandle->query("INSERT INTO Events (timeStamp,hostName,reportingEntity,reportingDelta) VALUES ('$timeStamp','$hostName','$entityName','$entityDelta');");

We have taken this idea, the ability to predict when something should have called home, but hasn't, and greatly expanded upon it. Monolith is now a status dashboard that gives near-realtime status on over 200 different entities running across about 30 different hosts. To begin monitoring anything, all it takes is adding a single line to the script you want monitored, and you're done. A more clever use would be to only call home to Monolith if the script was successful; that way, if the script ran but failed operationally for some reason, that can be detected and resolved. Anything which runs at regular intervals, and whose state can be conveyed in terms of on/off, successful/not successful, or present/not present, can be visualized.

Here's what our front-end to Monolith looks like, in-house:

http://i.imgur.com/kf7nYA6.png

Our organization now has 200+ more pairs of automated eyes carefully ensuring that everything we have is working as expected, and alerting us when it's not. It has already borne substantial fruit. In instances where something systemic had broken, it affected the ability of several scripts on several different hosts to run. It helped greatly to have a visual map of what was broken, so that we could be 100% confident that we'd fixed the problem in every place.

tl;dr - We have a tool that tracks regularly scheduled tasks to ensure they're calling home at regular intervals. When they deviate from the expected drum pattern they've created for themselves over time, or stop phoning home altogether, we know about it immediately, versus being caught off-guard and finding out at some point down the road.

Cheers,

Bowie

Re: Monolith: A Clever Tool For Monitoring Regularly Scheduled Tasks
by bpoag (Monk) on Jan 04, 2014 at 00:03 UTC
    Someone asked in a different thread:

    This does sound like something that would be good for monitoring automated scripts and processes that now send emails where I work. Could you expand on how this system differs from Nagios and related tools? Nagios uses (perhaps completely custom) scripts and tools to provide a status, and I'm pretty sure it has the ability to store historical data in MySQL. Its default display also looks similar to your display board, with indicators of green/yellow/red. Understand, I'm not trying to be one of those people saying "why did you do this when you could have used X", I'm trying to think how your system differs, so that if I can get time to do an implementation at my own work, I don't end up recreating Nagios (badly).

    We looked at Nagios initially, and were convinced pretty quickly that it was unmanageably obtuse. What kind of sealed it for us was a flowchart we found, part of the Nagios documentation, that explained the rat's nest of configuration files that needed to be tweaked in order to accomplish even the most basic monitoring tasks. Nagios may have been a good solution when it first came out, but by virtue of trying to be all things to all people, it seems to have grown to the point where it ceases to be effective at its core task. Nagios has become the iTunes of monitoring. Sometimes you just want to play a song, not manage your iPad firmware and shop for gift cards.

    Monolith trumps Nagios in several areas. First and foremost is ease of deployment. Suppose I have a script that's being called by cron somewhere. All I need to do is add a single line to that script, and that's it. The call-home script takes care of informing Monolith that it should be watched. Usually, you want to place this single call-home line at the end of your script, or at the point in the script where operational success versus operational failure is determined. More on that in a moment. Literally, all you do is add one line:

    system("/usr/local/bin/monolith.pl myscript");

    When invoked, monolith.pl looks to see what the local hostname is where it's running. It uses this in conjunction with the argument you supply ("myscript" in this case) to check whether it has called home before. If it hasn't, it adds "myscript on {hostname}" to the list of entities whose "drumbeat" is to be monitored. If the database already has mentions of "myscript on {hostname}", it simply updates that entity's last-seen time and adds a new row to the event log saying "Hi, I'm myscript on {hostname}. Just checking in. It's currently {time} right now." And that's it. As I described above, the dashboard piece of the solution looks at this table and, by virtue of the track record being created by a script calling home repeatedly, can deduce when the script is noticeably overdue.

    It's like a parent with a kid in college: they expect their child to call home on Sunday nights. The kid has called home every Sunday night at 8:00 PM for the past 6 months. 8 PM Sunday rolls around, and the phone doesn't ring. After about 8:30, the parents become concerned. After 10 PM, they get worried and start thinking something's wrong. Monolith works on the same premise. It looks at who's calling home, and how frequently they do it, and if the thing calling home strays far enough from that established pattern, it throws a notification that there's something wrong. (BTW, Monolith will only begin actively monitoring an entity after that entity has called home at least 4 or 5 times, so that a reliable call-home frequency can be calculated.)
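
    One way the "established pattern" could be derived from the track record is sketched below. The post doesn't say how Monolith actually calculates the call-home frequency, so the use of a median, the name expected_interval, and the timestamps are all assumptions for illustration; only the idea of waiting for at least 4 check-ins comes from the text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: estimate an entity's normal check-in interval from
# its recorded check-in times (epoch seconds, oldest first). Returns undef
# until there are at least $minCheckIns entries to go on.
sub expected_interval
{
    my ($checkIns, $minCheckIns) = @_;
    $minCheckIns = 4 unless defined $minCheckIns;

    return if @$checkIns < $minCheckIns;

    # Gaps between consecutive check-ins.
    my @gaps = map { $checkIns->[$_] - $checkIns->[$_ - 1] } 1 .. $#$checkIns;

    # Median gap, so one unusually late run doesn't skew the estimate much.
    my @sorted = sort { $a <=> $b } @gaps;
    my $mid    = int(@sorted / 2);
    return @sorted % 2 ? $sorted[$mid] : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

my @history  = (1000, 1200, 1401, 1600, 1800);   # made-up timestamps
my $interval = expected_interval(\@history);
print defined $interval ? "expected every $interval seconds\n" : "not enough data yet\n";
```

    The reportingDelta values the call-home script writes to the Events table would be a natural source for the gaps, though the first delta for a new entity would need to be skipped.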

    This is the second area where Monolith trumps Nagios: the model/method of monitoring. In Monolith, the process of monitoring entities is no longer reliant upon a given script's ability to inform you of its own status. It is deductive, versus reactive. In a reactive model, you can't always guarantee that the thing responsible for communicating its status will do so, or be capable of doing so. In a deductive model, you can determine whether something is running successfully or not, completely independently of the condition of the network, the host, or the script itself. Nagios won't be able to help you much if the thing responsible for reporting is unable to call home for any of a variety of reasons: network outage, broken modules/libraries, unforeseen conditions, bugs. These sorts of things potentially stand in the way of the script notifying you of trouble. By moving the point of responsibility up the chain, the script is relieved of having to report its own failures to the user.

    Anything which can be expressed as a Good/Bad, On/Off, Up/Down, Present/Not Present, or Success/Failure state can be conveyed in Monolith simply by instructing a script to call home in positive conditions, and not to call home in negative conditions. When the script stops calling home, Monolith notices it, and informs you.
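
    A rough sketch of that "call home only on success" pattern follows. The wrapper name run_and_report and the callback argument are invented for this example; in real use the callback would be the one-line system() call shown earlier.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper: run a command and call home only if it exited 0.
# $callHome is a code reference so the reporting step is easy to swap out.
sub run_and_report
{
    my ($command, $callHome) = @_;

    my $status = system(@$command);   # 0 means the command ran and exited cleanly
    if ($status == 0)
    {
        $callHome->();
        return 1;
    }
    return 0;                         # stay silent; the missed check-in is the alert
}

# In production the callback would be the one-liner from the post:
#   sub { system("/usr/local/bin/monolith.pl myscript") }
# Here it just prints, and the "job" is a trivial perl one-liner.
my $reported = run_and_report([$^X, '-e', 'exit 0'], sub { print "called home\n" });
print "reported: $reported\n";
```

    Nothing special happens on failure: the script just doesn't check in, and the watcher flags the entity once it is overdue.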

    In my experience, deductive monitoring is way, way better than reactive monitoring. Nagios, at least as far as I understand it, is incapable of anything other than reactive monitoring; it can only tell you about information it receives, not information that it has deduced on its own.

    (Fun side note: I read a book recently, written by a guy named Bill Bruford, the drummer for "Yes" from 1969-1972 or so. His approach to drumming is kind of the same approach that Monolith takes toward monitoring. Bruford considers drumming to be the management of the time in between drum beats, rather than the execution of the drum beats themselves. It's sort of an inverse view of the same activity, and one that opens the door to all sorts of different creative possibilities.)

      sounds interesting. any plan to open-source the server component? thx, j
      Just found something similar...

      It looks like Dead Man's Snitch is attempting to provide a service that does something similar to what you've built. I came across it on the excellent One Thing Well site.

Node Type: CUFP [id://1067838]
Approved by boftx
Front-paged by Arunbear