|Perl: the Markov chain saw|
Tripwire: A Tool For Intelligent Parsing of Syslog Messagesby bpoag (Monk)
|on Dec 20, 2013 at 17:59 UTC||Need Help??|
As I mentioned in an earlier post, there's been a big push where I work to extend the horizon a bit when it comes to our systems monitoring. As a result, we've come up with a number of in-house tools to accomplish this. One is called 'Tripwire'.
(BTW, I was several weeks into development before someone casually mentioned that there's already a product out there called Tripwire. Oops! In any event, we still continue to refer to it as such, so, forgive me for any confusion. We're not talking about the network administration tool.)
Tripwire's job is simple. For anything that arrives on our syslogd server, analyze it, and tell us whether it's something out of the ordinary/looks suspicious/looks like something we should be concerned about. To pull this off, however, is a bit more complicated.
At the top of the chain, our syslogd server funnels everything it receives into a MySQL database. This is actually a feature of syslogd, not something we came up with. In any event, this database gets populated non-stop, at a rate of about 200 messages per second...The entirety of everything we can think to point syslog at, from SAN devices to VMWare, to host OS'es to our own apps, they are all told to funnel diagnostic messages to a syslogd server where they eventually end up in a MySQL database we can draw from.
Every 5 minutes, Tripwire wakes up, and grabs a list of "suspect words" we've given it (like 'fail', or 'error', or 'critical', etc.) along with a list of keywords and phrases we've instructed Tripwire to ignore, along with the last 5 minutes worth of messages that arrived in the MySQL database.
In Perl, we compare each line against the list of suspect words. The results of this comparison are then passed to another routine that attempts to exclude them based on the list of words and phrases we've told it to ignore. This list is rather long--At the time of this writing, there are over 300 rule filters in place that Tripwire compares every message against. The survivors of this filtering process are deemed worthy of notifying a human about.
By this point, you might be asking why we do it this way...The reason is, the only thing we know when it comes to error messages, is that there's no way of knowing what they're going to look like in advance. In other words, we can't go fishing for a specific fish; we need to, instead, catch all fish, and throw back the fish we don't want; The only thing we do know is what we don't care about.
As time goes on, and as the stack of filters becomes more and more efficient, the odds of getting a message that is both a) legitimate, and b) something we've never seen before gets smaller and smaller. After about 3 months, the only things we hear from Tripwire are nearly always legitimate, because it has become increasingly difficult for a message to survive the filtration process.
Occasionally, we do still get messages from Tripwire containing syslog messages that we don't care about, of course.. Each email contains an URL that, when clicked, instructs Tripwire to ignore similar messages in the future. The net result is an engine that grows increasing efficient with time thanks to a little encouragement from humans.
As of December 2013, Tripwire has parsed 2.1 billion syslog messages, and is 99.719% accurate when it comes to telling the difference between real problems and things that simply look real, but aren't. We collect all sorts of engine statistics as we go, in order to shed more light on problems when they do occur; For example, we track the engine behavior in terms of how many messages were received in the past 5 minutes, in graph form. If we see a spike in the graph indicating this number is unusually high, we have higher confidence as a team that the problem is real. We also track Tripwire's own decision making process, referring to it as a "signal to noise ratio".... Error messages that survive filtration are more likely to be legitimate if they occur within a wave of "suspect" messages. Normally, out of the 200 or so messages we recieve per second, about 1.5 messages per second catch Tripwire's attention and warrant further analysis. If the rate if incoming "suspect" messages is high, the likelyhood of messages surviving filtration being legitimate problems is also high. This sort of "confidence value" gives us an edge when dealing with the sort of on-the-spot problems that occur every so often where it's difficult to divine where within the organization the problem is occuring.
Here's a screenshot of the web front-end we built for our engine:
There are plenty of opportunities for this model to grow beyond its current form. It wouldn't be too difficult, for example, for Tripwire to do its own functional analysis on certain problems, and decide on its own whether or not they're legitimate. For example, it's not uncommon for network glitches to temporarily render a box unreachable for a moment. If one of our other monitoring tools happen to ping this box at the exact same moment, it may think the host is offline, and generate a syslog message to that effect; Tripwire can be taught to look for this pattern via a simple regex, and ping the same host on it's own, for example, to see if the problem still exists before notifying us about it.
As a side perk, it's also kind of hilarious when Tripwire finds things wrong with our systems via syslog before the vendor-supplied monitoring tools do, or, finds things that the vendor-supplied monitoring solutions either miss, or fail to report. :)
tl;dr - We have a mechanism in-house that parses our incoming syslog stream, looking for keywords that look like they may be problems, and filtering them against a set of exclusion rules we've provided the engine. This is as close as we can get to having a crystal ball when it comes to monitoring, and its output gives us a heads-up on events that vendor-supplied monitoring tools cant, or wont. Proactive is better than reactive!