Tripwire: A Tool For Intelligent Parsing of Syslog Messages

by bpoag (Monk)
on Dec 20, 2013 at 17:59 UTC

As I mentioned in an earlier post, there's been a big push where I work to extend the horizon a bit when it comes to our systems monitoring. As a result, we've come up with a number of in-house tools to accomplish this. One is called 'Tripwire'.

(BTW, I was several weeks into development before someone casually mentioned that there's already a product out there called Tripwire. Oops! In any event, we still continue to refer to it as such, so, forgive me for any confusion. We're not talking about the network administration tool.)

Tripwire's job is simple: analyze everything that arrives on our syslogd server and tell us whether it's out of the ordinary, looks suspicious, or looks like something we should be concerned about. Pulling this off, however, is a bit more complicated.

At the top of the chain, our syslogd server funnels everything it receives into a MySQL database. This is actually a feature of syslogd, not something we came up with. In any event, this database gets populated non-stop, at a rate of about 200 messages per second. Everything we can think to point syslog at, from SAN devices to VMWare to host OSes to our own apps, is told to funnel diagnostic messages to the syslogd server, where they eventually end up in a MySQL database we can draw from.

Every 5 minutes, Tripwire wakes up and grabs a list of "suspect words" we've given it (like 'fail', 'error', 'critical', etc.), a list of keywords and phrases we've instructed it to ignore, and the last 5 minutes' worth of messages that arrived in the MySQL database.
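That 5-minute sweep boils down to a few queries. Here's a minimal sketch of that step in Perl/DBI, with hypothetical table and column names (suspect_words, exclusions, and a messages table with a received_at timestamp); the real schema will differ:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Connect to the syslog database (DSN, credentials, and table names are assumptions).
    my $dbh = DBI->connect('dbi:mysql:database=syslog;host=localhost',
                           'tripwire', 'secret', { RaiseError => 1 });

    # Inclusion list: words that make a message "suspect".
    my $suspect_words = $dbh->selectcol_arrayref('SELECT word FROM suspect_words');

    # Exclusion list: keywords and phrases we've told the engine to ignore.
    my $ignore_rules  = $dbh->selectcol_arrayref('SELECT phrase FROM exclusions');

    # Everything that arrived in the last 5 minutes.
    my $recent = $dbh->selectcol_arrayref(
        'SELECT msg FROM messages WHERE received_at >= NOW() - INTERVAL 5 MINUTE');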

In Perl, we compare each line against the list of suspect words. The results of this comparison are then passed to another routine that attempts to exclude them based on the list of words and phrases we've told it to ignore. This list is rather long--at the time of this writing, there are over 300 rule filters in place that Tripwire compares every message against. The survivors of this filtering process are deemed worthy of notifying a human about.
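The two-pass comparison reduces to a couple of nested greps. A rough, self-contained sketch with made-up sample data (the notify step is just a print here):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sample inputs; in the real engine these come from the MySQL queries above.
    my @suspect_words = ('fail', 'error', 'critical');
    my @ignore_rules  = ('failover completed successfully', 'error count: 0');
    my @recent        = (
        'app03: nightly job finished, error count: 0',
        'sanbox01: critical: disk 14 failed',
        'switch07: link up on port 3',
    );

    # Pass 1: keep anything containing a suspect word (case-insensitive).
    my @suspects = grep {
        my $line = $_;
        grep { $line =~ /\Q$_\E/i } @suspect_words;
    } @recent;

    # Pass 2: throw away anything matching an exclusion rule.
    my @survivors = grep {
        my $line = $_;
        !grep { $line =~ /\Q$_\E/i } @ignore_rules;
    } @suspects;

    # Whatever is left is deemed worthy of notifying a human about.
    print "NOTIFY: $_\n" for @survivors;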

By this point, you might be asking why we do it this way. The reason is that the only thing we know about error messages is that there's no way of knowing what they're going to look like in advance. In other words, we can't go fishing for a specific fish; we need to catch all the fish and throw back the ones we don't want. The only thing we do know is what we don't care about.

As time goes on, and as the stack of filters becomes more and more efficient, the odds of getting a message that is both a) legitimate, and b) something we've never seen before get smaller and smaller. After about 3 months, the things we hear from Tripwire are nearly always legitimate, because it has become increasingly difficult for a message to survive the filtration process.

Occasionally, we do still get emails from Tripwire containing syslog messages that we don't care about, of course. Each email contains a URL that, when clicked, instructs Tripwire to ignore similar messages in the future. The net result is an engine that grows increasingly efficient with time, thanks to a little encouragement from humans.
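The "click to ignore" link can be as simple as pointing at a small CGI script that appends a new exclusion phrase. A hypothetical sketch -- the script name, parameter, and table are assumptions, not the actual Tripwire interface:

    #!/usr/bin/perl
    # ignore.pl -- hypothetical handler behind the "ignore similar messages" link
    use strict;
    use warnings;
    use CGI;
    use DBI;

    my $q      = CGI->new;
    my $phrase = $q->param('phrase');   # phrase extracted from the offending message

    my $dbh = DBI->connect('dbi:mysql:database=syslog', 'tripwire', 'secret',
                           { RaiseError => 1 });

    # Add the phrase to the exclusion list so future matches are filtered out.
    $dbh->do('INSERT INTO exclusions (phrase) VALUES (?)', undef, $phrase);

    print $q->header('text/plain'),
          "Tripwire will now ignore messages matching: $phrase\n";

The email for each surviving message would then just include a link to that script, with the candidate phrase URL-encoded as the parameter.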

As of December 2013, Tripwire has parsed 2.1 billion syslog messages, and is 99.719% accurate when it comes to telling the difference between real problems and things that simply look real, but aren't. We collect all sorts of engine statistics as we go, in order to shed more light on problems when they do occur. For example, we graph how many messages were received in the past 5 minutes; if we see a spike indicating this number is unusually high, we have higher confidence as a team that the problem is real.

We also track Tripwire's own decision-making process, referring to it as a "signal to noise ratio". Error messages that survive filtration are more likely to be legitimate if they occur within a wave of "suspect" messages. Normally, out of the 200 or so messages we receive per second, about 1.5 messages per second catch Tripwire's attention and warrant further analysis. If the rate of incoming "suspect" messages is high, the likelihood that messages surviving filtration are legitimate problems is also high. This sort of "confidence value" gives us an edge when dealing with the on-the-spot problems that occur every so often, where it's difficult to divine where within the organization the problem is occurring.
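The per-window counts and the signal-to-noise style confidence value are straightforward bookkeeping. A hedged sketch, with illustrative numbers and a made-up threshold:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Per-window engine statistics; in practice the counts come from the filtering pass.
    my $total    = 60_000;   # everything received (~200 msgs/sec over 5 minutes)
    my $flagged  = 450;      # messages that contained a suspect word
    my $survived = 2;        # messages that survived all exclusion filters

    my $suspect_rate = $total ? $flagged / $total : 0;

    # Baseline: roughly 1.5 suspect messages/sec out of ~200 messages/sec.
    my $baseline   = 1.5 / 200;
    my $confidence = $suspect_rate > 3 * $baseline ? 'high' : 'normal';   # threshold is made up

    printf "window: %d msgs, %d suspect, %d survived, confidence: %s\n",
           $total, $flagged, $survived, $confidence;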

Here's a screenshot of the web front-end we built for our engine:

http://imgur.com/xlh4Ov6

There are plenty of opportunities for this model to grow beyond its current form. It wouldn't be too difficult, for example, for Tripwire to do its own functional analysis on certain problems and decide on its own whether or not they're legitimate. For example, it's not uncommon for network glitches to temporarily render a box unreachable for a moment. If one of our other monitoring tools happens to ping this box at that exact moment, it may think the host is offline and generate a syslog message to that effect. Tripwire can be taught to look for this pattern via a simple regex and, for example, ping the same host on its own to see if the problem still exists before notifying us about it.
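A sketch of that kind of follow-up check using Net::Ping; the regex for spotting "host unreachable" messages and the hostname capture are guesses at what such messages might look like:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Net::Ping;

    # Re-check a "host unreachable" style message before bothering a human.
    sub still_a_problem {
        my ($msg) = @_;
        if ($msg =~ /host (\S+) (?:is )?unreachable/i) {
            my $host = $1;
            my $p    = Net::Ping->new('tcp', 2);   # 2-second timeout
            my $up   = $p->ping($host);
            $p->close;
            return !$up;    # only a real problem if the host is still down
        }
        return 1;           # not a pattern we know how to re-check; keep it
    }

    # Example: filter the survivors from the earlier sketch.
    my @survivors = ('monitor02: host db17.internal is unreachable');
    @survivors = grep { still_a_problem($_) } @survivors;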

As a side perk, it's also kind of hilarious when Tripwire finds things wrong with our systems via syslog before the vendor-supplied monitoring tools do, or finds things that those tools either miss or fail to report. :)

tl;dr - We have a mechanism in-house that parses our incoming syslog stream, looking for keywords that look like they may be problems, and filtering them against a set of exclusion rules we've provided the engine. This is as close as we can get to having a crystal ball when it comes to monitoring, and its output gives us a heads-up on events that vendor-supplied monitoring tools can't, or won't. Proactive is better than reactive!

Cheers,
Bowie J. Poag

Re: Tripwire: A tool for intelligent parsing of syslog messages
by MidLifeXis (Prior) on Dec 20, 2013 at 19:21 UTC

    Other than the collision with this Tripwire, I can see this being very useful. I have used this method in the past, although in a much more basic implementation, when I was managing a unix network providing services to the campus.

    --MidLifeXis

      Hey! Thanks--I love it, personally. It feels like we have a crazy good radar for system health now. The only drawback is that it sort of takes time for the engine to "prime" itself, for lack of a better word. Several months' worth of day-to-day events have to happen in order to really pare down the noise. Separating the wheat from the chaff simply takes a lot of time.

      After I initially built the engine, I probably spent the next 8 weeks or so allowing it to send things it felt were important to me, and me only. I built probably 70 or 80 filters before I felt it was useful for day-to-day consumption, and wouldn't spam the rest of the team with ignorable messages too often. Much of it has to do, like I said, with the fact that you never really know what an error message looks like. In enterprise environments, particularly, the potential list of error/failure/panic message permutations is enormous. It won't work if the engine is constructed to simply cherry-pick the kinds of error messages you want to see, because the remainder will fly by undetected. The best approach is reductive: collect what looks suspicious, but then tell the engine in detail, on an ongoing basis, what you don't care about. It's better to receive an email with an innocuous failure message than it is to have a failure message go completely undetected.

      Thanks for the compliment, btw. :) I wish I could share the engine code, but I can't. My employer's property, obviously... meh.
Re: Tripwire: A Tool For Intelligent Parsing of Syslog Messages
by zentara (Archbishop) on Dec 21, 2013 at 14:44 UTC
      Sorry. I guess I should have made it clear--the code I've written doesn't legally belong to me. It was written on my employer's time, with my employer's workstation, after all. I can't share it. What is free, however, is the description--from which a working model can be built and expanded upon. The engine code weighs in at a little over 300 lines, and could probably be done in less space than that. All that's needed from there is a database to hold a list of inclusions, a list of exclusions, and the messages themselves for the engine to operate upon.
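      For anyone building such a working model, the storage side really is that small. A guess at a minimal schema (table and column names are hypothetical, not the original Tripwire schema):

          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect('dbi:mysql:database=syslog', 'tripwire', 'secret',
                                 { RaiseError => 1 });

          # Inclusions, exclusions, and the messages themselves.
          $dbh->do('CREATE TABLE suspect_words (word   VARCHAR(64)  PRIMARY KEY)');
          $dbh->do('CREATE TABLE exclusions    (phrase VARCHAR(255) PRIMARY KEY)');
          $dbh->do('CREATE TABLE messages (
                        id          BIGINT AUTO_INCREMENT PRIMARY KEY,
                        received_at DATETIME NOT NULL,
                        from_host   VARCHAR(255),
                        msg         TEXT
                    )');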
        Could you provide more details?

        This does sound like something that would be good for monitoring automated scripts and processes that now send emails where I work. Could you expand on how this system differs from Nagios and related tools? Nagios uses (perhaps completely custom) scripts and tools to provide a status, and I'm pretty sure it has the ability to store historical data in MySQL. Its default display also looks similar to your display board, with indicators of green/yellow/red. Understand, I'm not trying to be one of those people saying "why did you do this when you could have used X"; I'm trying to think about how your system differs, so that if I can get time to do an implementation at my own work, I don't end up recreating Nagios (badly).
