Halogen: A Tool For Monitoring IBM pSeries LPAR and Hypervisor Metrics
by bpoag (Monk)
on Dec 31, 2013 at 17:49 UTC
This post is the third I've made recently regarding an effort where I work to get a better handle on systems monitoring. In this post, we'll discuss Halogen, an in-house tool we've developed in Perl to extract, parse, and make sense of hidden performance metrics from IBM pSeries machines.
If you've worked with IBM's pSeries line, you're probably familiar with the concept of LPARs (logical partitions). You're probably also familiar with the fact that IBM doesn't like people snooping around at the hypervisor layer without its blessing, let alone permission.
Halogen is a tool that allows you as an administrator to peek into the black box, and monitor LPAR and hypervisor metrics in near-realtime.
These days, LPAR virtualization requires a device called an HMC (Hardware Management Console) to manage it. The HMC is presented to the customer (you) as a walk-up 1U console that can optionally be accessed remotely via a web front-end. You might be surprised to learn that the HMC also gathers and stores a whole slew of different metrics behind the scenes for IBM, for the purposes of problem determination and analysis when things go wrong.
Lucky for us, the HMC accepts ssh connections, and allows users to authenticate using the same credentials used for logging into the web frontend. There's also a small set of commands available in the shell that aren't generally given to customers, 'lslparutil' and 'lshwres'. Between these two commands, we can automate the process of passively extracting and making sense out of performance metrics, live, as they are recorded.
It's just a guess on my part, but I suspect IBM doesn't really advertise the fact that the HMC collects and stores performance metrics because the metrics themselves are horribly confusing and obtuse. There are also quite a few of them; an individual sampling of one moment in time may return upwards of 50 different values. The only way we've been able to make use of these metrics is by careful examination of the data itself, to see how the numbers vary from moment to moment. I would call it reverse engineering if it weren't for the fact that at least the table columns are named, albeit poorly.
As I stated above, Halogen relies principally on two commands: lslparutil and lshwres. The first one dumps the performance metrics, while the second allows you to make sense of what you're seeing from a configuration standpoint. Between the two, it's also possible to do a little bit of data correlation, and thus build a picture of hypervisor-layer statistics. Halogen wakes up every 5 minutes, uses sshpass to collect the last 5 minutes' worth of metrics recorded on the HMC, and dumps the raw data into a MySQL database with column names that match the names given in the data. From there, the parts we're interested in are parsed.
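To give a flavor of the collection step, here is a minimal sketch (not Halogen's actual code) of turning one line of output into a hash and then into a parameterized INSERT whose columns mirror the names in the data. It assumes each line is a flat, comma-separated list of name=value pairs with no commas inside values; the field names in the usage note and the table name `lpar_samples` are made up for illustration.

```perl
use strict;
use warnings;

# Parse one line of comma-separated name=value pairs into a hash reference.
# Assumption: values never contain commas, so a plain split is enough.
sub parse_sample_line {
    my ($line) = @_;
    chomp $line;
    my %sample;
    for my $pair (split /,/, $line) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name && defined $value;
        $sample{$name} = $value;
    }
    return \%sample;
}

# Build an INSERT statement whose column names match the names in the data.
# Returns the SQL text followed by the bind values, ready for DBI's
# prepare/execute. Names that are not plain word characters are skipped,
# since they end up inside the SQL text itself.
sub build_insert {
    my ($table, $sample) = @_;
    my @cols     = grep { /^\w+$/ } sort keys %$sample;
    my $col_list = join ', ', map { "`$_`" } @cols;
    my $marks    = join ', ', ('?') x @cols;
    return ("INSERT INTO `$table` ($col_list) VALUES ($marks)",
            @{$sample}{@cols});
}
```

Something like `my ($sql, @bind) = build_insert('lpar_samples', parse_sample_line($line));` would then feed straight into `$dbh->prepare($sql)->execute(@bind)`. Sorting the column names just keeps the generated SQL stable from run to run.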
Part of the difficulty in making sense of the data pumped out by lslparutil is how IBM chose to record the values from sample to sample. First off, they're only dumped to disk on the HMC periodically (as best we can tell, roughly every 5 minutes, give or take), which imposes a limit on data granularity as well as on how fresh a given set of readings can be. To make matters worse, the values which end up being recorded are NOT recorded in terms of deltas accumulated since the last sample pass. They are more akin to meter readings, so in order to make sense of the data and get the deltas you're looking for, your script must either remember key points or look back at previous values within an output stream. Did I mention the contents of the stream may or may not be in chronological order, may or may not contain mentions of the resources you're looking for, and may not even be fully populated? :)
Figuring out pSeries metrics is a mess. I would liken it to figuring out how fast a car is going by measuring how much the odometer reading has changed in relation to the angle of the car's shadow on the road. You can figure out how much time has elapsed by using the car's shadow as a sundial, and use the odometer to measure distance. If you know the two of those, then you can divine the car's speed. Thankfully, IBM's "odometer" and "shadow angle" measurements are extremely precise, so you'll get good results in the end, but it's horribly clunky. I'm at a loss to explain why IBM does it this way.
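Here is a rough sketch of how a script might tame that stream before taking any deltas: drop partially populated samples, keep only the LPAR of interest, put what's left in chronological order, and pair each sample with its predecessor. This is an illustration rather than Halogen's code. The field names and the MM/DD/YYYY HH:MM:SS timestamp format are assumptions, so check them against what your own HMC emits.

```perl
use strict;
use warnings;

# Fields a sample must carry to be usable (names assumed for illustration).
my @REQUIRED = qw(time lpar_name capped_cycles uncapped_cycles entitled_cycles);

# Turn "MM/DD/YYYY HH:MM:SS" into a string that sorts chronologically.
# Returns undef if the timestamp is not in the assumed format.
sub sort_key {
    my ($stamp) = @_;
    my ($mon, $day, $year, $h, $m, $s) =
        $stamp =~ m{^(\d\d)/(\d\d)/(\d{4}) (\d\d):(\d\d):(\d\d)$}
        or return;
    return "$year$mon$day$h$m$s";
}

# Keep only fully populated samples for one LPAR, oldest first.
sub usable_samples {
    my ($lpar, @samples) = @_;
    my @keep;
    for my $s (@samples) {
        next if grep { !defined $s->{$_} || $s->{$_} eq '' } @REQUIRED;
        next unless $s->{lpar_name} eq $lpar;
        my $key = sort_key($s->{time});
        next unless defined $key;
        push @keep, [$key, $s];
    }
    return map { $_->[1] } sort { $a->[0] cmp $b->[0] } @keep;
}

# Pair each sample with the one before it, so counter deltas can be taken.
sub consecutive_pairs {
    my @ordered = @_;
    return map { [ $ordered[$_ - 1], $ordered[$_] ] } 1 .. $#ordered;
}
```

Each pair returned by `consecutive_pairs(usable_samples('Alpha', @samples))` is a "then" and a "now" meter reading, which is the shape the delta math needs.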
How clunky? Here's an example: the code snippet in Halogen that determines real processor utilization for a given LPAR. Doing so requires this sort of mathematical gymnastics:
In English, if you want to know how much CPU a given LPAR is chewing up at any given point in time, you need to take the change in the capped cycle count plus the change in the uncapped cycle count, each measured between now and the last viable reading of that counter, divide the sum by the change in the entitled cycle count over the same interval, and multiply by a hundred. :) Again, why IBM did it this way rather than simply dumping a value that reflects the impact the LPAR is having on the overall resource pool, I have no idea. But this is how we're able to divine what the picture looks like at the hypervisor layer; it's a matter of finding the right jigsaw puzzle pieces, and reconstructing what the data isn't being straightforward about.
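As an illustration of that arithmetic, here is a minimal sketch written from the English description above; it is not the original Halogen snippet. It assumes each sample is a hash with `capped_cycles`, `uncapped_cycles`, and `entitled_cycles` keys (names assumed for illustration), and that `$prev` is the last viable earlier reading for the same LPAR.

```perl
use strict;
use warnings;

# Percent of entitled cycles consumed between two meter readings of one LPAR.
# Returns undef when the pair cannot produce a sane answer.
sub cpu_utilization_pct {
    my ($prev, $curr) = @_;
    my %delta;
    for my $field (qw(capped_cycles uncapped_cycles entitled_cycles)) {
        for my $sample ($prev, $curr) {
            # A missing or empty counter makes the pair unusable.
            return unless defined $sample->{$field} && length $sample->{$field};
        }
        $delta{$field} = $curr->{$field} - $prev->{$field};
        return if $delta{$field} < 0;    # counter went backwards
    }
    return if $delta{entitled_cycles} == 0;    # avoid dividing by zero

    return 100 * ($delta{capped_cycles} + $delta{uncapped_cycles})
               / $delta{entitled_cycles};
}
```

For example, if the capped counter advances by 3000, the uncapped counter by 500, and the entitled counter by 5000 between two readings, the result is 70. Because the figure is relative to the entitled cycles, it may well exceed 100 for an LPAR that is allowed to consume beyond its entitlement.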
To handle the flood of data, we basically dump every line into a MySQL database, both to make querying the data simpler (why have Perl do the work of parsing when you can offload the work to the SQL server?) and to give us historical metrics we can look back on to determine growth patterns for planning purposes.
Despite it all, it is nonetheless possible to build a script in Perl intelligent enough to parse through the mountain of data being supplied by lslparutil, within the context of lshwres, and recreate the data you need. I'd imagine that lpar2rrd does this same trick to some extent, but what lpar2rrd lacks, Halogen makes up for. Once a sane and clear picture of system performance can be obtained, it's possible to do reporting. And with reporting, alerting.
For example, let's say we know from lshwres that four LPARs, Alpha, Beta, Gamma and Delta, are on a given pSeries box. By parsing their real CPU usage metrics out of lslparutil and combining them, we can infer the overall CPU load the LPARs are placing on the cores. That load is a valuable piece of information we can graph over time to see everything from whether performance drag is being caused by CPU pool depletion to whether we need to buy new hardware to handle future demand. Same goes for alerting. In our particular setup, Halogen alerts us if a given pSeries box has greater than 95% CPU utilization for more than 5 minutes. This gives us a heads-up when our customers may begin seeing performance degrade, so we can offload one or more LPARs to more idle servers to free up resources.
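The alerting rule can be sketched along these lines. This is a guess at the shape of such a check, not Halogen's implementation: it assumes the readings arrive as `[epoch_seconds, percent]` pairs in chronological order, and it treats a run of over-threshold readings spanning at least the given number of seconds as a sustained breach.

```perl
use strict;
use warnings;

# True if utilization stays above $threshold across consecutive readings
# that span at least $min_seconds. Any reading at or below the threshold
# resets the clock.
sub sustained_breach {
    my ($readings, $threshold, $min_seconds) = @_;
    my $breach_start;
    for my $r (@$readings) {
        my ($epoch, $pct) = @$r;
        if ($pct > $threshold) {
            $breach_start //= $epoch;    # remember when this run began
            return 1 if $epoch - $breach_start >= $min_seconds;
        }
        else {
            undef $breach_start;
        }
    }
    return 0;
}
```

With samples landing roughly every 5 minutes, `sustained_breach(\@readings, 95, 300)` fires once two consecutive readings are both above 95%. Using `>=` rather than `>` for the time comparison is a deliberate choice here, since a strict comparison would need a third reading before alerting.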
We've also built a front-end to Halogen that allows us to view metrics in terms of groups: all of our VIO servers, for example, are in one view, all of our database servers in another, and all of our app servers in another. That way we can keep an eye on multiple systems that have a shared impact.
At some point in the next few months, we're going to explore the possibility of having Halogen automate the process of dynamically moving LPARs around via Partition Mobility to quieter systems in the manner mentioned above. That is, if Halogen sees a pSeries box being overburdened for too long, it will attempt to mitigate the issue by PM'ing the LPAR somewhere else, continually keeping all of our pSeries boxes at roughly equal utilization. Here's how it looks:
I keep this panel up during work hours, just to keep tabs on what's going on globally. It's nice to be able to have answers on-hand when someone comes by asking if a given server or application seems slow. It also allows us to call BS on vendors who claim our systems aren't keeping up with the demands of their products. Perhaps most importantly, we have eyes where we did not have them before, and can administer all of our systems in a more intelligent fashion.
Bowie J. Poag