It’s 5:45am and your phone is going off. It’s another text message from that monitoring tool that just won’t quit waking you up. You know this happens every day at the exact same time, but you have to get up and check on it regardless. What if this time something is ACTUALLY wrong? If the page had come in at 3:00am, there would be no need to worry; it would be hours before the complaining application was even used. But at 5:45am, it is just cutting it too close. People start logging in at 6:30am!! So you wake up, log in, and see the same thing you see Every. Single. Day. The backup reached the point where it is wrapping things up, the CPU load average is higher than the threshold, and so the monitoring tool thinks the sky is falling, but… of course… nothing is wrong.

Well, 5 years ago this was the only way things could be. Sure, some tools allowed you to set specific thresholds for specific servers. Even fewer allowed you to set up thresholds based on different times of day. But really, let’s be honest with ourselves… who on Earth is going to spend the time to configure each one of those 5,000 servers you monitor with custom thresholds? So, 5 years ago you had to choose:

  • Spend way too much time configuring all 5,000 servers, only to start all over once you’re done because the specific needs of 30% of them changed in the time it took to configure them all once.
  • Set your thresholds for the worst-case scenario. Unfortunately, this eliminates all possibility of an early warning and forces you to be in firefighting mode all the time.
  • Leave your thresholds on the sensitive side, but lose sleep because the system will continue to shoot out false positives (cry wolf!).
Those options aren’t great. And that is why it is so important to have a solid system that can handle statistical analysis and detailed baselines for your metrics. Now before you run off and buy the first thing that sports those shiny buzzwords… let’s dig down into what you really need:

1. Variable Length Baseline Collection

First, a baseline is simply a collection of metric values that are assumed to be “normal,” which can then be used to compare how the metric is doing against its “normal” behavior. In statistics, sample size is extremely important: if the baseline is made up of only a small segment of data, the baseline is useless, so make sure it contains a sufficient amount of data. The other extreme is a problem too. If you have too much data in your baseline and you’re not retiring or phasing out old data, then those “new normals” get lost as applications and servers are intentionally changed.
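The idea above can be sketched with a fixed-capacity window: old samples retire automatically as new ones arrive, and the baseline refuses to answer until the sample size is respectable. This is a minimal illustration; the names and sizes (`BASELINE_SIZE`, `MIN_SAMPLES`) are made up, not from any particular tool.

```python
from collections import deque
from statistics import mean

BASELINE_SIZE = 2016   # e.g. one week of 5-minute samples (illustrative)
MIN_SAMPLES = 288      # require at least one day of data before trusting it

# A bounded deque drops the oldest sample automatically when full,
# so old data is "phased out" without any extra bookkeeping.
baseline = deque(maxlen=BASELINE_SIZE)

def record(value):
    baseline.append(value)

def baseline_mean():
    """Return the baseline average, or None until the sample size is healthy."""
    if len(baseline) < MIN_SAMPLES:
        return None  # too small a sample -- a baseline built on it is useless
    return mean(baseline)

for v in range(300):
    record(float(v))
print(baseline_mean())  # → 149.5
```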

2. Ensure Baselines Match Your Business

Often a Monday at 9am looks very different from a Saturday at 9am. Pushing 7 days of data into a single baseline is likely not going to be of any real value. You need to know how your application behaves on a Tuesday at 5:00am, not an overall view of how it behaved in the past week. Do you do month end processing? Is the last day of the month crazy? The last Friday? The first Monday? Make sure you can capture those key differences in your baselines.
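One simple way to capture those differences is to key each baseline on a (weekday, hour) bucket, so a Saturday 9am sample is only ever compared to other Saturday 9am samples. A minimal sketch, with made-up sample values; for month-end processing you could extend the key with a day-of-month or last-business-day flag:

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# One baseline per (weekday, hour) bucket.
buckets = defaultdict(list)

def record(ts: datetime, value: float):
    buckets[(ts.weekday(), ts.hour)].append(value)

def expected(ts: datetime):
    """Baseline average for this timestamp's bucket, or None if no history."""
    samples = buckets[(ts.weekday(), ts.hour)]
    return mean(samples) if samples else None

record(datetime(2024, 1, 1, 9), 1.2)   # Monday 9am
record(datetime(2024, 1, 8, 9), 1.4)   # the next Monday 9am
record(datetime(2024, 1, 6, 9), 0.3)   # Saturday 9am

print(expected(datetime(2024, 1, 15, 9)))  # Monday 9am baseline ≈ 1.3
print(expected(datetime(2024, 1, 13, 9)))  # Saturday 9am baseline ≈ 0.3
```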

3. Averages Only Tell a Piece of the Story

Most tools and people just rely on averages. They are easy to understand and do indeed hold a ton of value. They are an excellent way to summarize an incredible amount of data into an easy to understand number. However, consider the following:

The performance of an application averaged 1.2 seconds per page load for an hour. Customers were happy and productive. Life was good. Then the next hour, all hell broke loose. Calls were coming in left and right complaining of slow response. The average for the hour from hell? 1.18 seconds. How could this possibly be? The monitoring must be broken, the entire investment must be for naught! Or… perhaps… the average is being misleading.

Standard deviation can help save the day in those situations. It tells you not only what the average was, but also how consistent an experience was delivered. Those same stats with standard deviation might look very different.

Hour 1: Average 1.2, Std Dev 0.2 — This means most values (roughly 68%, assuming a roughly normal distribution) fell between 1.0 and 1.4.

Hour 2: Average 1.18, Std Dev 15.6 — This means that within the lower average, values ranged from near instant to 16.78 seconds! Something is clearly wrong here.
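Here is that scenario in miniature, using Python’s `statistics` module with made-up page-load times: the two hours average almost exactly the same, but the standard deviation exposes the hour from hell.

```python
from statistics import mean, pstdev

# Hypothetical page-load times (seconds) for two hours of traffic.
hour1 = [1.0, 1.1, 1.2, 1.3, 1.4]    # consistent experience
hour2 = [0.05, 0.05, 0.1, 0.5, 5.2]  # nearly the same average, wild spread

print(mean(hour1), pstdev(hour1))  # avg 1.2, small std dev (~0.14)
print(mean(hour2), pstdev(hour2))  # avg 1.18, huge std dev (~2.0)
```

The averages alone would tell you both hours were fine; the standard deviation is what flags hour 2 as broken.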

4. Give Your Statistics a Retirement Plan

We recommend using a long history of data (6 to 18 months) but weighting the averages in favor of the newest data. That way, new data describing a “new normal” can take over the baselines more quickly, but the historical influence is still felt. It is a nice balance between staying current and nimble and keeping your sample size healthy.
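One common way to implement that kind of age-weighting is an exponentially weighted moving average (EWMA), where a smoothing factor controls how quickly old data fades without ever fully disappearing. The alpha below is an arbitrary illustration, not a recommendation:

```python
def ewma(samples, alpha=0.1):
    """Newest samples carry the most weight; older ones fade, never vanish."""
    avg = samples[0]
    for x in samples[1:]:
        avg = alpha * x + (1 - alpha) * avg
    return avg

old_normal = [1.0] * 50  # long history of ~1-second responses
new_normal = [2.0] * 20  # a deploy changed behavior: now ~2 seconds

# A plain average would sit near 1.3; the EWMA has mostly
# adopted the "new normal" while retaining some history.
print(ewma(old_normal + new_normal))  # ≈ 1.88
```

With a larger alpha the baseline adapts faster but forgets history sooner; tuning it is the trade-off between staying current and keeping the sample size healthy.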

So what does all this get me?

It gives you the ability to have a set-it-and-forget-it attitude about the thresholds in your monitoring implementation. Over time, each server/application/whatever will alert based on its own metric history and be the unique snowflake you know it is. The alerts will be about anomalies within your environment. In short, more of your alerts will be actionable and you’ll only be woken up when something is actually wrong. All of this without having to spend tons of time setting everything up individually. So let’s review:

  • More sleep
  • Less configuration
  • Less frustration
  • More time to do the cool stuff
  • Sanity restored!

Matthew Bradford has been in the I.T. Performance Business for 15 years and has been critical to the success of many Fortune 500 Performance Management groups. He is currently the CTO of InsightETE, an I.T. Performance Management company specializing in passive monitoring and big data analytics with a focus on real business metrics.