Introduction

Last month I started this series, and we’ve actually covered quite a bit.  If this is your first viewing, I encourage you to take a look at the previous installments…

In Part 1 (link here) I discussed the first ingredient we used in our recipe for designing an effective Single Pane of Glass.  A quick recap:

  • For the purposes of this series, APM refers to Application Performance Monitoring, a segment of IT Performance Management (ITPM).  Most of the principles we’ll discuss here relate more to ITPM, though we’ll touch on plenty of APM topics as well.
  • The first ingredient of the four-part recipe is being “Data Agnostic,” meaning we needed to design our solution to import data from any ITPM-centric data source.

In Part 2 (link here) I discussed an important philosophy for problem solving: The Top-Down Approach.  A quick recap:

  • IT solutions are too complex and interconnected to focus on a single component unless you already know it is the source of the overall problem
  • A top-down approach lets a user look at the entire IT landscape and drill in wherever a potential problem is indicated, saving time and enhancing problem correlation, even between seemingly unrelated systems

In Part 3 (link here) I discussed the requirement that an effective single pane of glass be as clutter-free as possible.  A quick recap:

  • Most tools take you into the weeds too soon, before you have a good idea of what the problem is
  • Use a philosophy of less is more: show the user only what they need to see to decide whether they need to dig deeper.  Don’t overwhelm them!

While all of these are paramount to the success of a SPoG effort, if your tool doesn’t follow this last ingredient, it is useless no matter how amazing the rest of it may be…

Make a Dashboard that will Tattle on Itself

Crazy, right?  What vendor in their right mind would ever put in code to call itself out when something isn’t working quite right?  Well, we’ve never been accused of being in our right mind, so I guess that means we can get away with it.  :-)  But in all seriousness, there are two cardinal rules in designing a dashboard if it is ever to be a reliable source of information.

1. Never turn something RED unless it is clear that there is really a problem

Simple, right?  Don’t cry wolf.  Almost every vendor out there understands this.  If a tool is constantly screaming about a problem that doesn’t exist, the support personnel tasked with responding to issues will become deaf to the faulty alerting mechanism, even when something finally is happening.  But this pales in comparison to the second rule… a rule that almost nobody else in the industry adheres to… that is:

2. Never turn something GREEN unless it is clear that there really is NOT a problem

In my decade and a half of implementing my competitors’ solutions, I have yet to see a single dashboard (besides ours, of course) that follows this rule.  Most take the approach that “no news is good news!”  This is NOT ok.  At one client, I saw a green bulb indicating a system was running fine when I knew there was a major issue; in fact, the system was down.  When I looked into why the green light was telling me sweet lies, I found that the data feed going into the bulb had been broken for weeks.  Let me really let that sink in… this bulb was happy to keep saying everything was peachy when it hadn’t heard anything for certain in weeks.  Had that been a person reporting everything was good when it wasn’t… they’d have been fired.

I wish I could say that is the only story I have for that type of situation, but it is just the tip of the iceberg.

So how can anyone trust these tools?  Easy: you build in rules that say how often you’d expect to see new data, and alert someone when those thresholds are breached.  Take a look:

[Screenshot: UIC tattle1, the hover-over on a module name showing its calculation interval and stale threshold]

Hovering over the name of the module will show how often it should be calculated and how long before that data is considered “stale.”  What does it mean to have stale data?  It simply means that the data shouldn’t be trusted as current.  These variables are configurable depending on the type of data being analyzed.  Thus, End User Performance might have a shorter stale time (15 minutes), while a daily summary might have a stale time of, say, 24-48 hours.
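To make that concrete, here is a minimal sketch in Python of how per-module freshness rules might be expressed and checked.  The module names, field names, and thresholds are illustrative assumptions drawn from the examples above, not our actual implementation:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class FreshnessRule:
    calc_interval: timedelta  # how often the module is expected to be recalculated
    stale_after: timedelta    # beyond this age, the data must not be trusted as current

# Hypothetical per-module rules, mirroring the examples above.
RULES = {
    "End User Performance": FreshnessRule(timedelta(minutes=5), timedelta(minutes=15)),
    "Daily Summary":        FreshnessRule(timedelta(hours=24),  timedelta(hours=48)),
}

def is_stale(module: str, last_calculated: datetime, now: datetime) -> bool:
    """True when the module's data is too old to be trusted as current."""
    return now - last_calculated > RULES[module].stale_after
```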

All that is well and good, but unless the dashboard actively tells you that something isn’t reporting as planned, all that configuration is useless.  That is why the hover-overs on the bulbs themselves tell an even more detailed story:

[Screenshot: UIC tattle2, the hover-overs on a healthy bulb and on a stale “Bad Module” bulb]

In the well-behaved column, we can see that the last time EMR Front End was anything other than green (its current color) was 14 minutes ago.  The dashboard itself last checked with the server for updates almost 3 minutes ago, and the last time the server calculated the bulb’s status was a little over 3 minutes ago.  Therefore, the dashboard knows that since the bulb is only calculated every 5 minutes, there is no need to check again until it is due to be recalculated.
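That scheduling decision is simple enough to sketch, too.  A minimal example under the same assumptions (hypothetical names, not our actual code):

```python
from datetime import datetime, timedelta

def should_poll(last_calculated: datetime, calc_interval: timedelta,
                now: datetime) -> bool:
    """Poll the server only once the next scheduled calculation is due."""
    return now >= last_calculated + calc_interval

# A bulb recalculated every 5 minutes and last calculated ~3 minutes ago
# won't be polled again for roughly 2 more minutes.
```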

Now to the “Bad Module”… notice the red exclamation point next to the grey bulb.  Let’s break that down a bit… First, grey bulbs mean that the dashboard has no data from which to judge the health of the application from that module’s perspective.  That could be because the module deals with something Unix-specific and the application only runs on Windows… or it could be because no data is there to make a judgment call.  The red exclamation point indicates that the last time the server successfully calculated the value of the bulb was actually quite a while ago.  Almost 5.5 days, in this case.  This bulb actually had a clock on it when it was close to its threshold of being stale, and has now had the exclamation mark next to it for well over 5 days.  (Don’t worry, we made it do that on purpose… just for you!) :-)
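The marker drawn next to a bulb can be derived purely from those timestamps.  Here is a rough sketch of that decision; the warning window for the clock icon is an assumed parameter, not necessarily what our product uses:

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

class Marker(Enum):
    NONE = "none"          # data is current; the bulb shows its calculated color
    CLOCK = "clock"        # nearing the stale threshold
    EXCLAMATION = "bang"   # past the threshold: this bulb cannot be trusted

def staleness_marker(last_calculated: Optional[datetime], stale_after: timedelta,
                     now: datetime, warn_fraction: float = 0.8) -> Marker:
    """Pick the marker drawn beside a bulb, whatever color the bulb itself is."""
    if last_calculated is None:
        return Marker.EXCLAMATION          # never calculated: call it out loudly
    age = now - last_calculated
    if age > stale_after:
        return Marker.EXCLAMATION          # e.g. the "Bad Module", ~5.5 days old
    if age > stale_after * warn_fraction:
        return Marker.CLOCK                # close to the threshold: show a clock
    return Marker.NONE
```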

So what’s all that mean, really?  Simply put: the dashboard has no agenda.  Its job is to observe and report.  If it can’t do its job, it will tell you.  If it can, it ensures that the data it gives you is clear and correct.  That is how trust is built in a tool like this.  Without that trust, any tool is useless.

Bringing it all together…

When you mix together the philosophies of being data agnostic, taking a top-down approach to problem solving, designing without clutter, and keeping an honest-even-if-I’m-the-problem mentality, the technology certainly exists to make an exceptional single pane of glass for your ITPM initiatives.  We’ve spent a lot of time figuring out where others went wrong, and we did all we could to learn from the mistakes of our peers.  We’re confident that we’ve hit upon a philosophy and product that can deliver where others couldn’t.  As such, we’re the only APM or ITPM company prepared to offer a 100% money-back guarantee and a free proof of concept.  We love what we do, and we’d love to show you how this type of technology can actually work for you.

ABOUT THE AUTHOR
Matthew Bradford has been in the I.T. Performance Business for 15 years and has been critical to the success of many Fortune 500 Performance Management groups. He is currently the CTO of InsightETE, an I.T. Performance Management company specializing in passive monitoring and big data analytics with a focus on real business metrics.