Introduction

In Part 1 (link here) I discussed the first ingredient we used in our recipe for designing an effective Single Pane of Glass.  A quick recap:

  • For the purposes of this series, APM refers to Application Performance Monitoring, a segment of IT Performance Management (ITPM).  Most of the principles we’ll discuss here relate more to ITPM, though we’ll touch on a lot of APM topics as well.
  • The first ingredient of the four-part recipe is being “Data Agnostic”, meaning we needed to design our solution to import data from any ITPM-centric data source.
  • This design isn’t so much a function of splendid technical design, but was instead a philosophical choice made by InsightETE to better serve our customers.

So without further ado, the second ingredient in our recipe is…

Change your perspective from bottom-up to top-down

What do I mean by that?  Simple: put the end user experience first in line when looking at any issue.  Why?  That is also pretty simple: IT serves the business, and end users are a requirement of the business, so focus on their experience.  Thankfully, most major vendors have adopted this idea to some degree.  Unfortunately, most still keep it too deep in the weeds to be easy to understand.  Looking at a single application is almost never going to give you a clear view of how the overall enterprise is doing.  Most businesses have several places end users go.  Thus, in order to monitor the entire end user experience, you’ll need a view that encompasses everywhere end users interact with your business.  No doubt about it, that same tool should be able to drill down into the details, but the starting position needs to be high enough that one can see the entire landscape.

“I know it is here somewhere!” – That guy, probably

Consider the complexity of today’s IT solutions… when someone is too focused on a single piece of the overall picture, many key bits of information are lost.  Is one application slowing down another because the first is taking up too much bandwidth on the network?  Are two servers attached to the same SAN going down because the SAN is having issues?  Correlating seemingly unrelated systems can sometimes mean the difference between a speedy root cause and a wild goose chase that lasts for weeks.

Now consider a single application-service-based view that looks at all the applications at the same time.  Issues bubble up to the top layer, so you can instantly see the full impact of an issue and narrow in on the most likely root cause just from the top-level executive screen!  From there, drilling down into the detail confirms the suspicions, but now you’re drilling down with a great idea of what it is you’re looking for.  That is the power of a top-down approach.
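To make the "bubble up" idea concrete, here is a minimal sketch of a worst-status roll-up.  It is not how any particular product works; the Status levels, the Node class, and every system name in it are assumptions made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import IntEnum


class Status(IntEnum):
    # Hypothetical health levels; a higher value means worse health.
    OK = 0
    WARNING = 1
    CRITICAL = 2


@dataclass
class Node:
    # A node is anything in the view: the enterprise, a service, a server.
    name: str
    own_status: Status = Status.OK
    children: list[Node] = field(default_factory=list)

    def rolled_up(self) -> Status:
        """Worst status found in this node or anything beneath it."""
        return max([self.own_status] + [c.rolled_up() for c in self.children])

    def worst_path(self) -> list[str]:
        """Follow the worst child at each level toward a likely problem spot."""
        path = [self.name]
        if self.children:
            worst = max(self.children, key=lambda c: c.rolled_up())
            # Only descend if the child is worse than this node itself.
            if worst.rolled_up() > self.own_status:
                path += worst.worst_path()
        return path


# Made-up landscape: two services, one of which has a failing server.
enterprise = Node("Enterprise", children=[
    Node("Billing", children=[Node("billing-db")]),
    Node("Web portal", children=[Node("web-01", Status.CRITICAL), Node("web-02")]),
])

print(enterprise.rolled_up().name)  # CRITICAL
print(enterprise.worst_path())      # ['Enterprise', 'Web portal', 'web-01']
```

The top-level node shows the worst status anywhere beneath it, and the path gives a starting point for the drill-down rather than a confirmed root cause.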

So going back to our own dashboard… let’s take another look:

[Image: UIC avail]

Here we can see the EMR Front End tier is having some issues.  Using the top down approach we can see a few key items:

  1. Both the EMR Front End tier and the ProMedica system appear to have volume issues.
  2. There was a recent Service Availability disruption in the EMR Front End tier (indicated by the yellow bulb); there is no data relating to the Service Availability of ProMedica (indicated by the grey bulb).
  3. There is a problem with the active servers (CPU/Disk/Memory) of the EMR Front End tier and with the ProMedica system
  4. The successful transactions are not impacted from a performance perspective

From this, we can draw some correlations…

  1. Because I know that ProMedica is a subsystem on the server farm that runs the EMR Front End tier, I know that these problems are related.
  2. The volume differential most likely means volume is too high for the specified time period, because the CPU/Disk/Memory counters are going off and the dashboard shows there was recently an availability issue.
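The first correlation above leans on knowing which systems share infrastructure.  As a rough sketch of that reasoning, assuming a hand-maintained dependency map (the DEPENDS_ON table, the resource names, and the shared_dependencies helper are all hypothetical), it could look like this:

```python
from collections import defaultdict

# Hypothetical dependency map: service -> infrastructure it runs on.
# The resource names are invented for this example.
DEPENDS_ON = {
    "EMR Front End": {"emr-server-farm"},
    "ProMedica": {"emr-server-farm"},
    "Billing": {"billing-cluster"},
}


def shared_dependencies(alerting, depends_on):
    """Group alerting services by the infrastructure they have in common."""
    by_resource = defaultdict(set)
    for service in alerting:
        for resource in depends_on.get(service, ()):
            by_resource[resource].add(service)
    # Only resources shared by two or more alerting services suggest a link.
    return {r: s for r, s in by_resource.items() if len(s) > 1}


suspects = shared_dependencies(["EMR Front End", "ProMedica"], DEPENDS_ON)
print(suspects)
```

A shared resource is only a hint about where to look first; the drill-down still has to confirm it.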

As a result, I now know that I need to check the server farm running the EMR Front End tier for further issues, even though I can be pretty sure that this particular issue will clear up on its own now that the service availability issue is resolved.  That said, it may be worthwhile to forward some of the detail data on transaction volumes and CPU/Memory/Disk utilization to the capacity planning/management team to ensure that the environment is indeed fault tolerant.

Now, while I would have needed to dig a little deeper to confirm those conclusions… my journey would be very targeted, and much quicker than if I were just looking at any one of those metrics out of context.  That is the power of the true top-down approach!

Up Next…

Next week we’ll be talking about the 3rd ingredient in our recipe: “Get Rid of the Clutter!”  We’re pretty big fans of Occam’s Razor around here.  See you then!

ABOUT THE AUTHOR
Matthew Bradford has been in the I.T. Performance Business for 15 years and has been critical to the success of many Fortune 500 Performance Management groups. He is currently the CTO of InsightETE, an I.T. Performance Management company specializing in passive monitoring and big data analytics with a focus on real business metrics.