IT Performance Management is more than a single tool or technology. It is a set of disciplines that come together to drive down the cost of errors in IT environments. The idea is to predict the future with Performance Testing, keep a finger on the pulse of the production environment with Production Monitoring, and cover long-term goals with Capacity Planning. All of it is tied together by Reporting because, let's face it, all the data in the world is useless if we can't make heads or tails of it. So, let's take a look at the four main disciplines in a little more detail:
Performance Testing
(Direct connections: Capacity Planning, Monitoring)
Quite simply, testing an application before it hits production to ensure it can withstand the expected production load just makes good sense. Monitoring and performance testing have a unique relationship: when done properly, each should validate the other's findings. Performance testing helps catch memory leaks in applications and bottlenecks in the various processes of multi-component applications, and it forces us to truly understand how the application is used in the real world.
Going a step further, performance testing allows us to run saturation tests (stress tests, break-point tests, etc.) to see how much more load than expected the current hardware/software configuration can take. This data is critical to good capacity modeling: as application usage grows, we have a definitive upper limit to plot against.
In short, performance testing gives us an excellent view of the application before it goes live. It gives us the chance to catch production problems before they reach production, and the insight to see how far an application can grow before rework or new hardware is needed.
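A saturation test of the kind described above can be sketched in a few lines. This is a minimal illustration, not a real load-testing tool: `service_call` is a hypothetical stand-in for a request to the application under test, and the worker counts and latency percentile are illustrative assumptions.

```python
# Minimal saturation-test sketch: step concurrency up and watch the
# 95th-percentile latency to find the break point.
import concurrent.futures
import random
import time

def service_call():
    # Hypothetical stand-in for a real request to the application under test;
    # here we just simulate 10-30 ms of work.
    time.sleep(random.uniform(0.01, 0.03))

def run_load(workers, requests_per_worker=20):
    """Drive concurrent load and return the 95th-percentile latency in ms."""
    latencies = []

    def worker():
        for _ in range(requests_per_worker):
            start = time.perf_counter()
            service_call()
            latencies.append((time.perf_counter() - start) * 1000)

    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker) for _ in range(workers)]
        concurrent.futures.wait(futures)
    latencies.sort()
    return latencies[int(len(latencies) * 0.95)]

# Step the load up; the level at which p95 latency degrades past the SLA
# is the upper limit to plot against in capacity models.
for workers in (1, 5, 10):
    print(f"{workers} workers -> p95 {run_load(workers):.1f} ms")
```

In a real test the load levels would keep stepping up until the latency curve bends sharply; that knee is the saturation point.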
Monitoring
(Direct connections: Performance Testing, Capacity Planning, Reporting)
Monitoring looks at the application as it exists in the real world. Dozens of products claim to do unique types of monitoring, but when it comes down to it there are three main categories:
- End User Experience Monitoring – This includes HP's BPM and RUM, Compuware's APM line, and InsightETE's product line, among others. The idea is to capture how the end user perceives the availability and performance of the application. This can be done by generating synthetic transactions or by monitoring the application passively, either through an agent on the client's machine or through a packet-sniffing device that interprets network traffic into business processes. This is a great first-response monitoring category that can do a lot to identify that a problem exists, but it is ill-equipped to pinpoint the cause without correlation from other monitoring types.
- Systems Monitoring – This monitoring tests system-level components (also known as component monitoring). The tests typically originate on the server being monitored, with results sent to a central location. Basic monitoring in this category includes:
- CPU, Disk, Memory, and Network utilization metrics (analyzed in real time and stored on a central server for later reference)
- Process monitoring (Up/down, and CPU and memory usage)
- Log file monitoring (looking for errors, parsing response times, etc.)
- Application Internals Monitoring – This type of monitoring dives deep into JVMs (J2EE deep dive) and .NET runtimes to track the calls an application makes and the performance of those calls. It doesn't directly indicate end user experience, but it is an excellent tool for diagnosing bottlenecks within application code. Generally, this type of monitoring should also be set up in the Performance Testing environment to provide additional data when those tools detect issues.
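The log file monitoring bullet above can be made concrete with a short sketch. The log format, status-code convention, and 500 ms "slow" threshold here are all assumptions for illustration; a real collector would tail the file continuously and ship results to the central server.

```python
# Minimal log-monitoring sketch: extract error counts and response times
# from access-log style lines (format is an assumed example).
import re

# Assumed line shape: "<timestamp> <method> <path> <status> <millis>ms"
LINE_RE = re.compile(r'(?P<status>\d{3}) (?P<ms>\d+)ms$')

def scan_log(lines, slow_ms=500):
    """Return (error_count, slow_count, avg_ms) for one pass over the log."""
    errors = slow = total = count = 0
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip lines that don't match the expected format
        status, ms = int(m.group('status')), int(m.group('ms'))
        if status >= 500:
            errors += 1
        if ms > slow_ms:
            slow += 1
        total += ms
        count += 1
    return errors, slow, (total / count if count else 0.0)

sample = [
    '2024-01-01T10:00:00 GET /login 200 120ms',
    '2024-01-01T10:00:01 GET /report 200 950ms',
    '2024-01-01T10:00:02 POST /order 500 40ms',
]
print(scan_log(sample))  # (1, 1, 370.0)
```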
When used properly, these types of monitoring come together to give an excellent picture of how an application is currently performing. The data gathered can feed a dashboard or a self-service reporting site, and the raw data can be used to create very detailed capacity planning profiles. When implemented well, monitoring of this sort can also alert application teams to problems within their applications before they impact end users. Most companies today operate in a "reactive monitoring" mode; we can get to a proactive one if we're willing to make the proper investments. The ROI is substantial: with the right tooling, application outages can be reduced by as much as 70% or more.
Capacity Planning
(Direct connections: Reporting. Data feeds from: Monitoring, Performance Testing)
Performance Testing handles some of a large organization's capacity planning needs, but not all of them. It is the capacity planning aspect of performance testing that really saves companies so much money and makes the investment in the tools worthwhile. However, there is more to it. Taking statistical views of production data and projecting CPU usage, memory usage, and so on into the future, so that capacity limits can be predicted long before they are reached, is invaluable. Knowing which servers can afford to be virtualized and which need their own I/O bus is also critical to any virtualization effort. Coupled together, those two things mean extreme cost savings. Capacity planning is the cash cow, but it can't happen without the Performance Testing and Monitoring implementations that give it data to analyze.
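The projection described above is, at its simplest, a trend line fitted to monitored utilization and extended to a planning ceiling. The monthly CPU figures and the 80% ceiling below are illustrative assumptions; real capacity models would use far more data and account for seasonality.

```python
# Capacity-projection sketch: fit a linear trend to monthly CPU-utilization
# samples and estimate when the server crosses a planning ceiling.
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

months = [0, 1, 2, 3, 4, 5]
cpu_pct = [42.0, 45.5, 48.0, 52.5, 55.0, 58.5]  # assumed monthly avg CPU %

slope, intercept = linear_fit(months, cpu_pct)
ceiling = 80.0  # planning threshold, deliberately below 100%
months_to_limit = (ceiling - intercept) / slope
print(f"trend: {slope:.2f}% per month; "
      f"ceiling reached in ~{months_to_limit:.0f} months")
```

With an upper limit from saturation testing in place of the assumed 80% ceiling, the same arithmetic tells you how long the current hardware has left.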
Reporting
(Feeds from: Monitoring, Capacity Planning)
Reporting is the face of the Performance Management organization. It is where executives can go to get an overview of how their applications are performing, and where technicians can go to get the details of an alert or ticket they have received. It is where the correlation of data (both manual and automatic) happens. An open philosophy is critical to making this structure work: any employee should be able to see performance data for any application. This openness encourages communication and dispels the notion that an application problem impacts only the group that owns it. In short, reporting is what ties all of this together.
An organization that properly implements all of these teams with the proper methodologies can see productivity in its IT environment soar. Once people are on board with the idea that it is OK to have a problem so long as we can pinpoint and fix it, problems tend to vanish and real innovation can start to happen. Creating these groups, making them work closely together, and mandating that they be the gatekeepers for any production release are the keys to realizing the original promise of bringing IT into the workplace in the first place: the ability to rapidly innovate the business with technical solutions that just work.
About the Author
Matthew Bradford has been in the I.T. Performance Business for 13 years and has been critical to the success of many Fortune 500 Performance Management groups. He is currently the CTO of InsightETE, an I.T. Performance Management company specializing in passive monitoring and big data analytics with a focus on real business metrics.