Introduction

Application monitoring is the act of using some sort of tool to ensure that an application is responding as expected. The results of those periodic checks are sent to a central location and any errors are raised to the appropriate application team to work on getting the issue resolved. This is done to ensure the highest possible level of service availability for the application and, ideally, catch errors before they impact the end users of the application.

Application monitoring can be done in a multitude of ways, from having a team of individuals manually check an application on some sort of schedule to programming devices to do the same thing. The types and uses of monitoring are quite diverse, and the proper implementation of the right monitoring techniques for modern applications is critical for businesses to stay competitive in an environment where 24×7 uptime means everything and downtime, no matter how fleeting, can potentially cost millions of dollars.

To this end, there are a few broad categories of monitoring in which any automated form of monitoring can be classified. Those categories are listed below…

Systems Monitoring

Systems monitoring is the most obvious form of monitoring to a technical engineer. It is the act of looking at the components of a system and reporting on their status. Metrics to capture might include:

  • Overall CPU utilization
    If CPU utilization is running too high, it can be an indicator of performance issues on the machine. However, a well-tuned system should have high CPU usage, so this alone is not a good indication of the load on a server.
  • Process-specific CPU utilization
    The concept of checking a specific process’ CPU load is mostly useful for detecting what is referred to as a “runaway process.” The idea here is that a bug in a computer program can put the application into an “infinite loop” or some other such issue and cause the CPU to be fully utilized trying to resolve an irresolvable problem in the computer program. This is generally a good indicator of a real issue.
  • Overall memory utilization
    This type of monitor sets a minimum threshold for available memory; if the available physical memory falls below it, the server may be over-taxed. At a certain point the server must start using virtual memory (using the disk drive as memory), which will have significant performance impacts.
  • Process-specific memory utilization
    Like the process-specific CPU monitor the idea behind this one is to ensure that the memory usage for a specific process doesn’t go above a certain level. More advanced versions of this monitor may even trend this data over time. The purpose is to identify what is called a “memory leak” in the application which will eventually cause the server to swap to disk and cause significant performance degradation.
  • Free disk space
    Disk space monitors typically just use various thresholds to indicate that a disk’s usage is critically high and there may no longer be enough space on the disk for standard operations.
  • Disk queue length
    Defining exactly what a “disk queue” is lies beyond the scope of this document, but the basic idea can be understood by imagining a line of people at a library waiting for the librarian to fetch a book for them. When the line gets too long, everything waiting on that book (data from the disk) is held up. This is one of the most common performance issues applications experience, especially on database servers.
  • Whether or not a process is running
    Obviously, if the program that is needed to process a user’s data isn’t running then the program can’t process the data. Sometimes a process may just crash (stop running due to some error) or quit for some other unexpected reason. This type of monitor will help to alert the proper people to ensure the process is restarted.
  • Log file monitoring
    For those unfamiliar with a log file, it is a file in which an application automatically records critical pieces of data for future use. This log typically contains references to major actions (such as a login to an application) and any errors which are caught and handled. Since this file is updated in near real time, it is possible for a monitor to look at the new entries in the log, pick out key words which indicate issues that need human attention, and alert based upon that data.
  • Anything that can be done from a command line on a system
    The flexibility of systems monitoring is limited only by the imagination of the engineer doing the work. Anything that can be done from a command line can be put into a monitor that is run on the server. The primary limitations are centered around how many resources the server will need to dedicate to monitoring itself and the ability of the monitoring engineer to distill the data into simple “good” or “bad” responses.
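Two of the monitors above, free disk space and log file monitoring, can be sketched in a few lines of Python. This is a minimal illustration rather than how any of the tools named in this paper work; the threshold, the keyword list, and the function names are all assumptions made for the example. Each check distills its data into the simple “good” or “bad” style of answer described above.

```python
import shutil

# Hypothetical values chosen only for illustration.
MIN_FREE_FRACTION = 0.10
ALERT_KEYWORDS = ("ERROR", "FATAL", "OutOfMemory")


def disk_status(path, min_free_fraction=MIN_FREE_FRACTION):
    """Return "good" or "bad" depending on the free space at `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return "good" if usage.free / usage.total >= min_free_fraction else "bad"


def scan_new_log_lines(log_path, offset, keywords=ALERT_KEYWORDS):
    """Read what was appended since `offset`; return (matches, new_offset)."""
    with open(log_path, "rb") as handle:
        handle.seek(offset)          # skip entries already inspected
        data = handle.read()
        new_offset = handle.tell()   # remember where this scan stopped
    matches = []
    for raw in data.splitlines():
        line = raw.decode("utf-8", errors="replace")
        if any(word in line for word in keywords):
            matches.append(line)
    return matches, new_offset
```

A scheduler would call `scan_new_log_lines` periodically, passing back the offset returned by the previous call so that only new entries are examined. This sketch does not handle log rotation.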

With all the flexibility of Systems (or Component) Monitoring, why would anyone need anything else? Well, the answer is two-fold:

  1. Correlating the data from all of those sources is time consuming, difficult, and many times unreliable. Each monitor is looking at a single piece of a much larger working system and, alone, is not a good indication of overall application performance or availability.
  2. This type of monitoring does not look at the external components on which the application may rely. There are many potential issues which would prevent an end user from getting to an application such as a faulty router in the network or even another server in the cluster. A misconfigured load balancer could cause issues for an end user which would never show up on systems monitoring as well.

Examples of Systems Monitoring Tools

  • HP OpenView/Operations Manager
  • SiteScope
  • IBM Tivoli Monitoring (ITM)

End User Experience Monitoring

End User Experience Monitoring is performed by looking at an application the same way a real end user would see it. For the purposes of monitoring the central servers for their availability and performance, the monitoring does not need to use the actual applications that a user would use. It is enough to send the same network traffic to the central servers that the end user’s application would send. Thus, it isn’t necessary to run Internet Explorer to test a web site; it is enough to send the same commands that Internet Explorer would send. This distinction is important to bear in mind when reading the following:

End User Experience Monitoring is not intended as a substitute for functionally testing an application’s front-end. It is unrealistic to expect a few probes to be able to test the front-end systems (i.e., code that is executed on the end user’s PC, not code that is executed on the central servers) because there are simply too many variables to account for. An end user’s PC is, by virtue of being run by an individual, a chaotic environment. A server farm is run for a specific purpose with all variables accounted for. That type of environment is the only one which will yield statistically relevant results.

There are two main categories of End User Experience Monitoring as described below:

Synthetic Transaction Monitoring

A Synthetic Transaction Monitor is a script which reproduces the server calls to an application and records the results. The idea behind it is to have a computer mimic what an end user does and execute its own business transactions on a system. Generally these scripts are made to perform read-only functions in an application because writing data automatically almost always has undesirable consequences.
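As a rough sketch of the idea, the Python below issues a single read-only HTTP GET, checks the response, and records the outcome and the response time. The function names, the result fields, and the notion of checking for an expected piece of text are assumptions made for this illustration; commercial tools script far richer multi-step transactions.

```python
import time
import urllib.request


def http_get(url, timeout=10):
    """Default fetcher: issue a read-only GET and return (status, body bytes)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read()


def run_probe(url, expected_text, fetch=http_get):
    """Run one synthetic transaction and return a result record."""
    started = time.monotonic()
    try:
        status, body = fetch(url)
        # A 200 alone is not enough: confirm the page holds the expected content.
        ok = status == 200 and expected_text.encode("utf-8") in body
        error = None
    except Exception as exc:  # any failure is recorded as "not available"
        ok, status, error = False, None, str(exc)
    elapsed = time.monotonic() - started
    return {"url": url, "ok": ok, "status": status,
            "seconds": elapsed, "error": error}
```

Running `run_probe` against the same URL on a fixed schedule, and forwarding each record to a central location, gives the consistent measurements this type of monitoring relies on. The `fetch` parameter exists only so the probe can be exercised without a live server.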

Advantages:

  • There is a consistent script running from a consistent set of locations at consistent times
  • It is known exactly what a script is doing, so if it runs into an issue, it can be easier to trace what is broken in the application
  • This makes apples to apples comparisons on the impact of code releases easy to gauge

Disadvantages:

  • Running the same transaction can produce unrealistically fast response times due to caching on the server side
  • There is additional load being put on the servers by the monitoring probes
  • This method is reliant on the application owners to accurately identify the most heavily used portions of their application
  • The depth of the probe is limited to read-only operations

Examples of Synthetic Transaction Monitoring Tools:

  • BPM (Business Process Monitoring using VuGen scripts)
  • SiteScope
  • IBM Tivoli Composite Application Manager (ITCAM)

Real Transaction Monitoring

Real Transaction Monitoring is a method which employs one or more technologies to look at actual transactions created by real users as they are being processed by the system. These technologies may utilize addons to virtual machines on which the application depends (such as Java or .Net) or may be a network sniffer looking at packets as they pass between the client and the server.
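Whatever technology captures the transactions, the result is a stream of observations that has to be summarized. The Python sketch below assumes a hypothetical record format of (transaction name, response time in seconds, succeeded) and groups the observations by transaction; it is not the data model of any product named in this paper.

```python
from collections import defaultdict


def summarize(transactions):
    """Group observed (name, seconds, succeeded) records by transaction name."""
    grouped = defaultdict(list)
    for name, seconds, succeeded in transactions:
        grouped[name].append((seconds, succeeded))

    summary = {}
    for name, samples in grouped.items():
        durations = [seconds for seconds, _ in samples]
        failures = sum(1 for _, succeeded in samples if not succeeded)
        summary[name] = {
            "count": len(samples),
            "avg_seconds": sum(durations) / len(durations),
            "max_seconds": max(durations),
            "failures": failures,
        }
    return summary
```

Note that a transaction which no user happened to run during the period simply does not appear in the summary at all.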

Advantages:

  • All transactions are monitored giving a full view of the entire application’s performance and availability
  • This type of monitoring gives the ability to view the performance of a more diverse set of transactions
  • The data can be used to generate detailed usage reports to identify confusing segments of a user interface, unintended shortcuts, and even potential security issues

Disadvantages:

  • It is impossible to know if an application or transaction is available if it is not being actively used
  • The sheer amount of data gathered can be overwhelming to sift through and can be expensive to store

Examples of Real Transaction Monitoring Tools:

  • HP Real User Monitor (RUM)
  • HP Diagnostics
  • InsightETE
  • Computer Associates Wily

Miscellaneous Distinctions

Active vs Passive Monitoring

An active monitoring solution involves some sort of program installed on the server to be monitored or some external device actively probing the server directly. These programs and probes do take some server resources to support. Generally, even on the most heavily monitored machines, the monitoring should never produce more than a 3% overhead.

Passive monitoring, on the other hand, involves some sort of monitoring which is completely detached from the server being monitored. This type of monitoring is best used in environments where server resources are already over-extended or legacy systems which are unable to support modern monitoring tools. Generally these types of tools will monitor network activity from a router or a special network device which intercepts network traffic for a specific server.

Agent vs Agentless Monitoring

An agent-based monitoring solution is one where a piece of software is installed, either on an external agent machine or on the server itself, to collect data. Essentially, the agent is any additional piece of software, outside of a central repository for the data, that needs to be installed to collect monitoring data.

Agentless monitoring involves using already existing hooks to gather the monitoring data needed. Examples of these hooks include:

  • Network file shares
  • SNMP Interfaces
  • Windows Management Instrumentation (WMI)

The agentless monitor will remotely probe those, and other, services and report on the results.
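Probing SNMP or WMI from Python requires third-party libraries, so the sketch below substitutes the simplest remote probe available in the standard library: checking whether a service accepts TCP connections. It illustrates the agentless principle (nothing is installed on the monitored server) rather than any of the specific hooks listed above, and the function name is an assumption.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or name lookup failed
        return False
```

A call such as `port_is_open("db01.example.com", 1433)` (a hypothetical host name) would be run from the monitoring server, never from the machine being monitored.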

In Conclusion…

Monitoring is the most delicate arm of the Application Performance Management framework. It requires one to strike a balance between gathering the critical data needed to get a good pulse on production operations and the absolute need to have as little impact on those operations as possible. However, combining the monitoring techniques listed in this paper and storing their results in a historical database for research and reference enables keen IT Operations Managers to predict with great accuracy how their systems will perform and to identify the weak points as they are failing, not after they have already failed.

About the Author

Matthew Bradford has been in the I.T. Performance Business for 13 years and has been critical to the success of many Fortune 500 Performance Management groups. He is currently the CTO of InsightETE, an I.T. Performance Management company specializing in passive monitoring and big data analytics with a focus on real business metrics.
