Application performance monitoring (APM) has been around for decades. In some cases, the metrics that teams track to perform APM remain the same as they were years ago. But other metrics have changed — and new ones have appeared — as application architectures and deployment patterns have evolved.
What’s more, monitoring modern applications effectively requires more than analyzing individual types of metrics in their own silos. The ability to correlate metrics of different types and analyze them through a single interface is critical for understanding complex behavior patterns. For example, you may need…
- Data about server CPU usage to determine the root cause of a high application error rate.
- Data about container start time to troubleshoot issues related to latency and request duration.
With that in mind, here’s a look at key metrics to monitor for modern APM in scale-out, cloud-native, and continuously delivered application environments.
Traditional APM Metrics
Let’s start with an overview of metrics that have always been important to monitor for virtually any type of application. These metrics are the bread and butter of APM, and, when paired with infrastructure metrics, they form the foundation of any monitoring strategy today.
Error rate
Error rate is a measure of how many application requests result in failures. What counts as a failure or error will depend on which type of application you’re running:
- A 404 response in a web app would be an error.
- An application instance or process that exits with an error code, or that logs a failure to answer a given request, would also be an error.
- In some languages, like Java, you may want to consider exceptions a form of error too.
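As a minimal sketch, error rate can be computed from a list of request outcomes. The status codes and the 4xx/5xx error rule below are illustrative assumptions; as noted above, what counts as an error depends on your application:

```python
# Sketch of computing an error rate from web request outcomes.
# Treating any 4xx or 5xx response as an error is an assumption;
# adjust is_error() to match your application's definition of failure.

def is_error(status_code: int) -> bool:
    """Treat any 4xx or 5xx HTTP response as a failure."""
    return status_code >= 400

def error_rate(status_codes: list[int]) -> float:
    """Fraction of requests that resulted in an error (0.0 to 1.0)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if is_error(code))
    return errors / len(status_codes)

# Example: 2 errors (404 and 500) out of 8 requests
codes = [200, 200, 404, 200, 500, 200, 200, 301]
print(f"{error_rate(codes):.1%}")  # 25.0%
```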
Duration (Response time)
Response time is a metric that tracks how long it takes applications to handle requests for resources. Those requests could be:
- Transactions initiated by end-users, such as a request to load a web page
- Internal requests made by one part of your application to another, such as one process or microservice requesting data from memory or disk
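As a sketch, per-request durations can be captured with a simple timer and then summarized with percentiles, since APM tools typically report p50/p95/p99 rather than plain averages. The nearest-rank percentile implementation and the sample durations below are illustrative assumptions:

```python
# Sketch of measuring and summarizing response times.
# percentile() is a simple nearest-rank implementation; real APM
# tools usually use histograms or sketches for efficiency.
import math
import time

def timed(fn, *args):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def percentile(durations, pct):
    """Nearest-rank percentile of a list of durations."""
    ordered = sorted(durations)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]

# Illustrative per-request durations in seconds
durations = [0.12, 0.10, 0.45, 0.11, 0.13, 0.95, 0.14, 0.12, 0.11, 0.13]
print("p95:", percentile(durations, 95))
```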
Uptime
Uptime measures the total amount of time, in the form of a percentage, that your application is available and responding normally.
Uptime is often a proxy for the overall health and responsiveness of your app. This metric is also important to monitor because it’s common for companies to guarantee certain uptimes in their contracts with customers (often noted under service level agreements or objectives), so you need to know if you are not delivering on what you promise.
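Converting downtime into an uptime percentage, and checking it against a contractual target, can be sketched as follows. The 99.9% SLA target and the downtime figure are illustrative assumptions:

```python
# Sketch of computing uptime percentage over a billing period and
# comparing it to an SLA target. The 99.9% target is illustrative.

def uptime_pct(total_seconds: float, downtime_seconds: float) -> float:
    """Percentage of the period during which the app was available."""
    return 100.0 * (total_seconds - downtime_seconds) / total_seconds

SECONDS_PER_30_DAYS = 30 * 24 * 60 * 60  # 2,592,000 seconds

# 43 minutes of downtime in a 30-day month:
pct = uptime_pct(SECONDS_PER_30_DAYS, 43 * 60)
print(f"{pct:.3f}% uptime; meets 99.9% SLA: {pct >= 99.9}")
```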
(Explore more golden signals of monitoring.)
Memory usage
Tracking how much memory (or, in cruder terms, RAM) your application uses is crucial for identifying memory leaks that could eventually cause a failure. It also helps ensure the app has sufficient memory to perform well.
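One way to act on memory data is a simple trend heuristic over resident-set-size (RSS) samples: steady growth across every sampling window suggests a possible leak, while a spike that recovers does not. The threshold and sample values below are illustrative assumptions:

```python
# Heuristic sketch for spotting a possible memory leak in a series
# of RSS samples (in MB). Sustained growth across every window is
# suspicious; a transient spike is not. Thresholds are illustrative.

def looks_like_leak(rss_samples_mb: list[float], min_growth_mb: float = 1.0) -> bool:
    """True if every successive sample grew by at least min_growth_mb."""
    return len(rss_samples_mb) >= 3 and all(
        later - earlier >= min_growth_mb
        for earlier, later in zip(rss_samples_mb, rss_samples_mb[1:])
    )

print(looks_like_leak([100, 110, 121, 133, 147]))  # steady growth -> True
print(looks_like_leak([100, 140, 105, 100, 102]))  # spike, then recovery -> False
```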
CPU usage
Sudden spikes in CPU usage could be a sign of an application performance problem. They could also reflect fluctuations in demand for the application, which may require you to add more application instances. As a general rule of thumb, you don’t want total CPU usage to exceed 70% for more than 30% of the total time that your application is running. If it does, you risk running out of CPU capacity.
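The rule of thumb above can be expressed directly in code: flag the application if more than 30% of CPU samples exceed 70% usage. The sample values are illustrative:

```python
# Sketch of checking the 70%-usage / 30%-of-time rule of thumb against
# a series of CPU utilization samples. The sample data is illustrative.

def cpu_rule_violated(cpu_samples_pct, usage_limit=70.0, time_fraction=0.30):
    """True if CPU exceeded usage_limit in more than time_fraction of samples."""
    over = sum(1 for sample in cpu_samples_pct if sample > usage_limit)
    return over / len(cpu_samples_pct) > time_fraction

samples = [55, 62, 80, 91, 74, 40, 85, 66, 72, 58]
print(cpu_rule_violated(samples))  # 5 of 10 samples over 70% -> True
```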
(Read more about server monitoring.)
Disk usage
An application that runs out of persistent storage will fail if it depends on persistent storage to do its job. That makes disk usage another important APM metric to track.
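A minimal free-space check using only Python’s standard library might look like this; the 10% warning threshold is an illustrative assumption, not a universal rule:

```python
# Sketch of a disk free-space check using the standard library.
# The 10% warning threshold is an illustrative assumption.
import shutil

def disk_free_fraction(path: str = "/") -> float:
    """Fraction of total disk capacity that is still free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total

frac = disk_free_fraction("/")
if frac < 0.10:
    print(f"WARNING: only {frac:.1%} of disk space remains")
else:
    print(f"OK: {frac:.1%} free")
```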
I/O rates
I/O refers to the rate at which applications read or write data. It’s most often used to track the performance of persistent storage media (like hard disks), but you could also track I/O rates for memory or software-defined storage systems (like virtual disks) if your application uses them.
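Because most OS-level I/O counters (for example, Linux’s /proc/diskstats) are cumulative, an I/O rate is typically the delta between two counter samples divided by the sampling interval. A minimal sketch, with illustrative numbers:

```python
# Sketch of turning cumulative byte counters into an I/O rate.
# The counter values and interval below are illustrative.

def io_rate_mb_per_s(bytes_before: int, bytes_after: int, interval_s: float) -> float:
    """MiB/s written or read between two cumulative counter samples."""
    return (bytes_after - bytes_before) / interval_s / (1024 * 1024)

# 512 MiB transferred over a 10-second window
rate = io_rate_mb_per_s(0, 512 * 1024 * 1024, 10.0)
print(f"{rate:.1f} MiB/s")
```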
Cloud-Native APM Metrics
Beyond traditional APM metrics, there are additional metrics that are often important to track in modern, cloud-based applications. Importantly, these are infrastructure metrics rather than application metrics, but they’re still critical for APM.
That is because, as noted above, being able to analyze these metrics alongside application metrics is critical for gaining full visibility into what is happening within your environments.
Container start time
Tracking how long your containers take to start (if you are using containers to host your applications) will alert you to potential problems with your container images, orchestration tools, or other resources that may cause containers to take longer than desired to launch. In general, it shouldn’t take a container more than a few seconds to start.
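One way to derive start latency is from the creation and start timestamps that container runtimes expose (for example, the Created and State.StartedAt fields in `docker inspect` output). The timestamp values below are illustrative:

```python
# Sketch of computing container start latency from runtime timestamps,
# such as the Created and State.StartedAt fields reported by
# `docker inspect`. The timestamps below are illustrative.
from datetime import datetime

def start_latency_seconds(created: str, started: str) -> float:
    """Seconds between container creation and the container starting."""
    fmt = "%Y-%m-%dT%H:%M:%S.%f%z"
    delta = datetime.strptime(started, fmt) - datetime.strptime(created, fmt)
    return delta.total_seconds()

latency = start_latency_seconds(
    "2023-05-01T12:00:00.000000+00:00",
    "2023-05-01T12:00:02.500000+00:00",
)
print(f"container started in {latency:.1f}s")  # 2.5s
```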
Kubernetes node availability
If you deploy applications to a Kubernetes cluster, knowing how many nodes are available and responding (as a percentage of the total nodes in the cluster) helps you identify problems with your infrastructure. Low node availability could also be a sign of network issues that are causing worker nodes to fail to communicate with the master node.
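A sketch of computing node availability from per-node status data, such as what `kubectl get nodes` reports. The node names and statuses below are illustrative assumptions:

```python
# Sketch of computing the percentage of cluster nodes reporting Ready,
# from status data like `kubectl get nodes` output. Node names and
# statuses are illustrative.

def node_availability(node_statuses: dict[str, str]) -> float:
    """Percentage of nodes whose status is 'Ready'."""
    ready = sum(1 for status in node_statuses.values() if status == "Ready")
    return 100.0 * ready / len(node_statuses)

nodes = {
    "worker-1": "Ready",
    "worker-2": "Ready",
    "worker-3": "NotReady",
    "worker-4": "Ready",
}
print(f"{node_availability(nodes):.0f}% of nodes available")  # 75%
```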
Cloud spend metrics
If you host applications in the public cloud, you probably want to avoid overspending by tracking your costs on an ongoing basis. The individual metrics to collect for this purpose will depend on which cloud services you use and how your workloads are provisioned within them.
In general, however, tracking metrics like these will help you achieve real-time visibility into cloud costs:
- The number of API calls your application makes to cloud storage services
- Total data egress rates
- The running time for cloud-based virtual machines (VMs)
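As a rough sketch, VM running time can be turned into an estimated spend figure. The instance types and hourly rates below are made-up assumptions, not real cloud prices:

```python
# Sketch of estimating VM spend from running time. The instance types
# and hourly rates are made-up assumptions, not real cloud prices.

HOURLY_RATE_USD = {"small": 0.05, "medium": 0.10, "large": 0.40}

def vm_cost(instance_type: str, running_hours: float) -> float:
    """Estimated cost of one VM for the given running time."""
    return HOURLY_RATE_USD[instance_type] * running_hours

# Illustrative fleet: (instance type, hours run this month)
fleet = [("small", 720), ("medium", 720), ("large", 100)]
total = sum(vm_cost(kind, hours) for kind, hours in fleet)
print(f"estimated monthly VM spend: ${total:.2f}")
```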
(Learn more about monitoring VMs.)
DevOps metrics
DevOps metrics, by which I mean metrics associated with DevOps delivery processes, are another sub-category of information to measure. Data such as how often you release new application versions, how long it takes to test, build, and deploy, and what your test success rate is will help ensure that you maintain a smooth application delivery pipeline.
These aren’t performance metrics, but they’re just as crucial if you want to optimize the end-user experience and avoid unexpected problems within your development and deployment workflow.
Observability is the next step
Today, APM requires measuring a variety of traditional application metrics. To maximize the value of APM workflows and tools, however, it helps to think beyond conventional APM: extend your monitoring strategy to include the categories of metrics that matter in cloud-native, DevOps-oriented environments, then correlate all of your data to gain end-to-end visibility into complex microservices environments.
Unified analytics tools save developers, site reliability engineers (SREs) and IT engineers from having to stitch different types of metrics together or make guesses about how different patterns may be related. This level of correlation of multiple types of application and infrastructure metrics is essential for fast, accurate root cause analysis and incident remediation.
Chris Tozzi has worked as a journalist and Linux systems administrator. He has particular interests in open source, agile infrastructure, and networking. He is Senior Editor of Content and a DevOps Analyst at Fixate IO. His book For Fun and Profit: A History of the Free and Open Source Software Revolution was published in 2017.
This posting does not necessarily represent Splunk's position, strategies or opinion.