Source: Building microservices on Azure
Under these conditions, monitoring the availability of individual servers or application frontends is hardly enough to guarantee total availability. Your application’s frontend service may respond without issue even if a critical problem with another service renders the application incapable of functioning. Also complicating matters is the fact that, because service instances come and go constantly, a service that is available one minute on one endpoint may disappear the next — not because the service has failed, but simply because it has shifted to a new instance running somewhere else in the cluster.
In short, modern environments are so much more complex and dynamic, both in their architecture and in how quickly they change, that traditional availability monitoring can no longer keep pace with them.
Infrastructure is beyond your control
A second challenge is that many modern environments are hosted in infrastructure that SRE and ITOps teams don’t fully control. Your team might use serverless functions, for example, which provide users with control only over the serverless environment, leaving no ability to instrument data collection from the host operating system. You face similar challenges if you deploy containers via a managed Kubernetes service that abstracts your environment away from the host infrastructure. Even on comparatively simple cloud-based virtual machine instances, you lack access to the underlying bare-metal servers.
This abstraction from infrastructure makes it more difficult to check the status of the servers and networks on which your applications depend, which in turn increases the risk that users will discover issues before your own team does.
As a result, end-to-end monitoring of the application environments that you can control is more important than ever. It’s the only way to maximize your ability to find problems that, in some cases, originate in layers of your stack that you can’t directly control or access.
Services are beyond your control
Likewise, you may not fully control the services on which your applications depend. You may incorporate third-party APIs or services into your applications, for example. Without being able to control the environments from which those APIs and services originate, you can’t monitor them on the backend. This limitation also increases the stakes of end-to-end monitoring of what you can see within your own environment.
“Slow” is the new “down”
In the past, simple uptime monitoring formed the basis of availability monitoring: you checked whether an application was up or not, and then called it a day.
Today, however, users expect more than just applications that are up. They want applications that respond quickly. If an application is up, but takes 30 seconds or more to respond, it may as well not be available at all, as far as the user is concerned. Most users won’t wait 30 seconds; a majority will abandon apps after a delay of only three seconds.
This means that modern availability monitoring requires more than simply monitoring for availability in the narrow sense. You must also check for responsiveness: not just how fast a service responds to a request, but also how quickly new service instances can spin up in response to shifts in application demand.
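To make this concrete, here is a minimal sketch of a probe that treats a slow response as a failure in its own right rather than reporting only up/down. The endpoint URL, the three-second threshold and the status labels are illustrative assumptions, not part of any particular monitoring product:

```python
import time
import urllib.request

SLOW_THRESHOLD_S = 3.0  # illustrative: most users abandon after roughly three seconds

def classify(status, latency_s, slow_threshold_s=SLOW_THRESHOLD_S):
    """Map a check result to "down", "slow" or "ok" -- "slow" is the new "down"."""
    if status is None or not 200 <= status < 400:
        return "down"
    if latency_s > slow_threshold_s:
        return "slow"
    return "ok"

def probe(url, timeout=10.0):
    """Check availability *and* responsiveness of one endpoint."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status, time.monotonic() - start)
    except OSError:
        return classify(None, None)  # connection failure counts as down
```

In a real deployment you would run a probe like this against every service endpoint on a schedule and alert on both "down" and "slow" results, since from the user's perspective they are equivalent.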
Diversity of clients and conditions
Even if you have some data to show that your app responds quickly — say, within three seconds — that alone is not necessarily enough to guarantee a positive user experience.
That is because speed depends on the environment or the client that makes the request, and it is subject to a wide diversity of conditions. Perhaps you are monitoring a service from a desktop or even from inside your own data center, and you’re able to connect and get a response in two seconds. That’s great for you, but it doesn’t mean all of your users have the same experience. What is the response time for a battery-powered, thermally throttled mobile device connecting to the service over a congested, higher-latency mobile network? What about users who are geographically distant from your data center?
The ability to test for a variety of use cases and conditions is critical for ensuring that all users — not just users working under the perfect conditions — enjoy a quality experience.
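One simple way to reason about this is to evaluate the same server-side response time against a latency budget from the perspective of different client profiles. The sketch below is illustrative only: the profile names and overhead numbers are assumptions, and in practice you would replace them with measurements from synthetic checks run from real locations and device types:

```python
# Illustrative client profiles; the overhead numbers are assumptions,
# not measurements -- replace them with data from your own synthetic checks.
PROFILES = {
    "datacenter": 0.0,        # probing from inside your own network
    "home_broadband": 0.1,    # modest last-mile latency
    "congested_mobile": 1.5,  # throttled device on a high-latency network
}

def experienced_latency(server_time_s, profile):
    """Response time as the client experiences it, not as the server sees it."""
    return server_time_s + PROFILES[profile]

def meets_budget(server_time_s, profile, budget_s=3.0):
    """Does this client profile still see a response within the budget?"""
    return experienced_latency(server_time_s, profile) <= budget_s
```

Note how a two-second server response that looks comfortably fast from the data center blows the three-second budget once mobile-network overhead is added, which is exactly why monitoring from a single vantage point misleads.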
Optimizing user experience performance
Along similar lines, availability monitoring today means tracking what is known as user experience performance, which refers to the total experience that your site or application delivers to users. Monitoring for user experience performance requires monitoring every interface, request and workflow that users can initiate within your application, and ensuring that each responds adequately. From this perspective, too, simply checking the availability of core services does not suffice for delivering an optimal user experience.
Tips and best practices for modern availability monitoring
Faced with these challenges, SRE and ITOps teams must evolve their thinking about availability monitoring in a number of ways:
1. Think holistically
Because monitoring for availability today requires monitoring all application services and components at all times, teams must think holistically. Focusing only on key application elements or infrastructure doesn’t work. End-to-end coverage via automated tools that can monitor any type of infrastructure or service is a must.
What that means in practice is ensuring that every layer of infrastructure, every service and every endpoint within your application hosting stack is monitored continuously. You can’t just monitor the frontend or host servers. You need to track the health and performance of backend components as well. And, if you have multiple instances of the same application or service running — which you likely do in a containerized or serverless environment — you need to monitor every instance.
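A sketch of the per-instance view this implies: a service is only healthy if every one of its instances is healthy, and a service with no registered instances at all deserves attention too, since in a dynamic cluster it may have silently disappeared. The data shape (service name mapped to per-instance health flags) is a simplifying assumption for illustration:

```python
def unhealthy_services(fleet):
    """fleet maps service name -> {instance_id: is_healthy}.

    A service counts as unhealthy if ANY instance is failing, or if no
    instances are registered at all (it may have silently disappeared).
    """
    return sorted(
        name
        for name, instances in fleet.items()
        if not instances or not all(instances.values())
    )
```

In practice the `fleet` mapping would be populated from a service registry or orchestrator API rather than written by hand; the point is that the aggregation must cover every instance, not a sampled one.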
2. Think beyond conventional services
Instead of monitoring just servers or the applications themselves, teams need to track the availability and responsiveness of every type of resource in their environments. They must be able to monitor containers, serverless functions, orchestration engines, native and third-party APIs and more. They must be able to do this even as the configurations of these resources constantly shift. And they must also be able to measure how different configurations or conditions on the client’s end impact performance.
3. Think beyond uptime
As noted above, merely determining whether an application or service is “up” is not enough to address modern availability requirements. Optimizing user experience management requires tracing the responsiveness of all elements of your application environment, across all of the touchpoints or user journeys that your customers may undertake. You must identify the level of responsiveness that your users expect — which, as noted above, is generally a response time of three seconds or less — and ensure that every transaction by every user meets that goal.
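The phrase "every transaction" is the key shift: averages hide the slow outliers that individual users actually experience. A minimal sketch of a per-transaction report against a latency goal, assuming you already collect a list of transaction latencies from real-user or synthetic monitoring:

```python
def slo_report(latencies_s, budget_s=3.0):
    """Summarize whether EVERY transaction met the latency goal,
    not just whether the average did."""
    n = len(latencies_s)
    within = sum(1 for t in latencies_s if t <= budget_s)
    return {
        "total": n,
        "pct_within_budget": 100.0 * within / n if n else 100.0,
        "worst_s": max(latencies_s, default=0.0),
        "goal_met": within == n,
    }
```

Reporting the worst case alongside the percentage within budget keeps the focus on the users at the tail of the distribution, who are the first to abandon a slow application.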
4. Think about relationships
In microservices applications and highly distributed environments, your team can’t monitor or manage availability for services on an individual basis. You need to know when individual services, VMs, containers or other application components fail or respond slowly, but you must also be able to map the impact of one component’s failure on other components.
In other words, you must be able to map the relationships between components so that you understand how availability and performance trends “flow” across your environment: how a backend database failure will impact an interface frontend, for example, or how a slow-to-respond external API will impact your authentication modules. A stack of components that are perfectly healthy individually may not add up to a perfectly healthy user experience, due to issues like data validation or high network latency that can hamper interactions between components.
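The idea of availability "flowing" across a dependency map can be sketched with a small graph walk: given which services depend on which, a failure in one component implicates every component whose dependency chain reaches it. The service names and dependency map below are purely hypothetical examples:

```python
from collections import defaultdict

# Hypothetical dependency map for illustration: "service X depends on Y, Z".
DEPENDS_ON = {
    "frontend": ["auth", "catalog"],
    "auth": ["user_db"],
    "catalog": ["search", "product_db"],
}

def impacted_by(failed, depends_on):
    """Return every component whose dependency chain reaches the failed one."""
    dependents = defaultdict(set)  # invert edges: who depends on each component?
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(svc)
    impacted, stack = set(), [failed]
    while stack:
        for parent in dependents[stack.pop()]:
            if parent not in impacted:
                impacted.add(parent)
                stack.append(parent)
    return impacted
```

With this map, a failure in the backend database `user_db` immediately surfaces `auth` and `frontend` as at risk, which is exactly the kind of impact analysis that per-component checks alone cannot provide.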
The meaning and nature of availability has fundamentally changed. Whereas mere uptime monitoring of individual servers and applications was once enough to deliver a reasonably positive customer experience, today’s environments require a broader, deeper and more insightful understanding of your environment’s state.
Not only do you need to perform end-to-end monitoring that checks the availability of every facet of your environment, but you must also be able to evaluate responsiveness as well as mere uptime. And you need to map complex service relationships so that you understand how an availability problem in one part of your environment will impact the rest of your application. A critical tool for this is Splunk Digital Experience Monitoring, which helps drive great customer experience and business outcomes through RUM (field data), synthetics (lab data) and insight into web performance optimization best practices. Watch this video and read this whitepaper to learn more.
Today, Splunk Digital Experience Monitoring is part of the Splunk Observability Cloud, which provides the functionality you need to thrive in the face of modern availability monitoring challenges. By continuously ingesting data from across your environment (regardless of the types of applications you run or the infrastructure you use) and then using AI to separate signal from noise, Splunk Observability Cloud delivers actionable insights that help you find problems before they find your customers. Watch this demo video to see it in action, and learn why GigaOm ranks Splunk Observability Cloud as an Outperformer for “massive scalability, sophisticated in-stream analytics and native OpenTelemetry support” before signing up for a free trial today.