The State of Availability Today: Availability Monitoring & Management

At first glance, availability monitoring may seem like one of the more mundane responsibilities of site reliability engineering (SRE) and IT operations teams. Determining whether an application is available may appear to be relatively straightforward, especially for teams that focus simply on monitoring certain transactions or services.

This may have been true in the past. When applications and infrastructure — and customer journeys — were comparatively simple, availability monitoring indeed boiled down to simple workflows like checking whether key application services were responding.

Today, however, the landscape surrounding availability monitoring has shifted significantly. SRE and IT teams are responsible for managing applications that may consist of hundreds of pages, all dynamically generated. Users and managers expect resolutions in minutes, not hours or days. (Availability is closely tied to the Four Golden Signals of monitoring, and it is the “A” in the CIA triad, a core InfoSec concept.)

Due to both the complex nature of modern application environments and the multiple customer touchpoints that applications typically entail, SRE and ITOps teams must rethink their approach to availability monitoring.

So, in this article, let’s discuss:

  • Conventional availability monitoring strategies and why they fall short of fulfilling business requirements for modern teams.
  • How today’s SREs and ITOps engineers can get the most out of the availability monitoring tools available to them by approaching availability as a discipline that requires holistic, end-to-end visibility into highly complex application environments.

The traditional approach to availability management

Traditionally, availability management amounted primarily to monitoring individual applications for uptime. Teams deployed tools to check whether the application responded to generic requests and then configured alerts to fire if it didn’t.
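A check of that kind can be sketched in a few lines. This is a minimal illustration, not any particular product's implementation, and the health-check URL used in the test is hypothetical:

```python
import urllib.request
import urllib.error

def is_up(url: str, timeout: float = 5.0) -> bool:
    """Classic uptime check: does the endpoint answer with a 2xx at all?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        # Connection refused, DNS failure, HTTP error or timeout all
        # count as "down" in this simple model.
        return False

# A traditional monitor runs this on a schedule (e.g. every 60 seconds)
# and fires an alert the moment it returns False.
```

Anything beyond this binary up/down answer, such as per-dependency checks or response-time budgets, was historically out of scope.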

This approach worked well in an era when applications were monolithic in nature and when they were hosted in relatively simple environments (like virtual machines or bare-metal servers) that did not rely on a complex web of services and dependencies. Under these circumstances, checking the application frontend for availability was sufficient for identifying downtime. You knew you had a problem when the webpage that every user had to access wouldn’t load or the server went down. Beyond that, there wasn’t much nuance to account for.

In addition, because user interactions with software were predictable, teams could rely on generic or sampled requests to gauge availability. End-to-end monitoring of every type of application request was less important.

Performance, too, was less of an issue. The main responsibility of IT and SRE teams was to keep things up, not to ensure that they were responding as quickly as possible.

(Understand on-premises application monitoring.)

Challenges of traditional availability management

Over the past half-decade, however, the shift from conventional software environments to cloud-native computing has significantly complicated the landscape surrounding availability monitoring. Conventional approaches fall short for several reasons.

Ephemeral, loosely coupled services

Gone are the days when SREs and IT engineers had to manage just monolithic applications running on bare-metal servers or VMs. Modern application environments are frequently composed of a complex web of microservices spread across a cluster of servers, which more likely than not are running in the cloud. Not only are these services highly distributed, but they also consist of multiple instances that are mapped to constantly changing network endpoints.

Under these conditions, monitoring the availability of individual servers or application frontends is hardly enough to guarantee total availability. Your application’s frontend service may respond without issue even if a critical problem with another service renders the application incapable of functioning. Also complicating matters is the fact that, because service instances come and go constantly, a service that is available one minute on one endpoint may disappear the next — not because the service has failed, but simply because it has shifted to a new instance running somewhere else in the cluster.

In short, modern environments are so much more complex, both in terms of their architecture and their dynamism, that traditional availability monitoring is hardly enough to accommodate them.

(Learn how microservices work.)

Infrastructure is beyond your control

A second challenge is that many modern environments are hosted in infrastructure that SRE and ITOps teams don’t fully control. Your team might use serverless functions, for example, which provide users with control only over the serverless environment, leaving no ability to instrument data collection from the host operating system. You face similar challenges if you deploy containers via a managed Kubernetes service that abstracts your environment away from the host infrastructure. Even on comparatively simple cloud-based virtual machine instances, you lack access to the underlying bare-metal servers.

This abstraction from infrastructure makes it more difficult to check the status of the servers and networks on which your applications depend, which in turn increases the risk that users will discover issues before your own team does.

As a result, end-to-end monitoring of the application environments that you can control is more important than ever. It’s the only way to maximize your ability to find problems that, in some cases, originate in layers of your stack that you can’t directly control or access.

Services are beyond your control

Likewise, you may not fully control the services on which your applications depend. You may incorporate third-party APIs or services into your applications, for example. Without being able to control the environments from which those APIs and services originate, you can’t monitor them on the backend. This limitation also increases the stakes of end-to-end monitoring of what you can see within your own environment.

“Slow” is the new “down”

In the past, simple uptime monitoring formed the basis of availability monitoring: you checked whether an application was up and called it a day.

Today, however, users expect more than just applications that are up. They want applications that respond quickly. If an application is up, but takes 30 seconds or more to respond, it may as well not be available at all, as far as the user is concerned. Most users won’t wait 30 seconds; a majority will abandon apps after a delay of only three seconds.

“We’re in the ‘moment business,’” says Yasaswi Pulavarti, VP of digital engineering and services at Papa Johns, the world’s third-largest pizza delivery company. “When customers order, they need speed at that moment. We’re constantly releasing new features, but resilience is at the center of it all so that our systems are always available to our customers and restaurants.”

This means that modern availability monitoring requires more than simply monitoring for availability in the narrow sense. You must also check for responsiveness, which means both:

  • How fast a service responds to a request
  • How quickly new service instances can spin up in response to shifts in application demand
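The first of those two checks amounts to timing the request rather than merely noting that it succeeded. A minimal sketch, assuming a hypothetical health-check URL:

```python
import time
import urllib.request
import urllib.error

def timed_check(url: str, timeout: float = 5.0) -> tuple[bool, float]:
    """Return (available, response_seconds): a slow answer is also a problem."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        ok = False
    return ok, time.monotonic() - start

# An endpoint that is technically "up" but takes 30 seconds to answer
# exceeds the timeout here and is reported as unavailable, which matches
# how users experience it.
```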

(See how synthetic monitoring gets you beyond slow.)

Diversity of clients and conditions

Even if you have some data to show that your app responds quickly, that alone is not necessarily enough to guarantee a positive user experience.

That is because speed is dependent on the environment or the client that makes the request, and it is subject to a wide diversity of conditions. Perhaps you are monitoring a service from a desktop or even from inside your own data center, and you’re able to connect and get a response in two seconds. That’s great for you, but it doesn’t mean all of your users have the same experience.

  • What is the response time for a battery-powered, thermally throttled mobile device connecting to the service over a congested, higher-latency mobile network?
  • What about users who are geographically distant from your data center?

The ability to test for a variety of use cases and conditions is critical for ensuring that all users — not just users working under perfect conditions — enjoy a quality experience.
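One way to make that concrete is to compare response times measured under each client condition against a single responsiveness budget. The profile names and timings below are illustrative assumptions; in practice, synthetic monitoring gathers these measurements from agents that actually run in each region and network type:

```python
def slow_profiles(measurements: dict[str, float], budget_s: float = 3.0) -> list[str]:
    """Given response times measured from different client profiles or
    locations, list the profiles whose users effectively see an
    unavailable application."""
    return sorted(name for name, seconds in measurements.items()
                  if seconds > budget_s)

# Hypothetical measurements gathered by probes running under each condition:
observed = {
    "in-datacenter": 0.4,     # your own vantage point looks fine
    "eu-broadband": 1.8,
    "congested-mobile": 4.2,  # these users have already given up
}
```

The point of the sketch: a single healthy measurement from inside your own network says nothing about the `congested-mobile` profile, which only a probe running under those conditions can reveal.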

Optimizing user experience performance

Along similar lines, availability monitoring today means tracking what is known as user experience performance: the total experience that your site or application delivers to users. Monitoring user experience performance requires monitoring every interface, request and workflow that users can initiate within your application, and ensuring that each responds adequately.

From this perspective, too, simply checking the availability of core services does not suffice for delivering an optimal user experience.

Best practices for availability monitoring

Faced with these challenges, SRE and ITOps teams must evolve their thinking about availability monitoring in several ways:

1. Think holistically

Because monitoring for availability today requires monitoring all application services and components at all times, teams must think holistically. Focusing on key application elements or infrastructure doesn’t work. End-to-end coverage via automated tools that can monitor any type of infrastructure or service is a must.

What that means in practice is ensuring that every layer of infrastructure, every service and every endpoint within your application hosting stack is monitored continuously. You can’t just monitor the frontend or host servers. You need to track the health and performance of backend components as well. And, if you have multiple instances of the same application or service running — which you likely do in a containerized or serverless environment — you need to monitor every instance.
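A sketch of that fan-out across instances follows. The probe function and instance names are hypothetical; in a real cluster, the instance list would come from your orchestrator's API rather than a static dict:

```python
from typing import Callable

def check_all_instances(
    instances: dict[str, str],          # instance name -> endpoint URL
    probe: Callable[[str], bool],       # returns True if the endpoint is healthy
) -> tuple[dict[str, bool], bool]:
    """Probe every instance endpoint; report per-instance health and
    whether the service as a whole is fully healthy."""
    results = {name: probe(url) for name, url in instances.items()}
    return results, all(results.values())
```

Because instances come and go, the `instances` dict must be refreshed on every monitoring cycle; a snapshot taken once at deploy time goes stale almost immediately.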

2. Think beyond conventional services

Instead of monitoring just servers or the applications themselves, teams need to track the availability and responsiveness of every type of resource in their environments. They must be able to monitor:

  • Containers
  • Serverless functions
  • Orchestration engines
  • Native and third-party APIs
  • And more

Teams must be able to do this even as the configurations of these resources constantly shift. And they must also be able to measure how different configurations or conditions on the client’s end impact performance.

3. Think beyond uptime

As noted above, merely determining whether an application or service is “up” is not enough to address modern availability requirements. Optimizing the user experience requires tracking the responsiveness of all elements of your application environment, across all of the touchpoints or user journeys that your customers may undertake.

You must identify the level of responsiveness that your users expect (generally a response time of three seconds or less) and ensure that every transaction by every user meets that goal.
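Tracked over many transactions, that goal becomes a compliance ratio you can alert on. A minimal sketch, with the three-second threshold as the assumed target:

```python
def slo_compliance(response_times_s: list[float], threshold_s: float = 3.0) -> float:
    """Fraction of observed transactions that met the responsiveness goal."""
    if not response_times_s:
        return 1.0  # no traffic observed, nothing violated
    met = sum(1 for t in response_times_s if t <= threshold_s)
    return met / len(response_times_s)

# A monitor would compute this over a rolling window and alert when the
# ratio drops below an agreed objective, e.g. 0.999.
```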

Peak or seasonal traffic is a great example: you often know in advance which days will bring far more traffic than normal. Tesco ensured a quick and seamless online ordering experience, even when over 20,000 people logged into Tesco’s waiting room at once during the 2020 holiday surge.

“Splunk played a critical role in making sure that we identified and resolved any glitches in our systems as quickly as possible so that we could ensure customers had their turkey on Christmas Day,” says Chirag Shah, head of technology, group monitoring for Tesco.

4. Think about relationships

In microservices applications and highly distributed environments, your team can’t monitor or manage availability for services on an individual basis. You need to know when individual services, VMs, containers or other application components fail or respond slowly — and you must map the impact of one component’s failure on other components.

In other words, you must have the ability to map the relationships between components so that you understand how availability and performance trends “flow” across your environment. For example:

  • How a backend database failure will impact an interface frontend
  • How a slow-to-respond external API will impact your authentication modules

A stack of components that are perfectly healthy individually may not add up to a perfectly healthy user experience, due to issues like data validation or high network latency that can hamper interactions between components.
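A dependency map supporting the kind of reasoning described above can be as simple as a graph you can walk in reverse. The service names here are hypothetical; in practice, such maps are usually derived automatically from distributed traces rather than maintained by hand:

```python
# Hypothetical service dependency map: each service -> what it calls.
DEPENDS_ON = {
    "frontend": ["auth", "catalog"],
    "auth": ["external-idp-api"],
    "catalog": ["db"],
}

def impacted_by(failed: str) -> set[str]:
    """Walk the map in reverse to find every component a failure can reach."""
    hit: set[str] = set()
    stack = [failed]
    while stack:
        node = stack.pop()
        for service, deps in DEPENDS_ON.items():
            if node in deps and service not in hit:
                hit.add(service)
                stack.append(service)
    return hit
```

In this toy map, a `db` outage surfaces as a `catalog` problem and then a `frontend` problem, even though both of those services are individually healthy.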

Availability means more

The meaning and nature of availability has fundamentally changed. Whereas mere uptime monitoring of individual servers and applications was once enough to deliver a reasonably positive customer experience, today’s environments require a broader, deeper and more insightful understanding of your environment’s state.

Not only do you need to perform end-to-end monitoring that checks the availability of every facet of your environment, but you must also be able to evaluate responsiveness as well as mere uptime. And you need to map complex service relationships so that you understand how an availability problem in one part of your environment will impact the rest of your application.

Splunk supports availability monitoring & management

Be prepared every day, anywhere around the globe, for upticks in traffic, and deliver the experiences customers need. See how Splunk supports your monitoring and observability practice.

The original version of this blog was published by Billy Hoffman. This posting does not necessarily represent Splunk's position, strategies or opinion.

Posted by Billy Hoffman

For over 15 years, Billy has spoken internationally at conferences and written two books on how to build fast and secure websites. While CTO at Rigor, Billy helped customers create strong performance cultures and understand the importance of performance to the business. Following Rigor's acquisition by Splunk, Billy focuses on improving and integrating the capabilities of Splunk's APM, RUM and Synthetics products.