How Real User Monitoring Optimizes the Troubleshooting Process

By Collin Chau

Customers today are increasingly intolerant of delays in the fulfillment of their requests. If your system is slow, it may as well not be up at all: as far as the customer is concerned, slow is the new down.

To address this challenge, site reliability engineers (SREs), IT Ops teams and developers are adding real user monitoring (RUM) to their toolsets to gain insights into the true customer experience in a production setting. RUM solutions provide data about how actual users are experiencing applications, thereby allowing teams to identify and fix performance issues that they could not detect via other means.

Splunk RUM, part of Splunk’s end-to-end observability suite, provides these insights. It empowers stakeholders across the entire software delivery lifecycle (SDLC) to understand the impact that incidents and changes may have on customers.

The challenge of RUM

RUM works by collecting data about transactions initiated by actual users. Thus, RUM provides visibility into the actual state of production in a way that synthetic monitoring cannot.

Yet using RUM data to understand what is happening within an application environment and translating those insights into customer experience optimizations is challenging, especially in today’s distributed, cloud-native systems. In an environment where application instances are constantly starting and stopping, network mappings are changing and services are scaling up and down, it’s not always obvious how to map data about individual real-user experiences to the broader customer experience.

To leverage RUM effectively, then, developers and other stakeholders in the SDLC must be able to map the data from RUM tools to the dynamic microservices running in their environment. Doing so requires overcoming several challenges.

A modern RUM strategy

A modern RUM strategy is founded upon several key features that enable SREs, IT Ops engineers and developers to make sense of RUM data even in complex, cloud-native environments.

Complete, unsampled data offers full visibility

Conventional RUM strategies typically relied on sampled data: they traced only a subset of transactions and extrapolated from that subset in an attempt to build a full picture of what was happening inside an application environment.

In a cloud-native world, sampled data is unreliable. To gain complete visibility, developers and SREs need to be able to trace and analyze every layer of every transaction across every service instance. Without this comprehensive coverage, teams are at risk of failing to detect outlying problems, such as those experienced by only a subset of users or ones associated with a specific node or container inside a much larger cluster.

Unsampled, end-to-end traces also provide engineers with the ability to understand the relationships between frontend and backend problems, and to evaluate how an issue deep within the application — like a failing database or authorization service — impacts users on the frontend.
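A quick back-of-the-envelope calculation shows why sampling is risky. The sketch below is illustrative only (not Splunk code, and the sample rate and transaction counts are made-up numbers): at a typical head-based sample rate, an issue affecting only a handful of transactions will usually leave no trace at all.

```python
# Illustrative sketch: probability that head-based sampling records
# zero examples of a rare issue. Numbers are hypothetical.
def p_missed(sample_rate: float, affected: int) -> float:
    """Chance that none of `affected` failing transactions are sampled."""
    return (1 - sample_rate) ** affected

# An issue hitting 50 transactions, traced at a 1% sample rate:
print(round(p_missed(0.01, 50), 3))  # 0.605 -- usually invisible
print(p_missed(1.0, 50))             # 0.0 -- unsampled data always sees it
```

With a 1% sample rate, an outage affecting 50 transactions goes completely unrecorded about 60% of the time, which is exactly the kind of outlying problem unsampled tracing is designed to catch.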

Core Web Vitals deliver actionable data

To make RUM metrics as actionable as possible, engineers should adopt tools that make it easy to collect and analyze the most important data produced by RUM. It’s helpful to think in terms of the three Core Web Vitals:

  • Largest contentful paint (LCP)
  • First input delay (FID)
  • Cumulative layout shift (CLS)

In complex environments, these metrics help to surface and identify the root cause of performance problems. LCP measures how long the largest content element in the viewport takes to render, a strong proxy for how quickly a page feels usable. CLS quantifies unexpected layout shifts, helping engineers understand how stable an application feels as end users navigate through it. FID measures the delay between a user’s first interaction and the browser’s response, helping pinpoint responsiveness issues that may be caused by long-running frontend scripts or slow backend calls.
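To make one of these metrics concrete, here is a simplified sketch of how a CLS score accumulates from the browser’s layout-shift entries. The entries below are stand-in dicts for real `PerformanceObserver` data, and the spec’s session-window capping is omitted for brevity:

```python
# Simplified sketch of the CLS calculation over layout-shift entries.
# Real field data comes from PerformanceObserver('layout-shift') in the
# browser; these dicts are illustrative stand-ins.
def cumulative_layout_shift(entries: list[dict]) -> float:
    """Sum layout-shift scores, skipping shifts caused by user input."""
    return sum(e["value"] for e in entries if not e["hadRecentInput"])

shifts = [
    {"value": 0.02, "hadRecentInput": False},  # ad banner loads late
    {"value": 0.15, "hadRecentInput": True},   # user-triggered: excluded
    {"value": 0.08, "hadRecentInput": False},  # web font swap
]
print(round(cumulative_layout_shift(shifts), 2))  # 0.1
```

Note that user-initiated shifts are excluded by design: only unexpected movement counts against the score.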

Focusing on metrics like these gives developers and SREs the ability to home in on the cause of problems. That beats merely detecting problems, which is what happens when teams track more primitive types of metrics, like overall page load time.

Focus on SLO targets

It’s easy for engineers to become absorbed in monitoring real-user transactions for monitoring’s own sake, without asking what the data actually says about the user experience. To escape that mindset, developers and SREs should strive for RUM strategies that focus on assessing how well their applications meet user-centric SLO goals.

User-centric SLOs are guarantees that the business makes to users about factors such as page load time and availability. Even as engineers track complex “Core Web Vitals” and other metrics that allow them to understand how traces flow across their application, they should be sure as well that they are meeting the basic SLO promises made to end users. Ultimately, adhering to these promises is what matters most for optimizing the customer experience.
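As a minimal sketch of what this looks like in practice (the threshold, target, and timing data below are illustrative, not a real SLO definition), checking a user-centric latency SLO against real-user page load timings comes down to a simple ratio:

```python
# Hedged sketch: checking a user-centric SLO such as "99% of page loads
# complete in under 2.5 s" against real-user timings. All numbers are
# illustrative.
def slo_attainment(load_times_s: list[float], threshold_s: float) -> float:
    """Fraction of real-user page loads meeting the latency target."""
    good = sum(1 for t in load_times_s if t <= threshold_s)
    return good / len(load_times_s)

samples = [1.2, 0.9, 2.4, 3.1, 1.8, 0.7, 2.6, 1.1, 1.5, 2.0]
attained = slo_attainment(samples, threshold_s=2.5)
print(f"{attained:.0%} of loads met the SLO")  # 80% -- short of a 99% target
```

Tracking this number over time, rather than raw transaction counts, keeps the monitoring effort anchored to the promise actually made to users.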

Map issues to your runtime architecture

As noted above, in order to provide actionable insights and facilitate a more positive customer experience, RUM must do more than expose surface-level problems. It must be able to map performance problems that are evident on the frontend to the backend services that are causing those issues.

Toward this end, RUM solutions should make it easy to map data from user transactions to business KPIs, such as user retention goals and uptime guarantees. By visualizing transaction trends and drilling down into specific traces, engineers should be able to see easily how performance issues detected in RUM data impact customer experience from a business perspective.
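One mechanism that makes this frontend-to-backend mapping possible is trace context propagation. The sketch below illustrates the W3C Trace Context `traceparent` header that standards-based RUM tools attach to outgoing browser requests, so frontend transactions and backend spans share a trace ID. The IDs shown are the examples from the W3C specification, not real data, and the parsing is simplified:

```python
# Sketch of the W3C Trace Context "traceparent" header, which ties a
# browser-side RUM transaction to the backend spans it triggered.
import re

def make_traceparent(trace_id: str, parent_id: str, sampled: bool = True) -> str:
    """Build a version-traceid-parentid-flags header per W3C Trace Context."""
    return f"00-{trace_id}-{parent_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> dict:
    m = re.fullmatch(r"(\w{2})-(\w{32})-(\w{16})-(\w{2})", header)
    assert m, "malformed traceparent"
    return {"version": m[1], "trace_id": m[2], "parent_id": m[3], "flags": m[4]}

# Example IDs taken from the W3C Trace Context spec:
hdr = make_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
print(parse_traceparent(hdr)["trace_id"])  # the shared frontend/backend trace ID
```

Because every service in the request path propagates the same trace ID, a slow page load seen in RUM data can be joined directly to the backend trace that caused it.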

OpenTelemetry standardization

In order to avoid locking themselves into proprietary tools, developers and SREs should choose RUM solutions that rely on community-defined standards for data ingestion. On this front, OpenTelemetry is the clear choice.

By using OpenTelemetry to collect data, you ensure that your instrumentation is compatible with any standards-based RUM tool, and that it will continue to work if you change solutions or vendors. Because OpenTelemetry eliminates the need for developers to invest time in building their own custom instrumentation for data ingestion, or for your SREs and IT Ops engineers to learn a custom ingestion framework, it simplifies and speeds RUM deployment.

At the same time, because OpenTelemetry works with virtually any type of application and in any environment, it frees teams to update and evolve their applications as they need without having to worry that the changes will break their RUM tooling. The result is applications that can scale seamlessly, unencumbered by visibility challenges.

Conclusion

RUM is not a new discipline. RUM tools have played a central role in SRE, IT Ops, and development workflows for well over a decade.

What has now changed, however, is the depth and actionability of RUM data. By delivering complete, unsampled data, allowing teams to isolate different types of metrics and problems, and mapping data to runtime architectures, Splunk RUM empowers businesses to translate real-user data into real-world customer experience optimizations. And because Splunk RUM is part of Splunk Observability Cloud, it’s easy to contextualize RUM data alongside the infrastructure and application insights the rest of the suite provides, making the troubleshooting process even faster and more efficient.

Learn more about how Splunk RUM can help your team make your applications better by understanding your users’ experience. Watch this demo to see it in action, then start your free trial of Splunk Observability Cloud.
