By Mike Mackrory
Cloud-native applications and most modern computing systems employ a distributed services architecture. The benefits of these systems include rapid development and deployment as well as flexible dynamic scaling. Unfortunately, the nature of these systems can make monitoring and troubleshooting an incredibly challenging endeavor.
These systems also have a very diverse user base, and the variety of their needs, together with the complexities of distributed systems, can make operational support more challenging. This article will address these challenges and explain how Real User Monitoring (RUM) and distributed tracing can simplify them. As you will see, RUM and distributed tracing can be used as powerful allies to help identify and resolve problems.
RUM – and Why It’s Important
Real User Monitoring (or RUM) allows you to see how users really interact with your applications and services. RUM can be extremely useful when you want to understand how your applications perform in the real world. While many teams can implement performance and load testing with synthetic data, this isn’t enough to determine how real users are experiencing your app.
RUM data can provide you with information about the following:
- Which paths are your users following when they interact with your application?
- How do factors like speed and responsiveness impact a user’s experience?
- Where are your users located, and how does their geographic location affect their experience?
- Which aspects and features of your application are used the most, and which ones are used the least? Do the latter present unexpected challenges for users?
It’s hard to overstate the value that RUM data provides when it comes to identifying users and understanding their usage patterns. To take it to the next level, you need to unlock the details of their interactions with your distributed system. That brings us to the topic of distributed tracing.
Introduction to Distributed Tracing
In a distributed system, a collection of services or microservices combine to form the application. Each microservice has a single purpose, which makes it relatively simple to build and maintain. As demand on the system increases, you can run more instances of each service to meet that demand, thereby providing additional capacity and improving the system’s resiliency.
When problems occur, it can be challenging to identify the precise path that a request followed through the system and determine which instance of each service was involved in the interaction. If a single instance is faulty, it’s important to identify it and re-route traffic as quickly as possible.
Distributed tracing refers to the practice of assigning a unique identifier, known as a trace id, to each request. As the request travels through the system, the trace id travels with it. Each part of the request’s journey is recorded with timestamps that show how long processing took there; each such part is known as a span. The observability system collects the spans for analysis and reporting. The span data makes it easy to follow a request’s path through the system, and it paints an accurate picture of the instances involved in the transaction and where any problems occurred. You can also use the trace id to investigate problems further in resources such as log files.
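To make the trace id and span idea concrete, here is a minimal sketch that assumes an in-memory collector rather than a real tracing backend. The names `Span`, `Collector`, and `handle`, along with the service and instance names, are hypothetical and chosen only for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Span:
    trace_id: str      # shared by every span that belongs to one request
    service: str       # hypothetical service name
    instance: str      # which instance of the service handled this step
    start: float
    end: float = 0.0

    @property
    def duration_ms(self) -> float:
        # Processing time for this part of the request's journey.
        return (self.end - self.start) * 1000.0


@dataclass
class Collector:
    # Stand-in for an observability backend: it just keeps spans in memory.
    spans: list = field(default_factory=list)

    def record(self, span: Span) -> None:
        self.spans.append(span)

    def path(self, trace_id: str) -> list:
        # Rebuild the request's path: its spans, ordered by start time.
        ordered = sorted(
            (s for s in self.spans if s.trace_id == trace_id),
            key=lambda s: s.start,
        )
        return [(s.service, s.instance) for s in ordered]


def handle(collector, trace_id, service, instance, work):
    # Wrap one unit of work in a span and report it, even if the work fails.
    span = Span(trace_id, service, instance, start=time.monotonic())
    try:
        return work()
    finally:
        span.end = time.monotonic()
        collector.record(span)


# One request, one trace id, two services touched along the way.
collector = Collector()
trace_id = uuid.uuid4().hex
handle(collector, trace_id, "gateway", "gateway-1", lambda: None)
handle(collector, trace_id, "orders", "orders-3", lambda: None)
print(collector.path(trace_id))
```

Because every span carries the same trace id, filtering on that id is enough to recover both the path and the specific instances involved, which is the information you need when a single instance is faulty.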
Troubleshooting with RUM and Distributed Tracing
At this point, it should be apparent that RUM and distributed tracing are both very valuable. RUM provides unique insights into how users interact with your system and how it’s performing. Distributed tracing shows the journey that a request takes through your system, including each service that it interacts with and in what order.
However, if you combine RUM and distributed tracing, you can multiply the benefits of each and dramatically alter the ways in which you monitor your applications, identify problems, and improve your responsiveness. You can do this by connecting the user identifiers to the trace ids of the requests. Let’s explore the benefits of doing just that.
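The article does not prescribe a mechanism for connecting user identifiers to trace ids, so the following is only one possible sketch. It assumes the request carries a header in the W3C Trace Context `traceparent` format (`version-traceid-parentid-flags`) and that the RUM side knows a session identifier. `SessionTraceIndex` and the session id value are hypothetical.

```python
import secrets
from collections import defaultdict


def new_traceparent() -> str:
    # traceparent layout: 2-hex version, 32-hex trace id,
    # 16-hex parent (span) id, 2-hex flags ("01" = sampled).
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-01"


def trace_id_from(traceparent: str) -> str:
    _version, trace_id, _parent_id, _flags = traceparent.split("-")
    return trace_id


class SessionTraceIndex:
    """Hypothetical lookup table linking RUM sessions to trace ids."""

    def __init__(self):
        self._by_session = defaultdict(list)
        self._by_trace = {}

    def link(self, session_id: str, traceparent: str) -> str:
        trace_id = trace_id_from(traceparent)
        self._by_session[session_id].append(trace_id)
        self._by_trace[trace_id] = session_id
        return trace_id

    def traces_for(self, session_id: str) -> list:
        # Every backend request made during one user session.
        return list(self._by_session.get(session_id, []))

    def session_for(self, trace_id: str):
        # Which user session a given backend trace belongs to, if known.
        return self._by_trace.get(trace_id)


index = SessionTraceIndex()
header = new_traceparent()
linked = index.link("session-abc", header)
```

With a two-way index like this, you can start from a user who reports a problem and find their traces, or start from a slow trace and find the affected user session.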
Consolidating Data for Optimal Awareness and Problem Resolution
You can gain powerful insights into your systems if you automatically combine your RUM and distributed tracing data. Monitoring error rates and response times can quickly alert you (and your response systems) to problems in real time.
Using the data generated from RUM, you can rapidly identify which users are experiencing problems and determine whether they have something in common, such as:
- geographic location
- operating system & version
- browser type & version
- source ASN
- target URL path
You can analyze each of these data points to determine commonalities; however, the problem might instead exist on your backend. If you are automatically linking RUM data to the trace ids from your distributed tracing system, you will already have a collection of the requests that are experiencing problems. You can analyze the related APM data and logs to identify potential application anomalies that you need to address.
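The analysis described above can be sketched as a simple grouping exercise. This is a rough illustration, not a description of any particular product: the event fields, the sample data, and the 80% threshold are all made up, and the heuristic only looks at failing events, so it does not check whether healthy users share the same value.

```python
from collections import Counter

# The attributes listed above, as hypothetical field names.
DIMENSIONS = ("geo", "os", "browser", "asn", "url_path")

# Made-up RUM events, each already linked to a backend trace id.
events = [
    {"geo": "US", "os": "Windows 11", "browser": "Chrome 120",
     "asn": "AS64500", "url_path": "/checkout", "error": True, "trace_id": "t1"},
    {"geo": "DE", "os": "macOS 14", "browser": "Safari 17",
     "asn": "AS64501", "url_path": "/checkout", "error": True, "trace_id": "t2"},
    {"geo": "US", "os": "Android 14", "browser": "Chrome 120",
     "asn": "AS64502", "url_path": "/checkout", "error": True, "trace_id": "t3"},
    {"geo": "US", "os": "Windows 11", "browser": "Chrome 120",
     "asn": "AS64500", "url_path": "/search", "error": False, "trace_id": "t4"},
]


def common_factors(events, dimensions=DIMENSIONS, threshold=0.8):
    # Report each dimension where one value covers at least `threshold`
    # of the failing events.
    failing = [e for e in events if e["error"]]
    if not failing:
        return {}
    shared = {}
    for dim in dimensions:
        value, count = Counter(e[dim] for e in failing).most_common(1)[0]
        if count / len(failing) >= threshold:
            shared[dim] = value
    return shared


def failing_trace_ids(events):
    # The requests to look up in APM data and logs.
    return [e["trace_id"] for e in events if e["error"]]


print(common_factors(events))
print(failing_trace_ids(events))
```

In this sample, the failing users differ in location, operating system, browser, and network, but all of them hit the same URL path. That points away from a client-side cause and toward the backend, and the collected trace ids tell you which requests to inspect there.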
Putting This Into Practice
If this strategy sounds appealing to you and your teams, it might be easier (and more straightforward) to implement than you think. You might already have much of the infrastructure in place, for example, or you might be able to add the necessary support with minimal effort. You might also consider partnering with experts in the monitoring and data analytics space.
As an industry leader in monitoring and analytics, Splunk has the experience and expertise to help you get up and running with a RUM solution quickly and with relative ease. You can learn more about Splunk Real User Monitoring (RUM) as an integral part of Splunk’s Observability Cloud. Without a RUM solution, you’re essentially flying blind and lack the data to fully understand and observe your customers' experiences with your product. Watch this demo and start your free trial of Splunk Observability Cloud today.