By Mike Mackrory
Distributed systems using the microservices architecture pattern have taken the world by storm. The emergence of the public cloud, the subsequent explosion of containers, and then the adoption of serverless technologies have further increased this approach's popularity.
Migrating to microservices has many advantages, but it comes with some challenges as well. One of those challenges is tracking a consumer's journey as the system routes their request. How you tackle this challenge becomes critical when a request fails or returns unexpected results. In this article, we will explore distributed tracing and how to implement it in our Java applications. We'll discuss how we can use this approach to monitor our applications' performance and identify opportunities to optimize it.
The Good, The Bad and the Challenge of Distributed Systems
Before we dive into this topic, let's briefly discuss the different advantages and challenges of distributed systems. Distributed systems are generally broken down into specific services that perform a particular task. This component-based approach simplifies both the development and maintenance of the system. As traffic begins to flow through the system, we can scale out only those components and services that require additional resources. This dynamic nature allows the system to better handle any changing traffic patterns, which results in resource savings.
Monitoring and observability can be especially challenging with distributed systems. Each service can be composed of many separate nodes or instances. As the system routes consumer requests, it's incredibly challenging to understand precisely which nodes and services handle each request. If one part of the request fails, you need a way to know where the failure occurred and what its ramifications are.
In addition to monitoring and observability, we need to understand how the system performs as it handles consumer requests. The correlation between consumer traffic and application performance is essential to troubleshooting problems and optimizing performance.
Introducing Distributed Tracing
We can solve the problem of performance monitoring and observability by implementing distributed tracing. This approach is especially advantageous in ephemeral environments like serverless or Kubernetes, where the underlying infrastructure may only exist temporarily. A distributed tracing solution has two critical components: the trace ID and the span.
We use the term span to represent a specific unit of work within the system. Within the span, we can typically find a collection of data about the task it represents, such as its name and timestamps indicating the work's start and end times. A span may also include references to other spans. Suppose our system manages a catalog of products. In that case, we might have a collection of spans that describe the original request and child requests to services such as availableInventory, price, sizes, shippingInformation, etc.
When a new request enters our system, it is assigned a unique identifier called a trace ID. As the request is processed by each service, we capture each part of the work in a span. Each span includes the trace ID. The trace ID connects all the spans into a complete trace. We can use the trace to track the request through our system, including analyzing the time required for each part of the process and which components handled the request.
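The relationship between a trace ID and its spans can be sketched in plain Java. This is a minimal illustration rather than any tracing library's actual data model: the TraceSketch class, its Span fields, the UUID-based identifiers, and the sample timings are all assumptions made for the example. Only the service names come from the catalog scenario above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

class TraceSketch {

    // Hypothetical, simplified span; real tracing libraries carry more fields.
    static final class Span {
        final String traceId;       // shared by every span in the same trace
        final String spanId;        // unique to this unit of work
        final String parentSpanId;  // null for the root span
        final String name;
        final Instant start;
        final Instant end;

        Span(String traceId, String spanId, String parentSpanId,
             String name, Instant start, Instant end) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
            this.name = name;
            this.start = start;
            this.end = end;
        }

        Duration duration() {
            return Duration.between(start, end);
        }
    }

    // Illustrative identifier only; real systems define their own ID formats.
    static String newId() {
        return UUID.randomUUID().toString().replace("-", "");
    }

    // Builds a root span for a catalog request plus two child spans.
    static List<Span> sampleTrace() {
        String traceId = newId();
        Instant t0 = Instant.parse("2021-01-01T00:00:00Z");
        Span root = new Span(traceId, newId(), null, "getProduct",
                t0, t0.plusMillis(120));
        List<Span> spans = new ArrayList<>();
        spans.add(root);
        spans.add(new Span(traceId, newId(), root.spanId, "availableInventory",
                t0.plusMillis(5), t0.plusMillis(45)));
        spans.add(new Span(traceId, newId(), root.spanId, "price",
                t0.plusMillis(50), t0.plusMillis(110)));
        return spans;
    }

    public static void main(String[] args) {
        // Every line shares one trace ID, which is what ties the spans together.
        for (Span span : sampleTrace()) {
            System.out.printf("trace=%s span=%s parent=%s %s took %d ms%n",
                    span.traceId, span.spanId, span.parentSpanId,
                    span.name, span.duration().toMillis());
        }
    }
}
```

Because each child span records both the shared trace ID and its parent's span ID, the full request can be reassembled later even though the spans were produced by different services.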
From Distributed Tracing to Application Performance Monitoring
An Application Performance Monitoring (APM) system is a tool that allows you to aggregate, observe, and monitor your distributed systems. APM systems can gather data from various sources, and distributed trace data helps you gain insights and identify ways to optimize your applications.
The APM system aggregates our trace data to provide a holistic view of how our application processes requests, allowing us to report on performance for the system as a whole and for each individual service. These aggregations are significant in establishing baselines of performance for each service and identifying trends over time.
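To make the idea of a per-service baseline concrete, here is a minimal sketch of the kind of aggregation involved. It is not how any particular APM product works: the BaselineSketch class, the SpanTiming type, and the sample durations are hypothetical, and a real tool would likely track percentiles and time windows rather than a single average.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class BaselineSketch {

    // Hypothetical minimal view of a span: which service did the work, and for how long.
    static final class SpanTiming {
        final String service;
        final long durationMillis;

        SpanTiming(String service, long durationMillis) {
            this.service = service;
            this.durationMillis = durationMillis;
        }
    }

    // Groups span timings by service and averages the durations in each group.
    static Map<String, Double> averageByService(List<SpanTiming> spans) {
        return spans.stream().collect(Collectors.groupingBy(
                (SpanTiming s) -> s.service,
                TreeMap::new,
                Collectors.averagingLong((SpanTiming s) -> s.durationMillis)));
    }

    // Made-up sample data standing in for spans collected from many traces.
    static List<SpanTiming> sampleSpans() {
        return List.of(
                new SpanTiming("price", 60),
                new SpanTiming("price", 80),
                new SpanTiming("price", 100),
                new SpanTiming("availableInventory", 40),
                new SpanTiming("availableInventory", 50));
    }

    public static void main(String[] args) {
        averageByService(sampleSpans()).forEach((service, avg) ->
                System.out.printf("%s baseline: %.1f ms%n", service, avg));
    }
}
```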
Ideally, as we identify parts of the system where we could improve performance, we should see the processing time trend lower over time. We can also watch for trends where times increase, pointing to potential problems within our system. An effective APM system will automatically perform much of this analysis for you. You can also configure the system to alert support personnel or owners of a particular service if it encounters service degradation.
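A degradation alert of this kind can be reduced to a comparison between recent timings and an established baseline. The sketch below assumes a simple rule, "alert when the recent average exceeds the baseline by more than a tolerated fraction". The DegradationCheck class, the isDegraded method, and the 25% tolerance are illustrative choices, not the behavior of any specific APM system.

```java
import java.util.Arrays;

class DegradationCheck {

    // Returns true when the average of the recent durations exceeds the
    // baseline by more than the tolerated fractional increase (0.25 = 25%).
    static boolean isDegraded(double baselineMillis, double[] recentMillis,
                              double toleratedIncrease) {
        if (recentMillis.length == 0) {
            return false; // no recent data, so nothing to alert on
        }
        double recentAverage = Arrays.stream(recentMillis).average().orElse(baselineMillis);
        return recentAverage > baselineMillis * (1.0 + toleratedIncrease);
    }

    public static void main(String[] args) {
        double baseline = 80.0; // hypothetical baseline for a "price" service
        double[] slowWindow = {100, 110, 120};
        double[] normalWindow = {82, 85, 90};
        System.out.println("slow window degraded: " + isDegraded(baseline, slowWindow, 0.25));
        System.out.println("normal window degraded: " + isDegraded(baseline, normalWindow, 0.25));
    }
}
```

In practice the threshold and the window size are tuning decisions: a tolerance that is too tight produces noisy alerts, while one that is too loose hides gradual regressions.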
What Does This Mean For The Performance of Your Java Applications?
Now that we have explored distributed tracing, it’s time to discuss using these concepts within our Java applications to monitor and optimize them. Before we begin, it’s worth noting that the ideas behind distributed tracing and APM tools apply across many languages and platforms.
The benefit of a mature language, such as Java, is the availability of open-source tools and utilities to make our jobs easier. One of these tools is OpenTelemetry. OpenTelemetry includes a Java agent JAR that allows you to capture telemetry from your code. The agent dynamically injects bytecode to automatically capture telemetry from Hibernate, Kafka, and many other supported libraries and frameworks.
In February 2021, the OpenTelemetry tracing specification reached v1.0.0, offering users a stable and well-tested version of the specification. There is also a Splunk distribution of OpenTelemetry Java that supports automatic capture and transmission of distributed tracing data to Splunk APM, making it even easier to instrument and monitor your applications.
There are different ways in which you can instrument your code using OpenTelemetry. You can create your own span objects and inject the context and custom attributes into them by following the OpenTelemetry instrumentation instructions, or you can use OpenTelemetry’s auto-instrumentation to leverage your existing Log4j log statements.
Unleashing the Power of Distributed Tracing
This article discussed distributed tracing and how we can use APM tools to collect, aggregate, and analyze our trace data. We can use tools like Splunk APM to identify trends, configure alerts, and often leverage built-in functionality to automate much of this process.
Finally, we discussed how you could integrate distributed tracing into your Java applications using OpenTelemetry. Once you've done this, you'll be able to understand your application's performance at the microservice level. You can also visualize and better understand the relationships between each microservice. With this in place, you can begin to look for opportunities to optimize your applications. One area to look at is the microservices that take the longest to process requests, which a holistic tool such as Splunk Observability Cloud can help you identify. You might also consider whether you should process requests in parallel rather than sequentially.
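As a rough illustration of the parallel-versus-sequential question, the sketch below issues two independent downstream calls both ways using CompletableFuture from the Java standard library. The callService method is a stand-in that only sleeps to simulate latency. The ParallelCallsSketch class, the 100 ms latencies, and the ":ok" responses are assumptions for the example, and whether parallelism is safe depends on the calls being truly independent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ParallelCallsSketch {

    // Stand-in for a remote call: sleeps to simulate latency, then returns a fake response.
    static String callService(String service, long latencyMillis) {
        try {
            Thread.sleep(latencyMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while calling " + service, e);
        }
        return service + ":ok";
    }

    // Calls each service one after the other; total time is roughly the sum of latencies.
    static List<String> fetchSequentially() {
        List<String> results = new ArrayList<>();
        results.add(callService("availableInventory", 100));
        results.add(callService("price", 100));
        return results;
    }

    // Starts both calls at once; total time is roughly the slowest single call.
    static List<String> fetchInParallel() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<String> inventory =
                    CompletableFuture.supplyAsync(() -> callService("availableInventory", 100), pool);
            CompletableFuture<String> price =
                    CompletableFuture.supplyAsync(() -> callService("price", 100), pool);
            return List.of(inventory.join(), price.join());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        fetchSequentially();
        long sequentialMillis = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        fetchInParallel();
        long parallelMillis = (System.nanoTime() - start) / 1_000_000;

        System.out.println("sequential: " + sequentialMillis + " ms");
        System.out.println("parallel:   " + parallelMillis + " ms");
    }
}
```

Trace data is what tells you whether a change like this is worthwhile: if the child spans of a request run back to back and do not depend on each other's results, they are candidates for parallel execution.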
Watch this demo to see how the data you gather and analyze in your APM portal can provide continual, analytics-powered insights into your system with full-stack, end-to-end visibility. Sign up for your free trial today and start identifying areas that could benefit from optimization.