
What Is Distributed Tracing and Why You Need It

It is no surprise that monitoring workloads is top of mind for many organizations that want to ensure a successful customer experience. As applications become more distributed and cloud-native, monitoring becomes more complex. A single user transaction fans out to tens or hundreds of microservices, each requesting data from backend data stores or interacting with other services and other parts of your infrastructure. Determining exactly what it takes to fulfill a user’s request becomes more challenging over time.

How can we easily find bottlenecks in our systems and clearly understand where time is spent? Attaining observability into these modern environments sounds like a daunting task, but this is where instrumenting your applications to generate spans and traces can help.

What is a Trace, and Why is it Important?

A trace is a collection of spans representing a unique user or API transaction handled by an application and its constituent services. One trace represents one user interaction. Each span records a single operation and contains a beginning and ending time, a trace ID that correlates it to the specific user transaction involved, and one or more identifiers (or tags) that add information about the request, such as the particular version of the microservice that generated the span.

Because spans are produced by services distributed across an architecture, the industry refers to this process of tagging spans and correlating them together as distributed tracing. Distributed tracing follows a request (transaction) as it moves between multiple services within a microservices architecture, letting engineers see where the request originates (for example, a user-facing frontend application) and every service it touches along the way. Because customer experience is so vital and modern architectures are so complex (one user transaction can involve services hosted on-premises, in multiple clouds, or even serverless function calls), access to this telemetry is essential. You gain better visibility into where your application spends the most time and can easily identify bottlenecks that affect application performance.

Let’s consider a simple client-server application. The client begins by sending the server a request for a specific customer. The server processes the request and sends a response back to the client. From the client’s perspective, a single action has occurred: it sent a request and got a response. On the server side, each operation performed to satisfy that request is recorded as a span. As the client performs more transactions with the server, more spans are generated, and they are correlated together by a shared trace context. The trace context is the glue that holds the spans together.

Take a look at the breakdown below. 

  1. Client sends a customer name request to Server at time: X (Trace Context: customerrequest1, SpanID: 1, timestamp: X)
  2. Server receives customer name request from Client at time: Y (Trace Context: customerrequest1, SpanID: 2, timestamp: Y)
  3. Server parses the request from the Client at time: Z (Trace Context: customerrequest1, SpanID: 3, timestamp: Z)
     

Note that the trace context remains the same, tying each span together and letting the infrastructure know that each span belongs to the same transaction.
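The breakdown above can be sketched in plain Python. This is a simplified illustration of the correlation idea, not a real tracing library: each span carries the shared trace context plus its own span ID and timestamp, and a backend groups spans by trace context to reassemble the transaction.

```python
from dataclasses import dataclass

@dataclass
class Span:
    trace_context: str  # shared across all spans in one transaction
    span_id: int        # unique to this single operation
    operation: str
    timestamp: float

# The three operations from the breakdown above, all sharing one trace context.
spans = [
    Span("customerrequest1", 1, "client sends customer name request", 100.0),
    Span("customerrequest1", 2, "server receives customer name request", 100.2),
    Span("customerrequest1", 3, "server parses the request", 100.3),
]

# A tracing backend reassembles the trace by grouping spans on trace context.
trace = [s for s in spans if s.trace_context == "customerrequest1"]
```

Because every span shares the `customerrequest1` context, the backend can order them by timestamp and reconstruct the full journey of that one user transaction.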

How Do We Generate Traces? 

To gather traces, applications must be instrumented. Instrumenting an application means using a framework like OpenTelemetry to generate traces and measure application performance, so you can discover where time is spent and locate bottlenecks quickly. With applications written in different languages, spread across distributed microservices, and built by different teams worldwide, it helps to have an open, vendor-agnostic framework for instrumentation. For many languages, OpenTelemetry provides automatic instrumentation of your application; others must be instrumented manually. The good news is that OpenTelemetry is the industry standard for observability data, so you only have to do the instrumentation work once, no matter which observability vendor you choose.

How Do I Collect and Export My Traces?

Once your application has been instrumented, you’ll want to begin collecting this telemetry using a collector, such as the Splunk OpenTelemetry Collector. The collector provides a unified way to receive, process, and export application telemetry to an analysis tool like Splunk APM, where you can build dashboards, map business workflows, and identify critical metrics.
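The receive/process/export flow is expressed directly in an OpenTelemetry Collector configuration. The fragment below is an illustrative sketch, not a Splunk-specific setup: the exporter endpoint is a placeholder, and your vendor's distribution will document its own exporter and authentication settings.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:          # batches spans before export for efficiency

exporters:
  otlp:
    endpoint: ingest.example.com:4317   # placeholder backend endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```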

One popular Splunk APM feature is Tag Spotlight, which lets you quickly correlate events like increases in latency or errors with tag values, providing a one-stop shop for understanding how traces behave across your entire application. Another great feature is the AI-driven approach that sifts through trace data in seconds and immediately highlights, in the dynamic service map, which microservice is responsible for errors. In the image below, can you guess which microservice is ultimately responsible for the errors in the application? Tag Spotlight lets you dive even deeper to determine which version of paymentService is responsible.

Splunk APM Dynamic Service Map

To see an example of automatic instrumentation with the Splunk OpenTelemetry Collector, check out my recent blog post, How to Instrument a Java App Running in Amazon EKS, where we auto instrument a basic Java application running in Amazon EKS and review trace data using Splunk APM.

Conclusion

Traditional monitoring tools built for monolithic applications cannot serve today’s complex cloud-native architectures. Give distributed tracing a try for your application with Splunk Observability. You can sign up for a free trial of the suite of products – from Infrastructure Monitoring and APM to Real User Monitoring and Log Observer. Get a real-time view of your tracing telemetry and start solving problems faster today.

Additionally, if you’d like to learn more about OpenTelemetry, why not try our OpenTelemetry game, Pipe Storm? Through December 10, 2021, you can play and win a t-shirt and maybe even an Oculus Quest 2. (Some restrictions apply, see game page for details.)


Glossary

  • Trace: A collection of spans representing a unique user or API transaction handled by an application and its constituent services.
  • Span: A single operation within a trace, with a beginning and ending time.
  • Tag: A key-value identifier attached to a span to add information and context, such as the version of the microservice that generated it.
  • Trace Context: The shared identifiers (such as the trace ID) propagated with a request so that spans from different services can be correlated into a single trace.

Johnathan is part of the Observability Practitioner team at Splunk, and is here to help tell the world about Observability. Johnathan’s career has taken him from IT Administration to DevOps Engineer to Product Marketing Management. In addition to Observability, Johnathan’s professional interests include training, DevOps culture, and public speaking. Johnathan holds a Bachelor’s Degree of Science in Network Administration from Western Governors University.
