Published Date: January 26, 2023
Distributed tracing, also known as distributed request tracing, is a method of monitoring and observing service requests in applications built on a microservices architecture. Distributed tracing is used by IT and DevOps teams to track requests or transactions through the application they are monitoring — gaining vital end-to-end observability into that journey. This lets them identify any issues, including bottlenecks and bugs, that could be having a negative impact on the application’s performance and affect user experience.
(Visit the Splunk Observability page to learn more about the only full-stack, analytics-powered and OpenTelemetry-native observability solution.)
Tracing is a basic and important process in software engineering to gather more information about an application’s performance, but it can be less effective when used with applications built on a distributed software architecture, such as microservices. Microservices, because of the way they are constructed, scale independently from one another. Therefore, it’s normal to have multiple instances of a single service running at the same time on different servers, in different locations and different environments. Requests that come from an environment like this are nearly impossible to monitor using traditional, single-service methods.
Distributed tracing solves this problem by tracking end-user requests through each service or module and providing a holistic view of the request. Anyone wishing to monitor the request (analysts, software reliability engineers, developers and others) can observe each iteration of a function and conduct performance monitoring by noting which instance of a function is causing the issue.
In the pages that follow, we’ll take a deep dive into distributed tracing and the technologies used to make it possible in your enterprise.
How does distributed tracing work?
To understand the process of distributed tracing, it helps to understand first how a single request is handled. Tracing starts the moment an end user interacts with an application. When the user sends an initial request — adding an item to their cart, for example — it is assigned a unique trace ID. As the request moves through the host system, every operation performed on it (called a “span” or a “child span”) is tagged with that first request’s trace ID, as well as its own unique ID, plus the ID of the operation that originally generated the current request (called the “parent span”).
Each span represents one segment of the request’s path and includes important information related to the service performing the operation. These can include:
- The name and address of the process handling the request.
- Logs and events that provide context about the process’s activity.
- Tags to query and filter requests by session ID, database host, HTTP method, and other identifiers.
- Detailed stack traces and error messages in the event of a failure.
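The trace-ID and span-ID scheme described above can be sketched in a few lines of Python. The `Span` class and its field names here are illustrative, not any particular tracer's API:

```python
import uuid

def new_id():
    """Generate a random identifier for traces and spans."""
    return uuid.uuid4().hex[:16]

class Span:
    """Illustrative span: one operation performed on a traced request."""
    def __init__(self, name, trace_id=None, parent_id=None):
        self.name = name
        self.trace_id = trace_id or new_id()  # shared by every span in the request
        self.span_id = new_id()               # unique to this operation
        self.parent_id = parent_id            # the span that triggered this one (None for the root)
        self.tags = {}                        # e.g. session ID, database host, HTTP method

    def child(self, name):
        """Start a child span that inherits this span's trace ID."""
        return Span(name, trace_id=self.trace_id, parent_id=self.span_id)

# The initial request becomes the root span; downstream operations become children.
root = Span("POST /cart/add")
db = root.child("INSERT cart_items")
assert db.trace_id == root.trace_id   # same end-to-end trace
assert db.parent_id == root.span_id   # parentage links the spans into a tree
```

A tracing backend reconstructs the full request path by grouping spans on `trace_id` and ordering them by the parent/child links.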
A distributed tracing tool like Zipkin or Jaeger (both of which we will explore in more detail in a bit) correlates the data from all the spans and formats them into visualizations that are available on request through a web interface or provided automatically through alerting or AIOps tools.
Now think of a popular microservice-based online video game with millions of users. It needs to keep track of each player’s location, every interaction they have with one another, the items they pick up in the game and a variety of other data generated during play. Keeping the game running smoothly would be unthinkable with traditional tracing methods, but distributed request tracing makes it possible.
A typical microservice architecture
What are the benefits of distributed tracing?
The main benefit of distributed tracing is visibility into real user transactions in one place regardless of the complexity of your underlying application or infrastructure. Some benefits that come from a more holistic approach include:
- Increased productivity: The disjointed nature of microservice architectures makes application performance monitoring — including functions such as tracking down and fixing performance issues — time consuming and expensive compared to monolithic applications. Additionally, the way failure data is delivered in microservices isn’t always clear and often requires developers to decipher issues from error messages and arcane status codes. Distributed tracing provides a more holistic view of distributed systems, reducing the time developers spend diagnosing and debugging request failures and latencies. Troubleshooting root cause also becomes more efficient, helping improve mean time to recovery/repair (MTTR).
- Improved collaboration among teams: In a microservice environment, each process is generally the responsibility of a particular team. This can cause problems when it becomes necessary to identify errors and determine who is responsible for fixing them. Distributed tracing helps identify which team should be responsible for fixing issues, while accelerating response time and enabling teams to work together more effectively.
- Flexible implementation: Distributed tracing tools work with a wide variety of applications and programming languages, so developers can incorporate them into virtually any system and view data through one tracing application.
What are the different types of tracing tools?
- Code tracing: Code tracing is the practice of working through each line of code in an application and recording its effect by hand, rather than using a debugger (which automates the process) to trace a program’s execution. Manually tracing small blocks of code can be more efficient because the programmer doesn’t need to run the entire program to identify the effects of small edits.
- Data tracing: Data tracing helps check the accuracy and data quality of critical data elements (CDEs), trace them back to their source systems, and monitor and manage them using statistical methods. Typically, the best way to perform accuracy checks is to trace operations to their origins and validate them with source data, although historically this hasn’t been cost-effective in large operational processes.
- Program trace (ptrace): A program or stack trace is an index of the instructions executed and data referenced during the running of an application. The information displayed in a program trace includes the program name, language, and the source statement that was executed, among other data, and is used in the process of debugging an application.
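To illustrate the first item, code tracing can be as simple as annotating a small function (this one is hypothetical) with the values a programmer would record line by line, without running a debugger:

```python
# A small function traced by hand: the comments record, line by line,
# what a programmer would note for the input [3, 5, 2].
def running_total(values):
    total = 0            # total = 0
    for v in values:     # v takes 3, then 5, then 2
        total += v       # total: 3 -> 8 -> 10
    return total         # returns 10

assert running_total([3, 5, 2]) == 10
```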
What is distributed logging?
With distributed logging, log files are not centralized but rather kept separate. This can be advantageous for a number of reasons, compared to centralized logging.
For one, shipping logs across a network to a central location can consume a lot of bandwidth. Depending on your network and the number and frequency of logs being generated, that could cause centralizing logs to compete with more critical applications and processes.
Distributed logging may also be preferable for large-scale systems. If an application uses many microservices, it will necessarily generate more log messages. In this case, distributed logging may be more efficient and cost effective.
Distributed tracing tools can also draw on other data sources, such as metrics and traces, so that logs only need to be reviewed for specific services after those services have been identified as problematic.
How does microservices logging work?
Microservices logging is guided by a set of best practices that address the loosely coupled, modular nature of microservice architecture. The goal is to bring coherence to the system for more efficient and accurate troubleshooting and debugging.
Correlating requests: Each service in a microservice system interacts with the others to fulfill a request. Tagging the initial request with a unique ID allows you to easily track it through the system, identify potential errors and reveal whether they were caused by the previous service request or the next one. A developer can enter that unique ID into the log aggregator search engine to pull up the logs from all services for analysis.
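A minimal sketch of the request-correlation practice, assuming an `X-Correlation-ID` header; the handler names and header name are illustrative, not any framework's API:

```python
import uuid

def handle_inbound(headers):
    """Reuse the caller's correlation ID, or mint one at the edge service."""
    return headers.get("X-Correlation-ID") or uuid.uuid4().hex

def call_downstream(corr_id, payload):
    """Every outbound call forwards the same ID so logs can be joined later."""
    headers = {"X-Correlation-ID": corr_id}
    # requests.post(url, json=payload, headers=headers)  # real HTTP call elided
    return headers

cid = handle_inbound({})                        # edge service mints a new ID
fwd = call_downstream(cid, {"item": "sku-42"})
assert fwd["X-Correlation-ID"] == cid           # downstream logs carry the same ID
```

Searching the log aggregator for that one ID then pulls up the records every service wrote while handling the request.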
Logging information: More log information means more context to help the user understand a problem. The name of the service generating the log message, correlation ID, the IP address of the server and the client making the request, and the date and time the message was sent and received are just a few of the data points you should consider including.
Structuring log data: One of the advantages of a microservice architecture is the ability to use different technology stacks. However, the resulting numerous log formats often create significant challenges around analysis. Structuring the data in a standard format, like JavaScript Object Notation (JSON) for example, will make them easier to parse and allow you to search them by a variety of fields from a central location.
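As a sketch of structured logging, the hypothetical helper below (the function name and fields are illustrative) emits each log record as one JSON line that any aggregator can parse and index by field:

```python
import json
from datetime import datetime, timezone

def json_log(service, message, correlation_id, **fields):
    """Emit one structured log record as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "correlation_id": correlation_id,
        "message": message,
        **fields,  # any extra context: client IP, HTTP method, ...
    }
    return json.dumps(record)

line = json_log("cart-service", "item added", "abc123",
                client_ip="10.0.0.7", http_method="POST")
parsed = json.loads(line)   # an aggregator can parse it straight back
assert parsed["service"] == "cart-service"
```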
Centralizing logs: Having to access and correlate logs from individual servers drains valuable time and energy, a cost that grows as the number of microservices grows. Centralized logging solves this problem. Also, if a server or container is terminated without warning, its logs disappear with it. With centralized logging, logs are sent to a central repository every few minutes, reducing the chance of irreparable loss.
What are the open distributed tracing standards (OpenTracing, OpenCensus, OpenTelemetry)?
OpenTracing and OpenCensus were competing open source distributed tracing projects that have since merged into a single project called OpenTelemetry.
Hosted by the Cloud Native Computing Foundation (CNCF), OpenTracing attempts to provide a standardized API for tracing, letting developers embed instrumentation in commonly used libraries or their own custom code without vendor lock-in. Though this provided much-desired flexibility, the API’s sole focus on tracing made it of limited use on its own and led to inconsistent implementations by developers and vendors.
OpenCensus was developed at Google and was based on its internal tracing platform. Once it was open sourced, Microsoft, along with other vendors and contributors, began directing the standard. OpenCensus is a set of multi-language libraries that collects metrics about application behavior, transferring that data to any backend analysis platform of the developer’s choosing. It can also trace messages, requests, and services from their source to their destinations. With no API available to embed OpenCensus into code, developers used community-built automatic instrumentation agents for the task.
OpenTelemetry, which is managed by the CNCF, merges the code bases of OpenTracing and OpenCensus, relying on the strengths of each. Currently in beta, OpenTelemetry offers “a single set of APIs, libraries, agents, and collector services” for capturing distributed traces and metrics from an application that can be analyzed using popular observability tools. In the near future, OpenTelemetry will add logging capability to its data capture support.
If you want to read more about OpenTelemetry, you may enjoy Splunk’s children’s book, Amir and the Magical Lens.
What is Jaeger or Zipkin tracing?
Jaeger and Zipkin are two popular open source request tracing tools, each with similar components: a collector, datastore, query API and web user interface. Instrumentation in the application traces outgoing requests; the collector then records and correlates the data between different traces and sends it to a database, where it can be queried and analyzed through the UI.
Jaeger and Zipkin are differentiated by their architecture and programming language support; Jaeger is implemented in Go and Zipkin in Java. Zipkin supports virtually every major programming language, with dedicated libraries for Java, JavaScript, C, C++, C#, Python, Go, Scala and others. Jaeger’s supported-language list is shorter: C#, Java, Node.js, Python and Go.
What is AWS X-Ray?
AWS X-Ray is the native distributed tracing tool for Amazon Web Services (AWS). As the world’s largest cloud service provider, Amazon was at the forefront of the movement from monolithic to microservice-driven applications, and as such, developed its own tracing tool.
As with similar tools, AWS X-Ray traces user requests through an application, collecting data that can help find the cause of latency issues, errors, and other problems. This trace data is formatted into a service map that developers can parse to locate and identify problems.
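To show how a service map can be derived from trace data, here is a sketch in plain Python. It is not the X-Ray API; the span records and service names are invented for illustration. It collapses span parent/child links into caller-to-callee edges:

```python
from collections import defaultdict

# Each record: (span_id, parent_id, service_name) — simplified trace data.
spans = [
    ("s1", None, "frontend"),
    ("s2", "s1", "cart-service"),
    ("s3", "s2", "inventory-db"),
    ("s4", "s1", "auth-service"),
]

def service_map(spans):
    """Collapse span parentage into caller -> callee service edges with call counts."""
    by_id = {sid: svc for sid, _, svc in spans}
    edges = defaultdict(int)
    for sid, parent, svc in spans:
        if parent is not None:
            edges[(by_id[parent], svc)] += 1  # count calls along each edge
    return dict(edges)

m = service_map(spans)
assert m[("frontend", "cart-service")] == 1
assert m[("cart-service", "inventory-db")] == 1
```

A tool then renders these edges as the graph developers parse to locate latency and error hot spots.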
Naturally, AWS X-Ray works well with other Amazon services such as AWS Lambda, Amazon EC2 (Elastic Compute Cloud), Amazon Elastic Container Service (Amazon ECS), and AWS Elastic Beanstalk. It can be used in both an app’s build and testing stages, as well as servicing the app once it’s in production.
What is a log in Kafka?
Kafka is a distributed streaming platform, providing a high-throughput, low-latency platform for handling real-time data feeds, often used in microservice architectures. It’s used to process streams of records in real time, publish and subscribe to those record streams in a manner similar to a message queue, and store them in a “fault-tolerant durable way.”
Kafka uses “topics” — a category or feed name to which records are published — to abstract streams of records. For each topic, Kafka maintains a partitioned log, an ordered, continually appended sequence of records that can serve as an external commit log for a distributed system.
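The partitioned-log idea can be sketched in a few lines of Python. This toy `PartitionedTopic` class is not the Kafka API (the real client and broker are far more involved); it only shows how records with the same key are appended in order to one partition and replayed by offset:

```python
class PartitionedTopic:
    """Toy model of a Kafka topic: N append-only partition logs with offsets."""
    def __init__(self, partitions=3):
        self.partitions = [[] for _ in range(partitions)]

    def publish(self, key, value):
        """Records with the same key land in the same partition, in order."""
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def read(self, partition, offset):
        """Consumers replay the log from any committed offset."""
        return self.partitions[partition][offset:]

topic = PartitionedTopic()
p, off = topic.publish("player-7", {"event": "item_pickup"})
topic.publish("player-7", {"event": "move"})
assert [r["event"] for r in topic.read(p, off)] == ["item_pickup", "move"]
```

Because consumers track their own offsets, the same log can feed many independent readers, which is what lets Kafka serve as an external commit log for a distributed system.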
What are the best log aggregation & monitoring tools?
While there are several good log aggregation and monitoring tools on the market today, these are some of the most popular.
Elastic (formerly ELK: Elasticsearch, Logstash, Kibana): One of the most popular stacks for distributed systems, Elastic combines three essential tools. Logstash aggregates log files, Elasticsearch lets you index and search through the data, and Kibana provides a data visualization dashboard. Open source and free, the stack can be implemented in its entirety or the tools can be used individually.
Loggly: This cloud-hosted log manager and analyzer was built by and for DevOps folks. It was designed to handle huge volumes of log data via an easy-to-navigate interface and is primarily used for troubleshooting and customer support. It also comes with a RESTful API, allowing it to be integrated into other tools.
Papertrail: Papertrail doesn’t aggregate logs but rather gives you an easy way to comb through the ones you’re already collecting. It’s easy to install and has a clean interface that gives you a consolidated view of data from the browser, command line, or an API.
Graylog: Another open source log analyzer, Graylog was created expressly to help developers find and fix errors in their applications. It has a simple UI that’s built for speed, and it can manage a wide range of data formats.
The advantages and ever-expanding roster of use cases of microservices for building cloud-based applications are well documented and adoption shows no signs of slowing. As these systems grow more complex, distributed request tracing offers a huge advantage over the older, needle-in-a-haystack approach to tracking down the problems that could disrupt your services. If you’re responsible for a microservice-based system, equipping your enterprise with this powerful tool can optimize how you do your job.

Splunk Observability and IT Predictions 2023
Splunk leaders and researchers weigh in on the biggest industry observability and IT trends we’ll see this year.