
What is Distributed Tracing?

Distributed tracing, sometimes called distributed request tracing, is a method to monitor applications built on a microservices architecture.

IT and DevOps teams use distributed tracing to follow the course of a request or transaction as it travels through the application that is being monitored. This allows them to pinpoint bottlenecks, bugs, and other issues that impact the application’s performance.

Tracing is a fundamental process in software engineering, used by programmers, along with other forms of logging, to gather information about an application’s behavior. But traditional tracing runs into problems when it is used to troubleshoot applications built on a distributed software architecture. Because microservices scale independently, it’s common to have multiple instances of a single service running across different servers, locations, and environments simultaneously, creating a complex web through which a request must travel. These requests are nearly impossible to track with traditional techniques designed for a single-service application.

Distributed tracing solutions solve this problem, and numerous other performance issues, because they can track a request through each service or module and provide an end-to-end narrative account of that request. Analysts, SREs, developers, and others can observe each instance of a function, which lets them see which instance is causing the app to slow down or fail and work out how to resolve it.

In the pages that follow, we’ll take a deep dive into distributed tracing and the technologies used to make it possible in your enterprise.

 

How does distributed tracing work?

To quickly grasp how distributed tracing works, it’s best to look at how it handles a single request. Tracing starts the moment an end user interacts with an application. When the user sends an initial request — an HTTP request, to use a common example — it is assigned a unique trace ID. As the request moves through the host system, every operation performed on it (called a “span” or a “child span”) is tagged with that first request’s trace ID, as well as its own unique ID, plus the ID of the operation that originally generated the current request (called the “parent span”).

Each span is a single step on the request’s journey and is encoded with important data relating to the microservice process that is performing that operation. These include:

  • The service name and address of the process handling the request.
  • Logs and events that provide context about the process’s activity.
  • Tags to query and filter requests by session ID, database host, HTTP method, and other identifiers.
  • Detailed stack traces and error messages in the event of a failure.
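
The span fields described above can be sketched as a small data structure. This is a minimal illustration, not the data model of any particular tracing tool: the `Span` class, the `start_trace` and `start_child` helpers, the service names, and the random hex ID format are all assumptions made for the example.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


def new_id() -> str:
    """Return a random hex identifier (hypothetical ID format)."""
    return uuid.uuid4().hex


@dataclass
class Span:
    trace_id: str              # shared by every span in one request
    span_id: str               # unique to this operation
    parent_id: Optional[str]   # None for the root span
    service: str               # name of the service doing the work
    operation: str
    tags: dict = field(default_factory=dict)  # e.g. HTTP method, database host
    logs: list = field(default_factory=list)  # events or error messages


def start_trace(service: str, operation: str) -> Span:
    """Create the root span when the initial request arrives."""
    return Span(new_id(), new_id(), None, service, operation)


def start_child(parent: Span, service: str, operation: str) -> Span:
    """Create a child span that inherits the parent's trace ID."""
    return Span(parent.trace_id, new_id(), parent.span_id, service, operation)


root = start_trace("api-gateway", "GET /checkout")
root.tags["http.method"] = "GET"
child = start_child(root, "payment-service", "charge_card")
child.logs.append("card declined: insufficient funds")
```

Every span created through `start_child` carries the same `trace_id` as the root, which is what later allows a tracing backend to group spans from different services into one request.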

A distributed tracing tool like Zipkin or Jaeger (both of which we will explore in more detail in a bit) can correlate the data from all the spans and format them into visualizations that are available on request through a web interface.
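
To give a rough idea of what correlating spans involves, the sketch below groups span records by trace ID and walks the parent links to rebuild the request path. It is a toy illustration, not how Zipkin or Jaeger are implemented; the record layout, the `render_trace` function, and the sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical span records as a collector might receive them, in arbitrary order.
spans = [
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "service": "db", "duration_ms": 40},
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "gateway", "duration_ms": 120},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "orders", "duration_ms": 90},
    {"trace_id": "t2", "span_id": "x", "parent_id": None, "service": "gateway", "duration_ms": 15},
]


def render_trace(spans, trace_id):
    """Return indented lines showing the span tree for one trace."""
    # Index the spans of this trace by the ID of their parent.
    children = defaultdict(list)
    for span in spans:
        if span["trace_id"] == trace_id:
            children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span['service']} ({span['duration_ms']} ms)")
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # the root span has no parent
    return lines


print("\n".join(render_trace(spans, "t1")))
```

The indented output mirrors the nested timeline that a tracing UI typically draws, with each child operation shown beneath the operation that triggered it.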

Now think of a popular online video game with millions of users, the epitome of a modern microservices-driven app. It must track each end user's location, each interaction with other players and the environment, every item the player acquires, end time, and a host of other in-game data. Keeping the game running smoothly would be unthinkable with traditional tracing methods. But distributed request tracing makes it possible.

 

What are the benefits of distributed tracing solutions?

The primary benefit of distributed tracing is its ability to bring coherence to distributed systems, leading to a host of other benefits. These include:

  • Increased productivity: The disjointed nature of microservice architectures makes performance monitoring functions (such as tracking down and fixing problems) time-consuming and expensive compared to monolithic applications. Additionally, the way failure data is delivered in microservices isn’t always clear and often requires developers to decipher issues from error messages and arcane status codes. Distributed tracing provides a more holistic view of distributed systems, reducing the time developers spend diagnosing and debugging request failures. Locating and fixing sources of errors also becomes more efficient.
  • Better cross-team collaboration: Each process in a microservice environment is developed by a specialized team for the technology used in that service, creating challenges when determining where an error occurred and who was responsible for correcting it. Distributed tracing helps eliminate these data silos and the productivity bottlenecks and other performance issues they create, while accelerating response time and enabling teams to work together more effectively.
  • Flexible implementation: Distributed tracing tools work with a wide variety of applications and programming languages, so developers can incorporate them into virtually any microservices system and view data through one tracing application.

What are the different types of tracing tools?

  • Code tracing: In code tracing, a programmer works through an application line by line, interpreting the result of each statement and recording its effect by hand rather than using a debugger, which automates the process. Manually tracing small blocks of code can be more efficient because the programmer doesn’t need to run the entire program to identify the effects of small edits.
  • Data tracing: Data tracing helps check the accuracy and data quality of critical data elements (CDEs), trace them back to their source systems, and monitor and manage them using statistical methods. Typically, the best way to perform accuracy checks is to trace operations to their origins and validate them with source data — although historically this hasn’t been cost-effective in large operational processes. Instead, statistical process control (SPC) can be used to prioritize, trace, monitor, and control CDEs.
  • Program trace (ptrace): A program trace is an index of the instructions executed and data referenced during the running of an application. The information displayed in a program trace includes the program name, language, and the source statement that was executed, among other data, and is used in the process of debugging an application.
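
As a small illustration of the statistical process control idea mentioned under data tracing, the sketch below flags measurements that fall outside three standard deviations of a baseline. The three-sigma rule is one common SPC convention and is an assumption here, as are the function names and the sample error rates.

```python
from statistics import mean, stdev


def control_limits(baseline):
    """Return (lower, upper) limits at three standard deviations from the mean."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - 3 * spread, center + 3 * spread


def out_of_control(baseline, new_values):
    """Return the new values that fall outside the control limits."""
    lower, upper = control_limits(baseline)
    return [v for v in new_values if v < lower or v > upper]


# Hypothetical daily error rates (%) for one critical data element.
baseline = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.0]
flagged = out_of_control(baseline, [1.1, 0.9, 2.5])
```

Only the flagged values would then need to be traced back to their source systems, which is where the cost savings over checking every record come from.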

What is centralized logging?

In this context, centralized logging refers to the aggregation of data from individual microservices in a central location for easier access and analysis.

One of the most tedious but critical jobs for developers is combing through an application’s log files to find errors that are causing or contributing to a problem. This can become particularly arduous in a microservices environment.

As mentioned earlier, traditional monitoring methods work well with monolithic applications because you are tracking a single codebase. It stands to reason that the same methods could be applied to a microservice architecture by treating each microservice as a small monolith and relying on its application and system log data to diagnose issues. The problem with this approach is that it only captures data for that individual service and lets you fix problems only with that particular process, hindering response time.

Centralized logging collects and aggregates logs from multiple services into a central location where they are indexed in a database. The log data can be searched, filtered, and grouped in the log management software by fields like status, host, severity, origin, and timestamp.
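
The search-and-filter behavior described above can be sketched with a toy in-memory store. Real log management software indexes records in a database; the `CentralLogStore` and `LogRecord` classes, hosts, and messages below are hypothetical and only illustrate querying by field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    timestamp: str   # ISO 8601 strings in the same zone sort chronologically
    host: str
    origin: str      # service that emitted the record
    severity: str
    message: str


class CentralLogStore:
    """Toy in-memory stand-in for a log management backend."""

    def __init__(self):
        self._records = []

    def ingest(self, record: LogRecord) -> None:
        self._records.append(record)

    def search(self, **filters):
        """Return records matching every given field, ordered by timestamp."""
        hits = [
            r for r in self._records
            if all(getattr(r, name) == value for name, value in filters.items())
        ]
        return sorted(hits, key=lambda r: r.timestamp)


store = CentralLogStore()
store.ingest(LogRecord("2024-01-01T10:00:05Z", "web-2", "cart", "ERROR", "timeout calling inventory"))
store.ingest(LogRecord("2024-01-01T10:00:01Z", "web-1", "auth", "INFO", "login ok"))
store.ingest(LogRecord("2024-01-01T10:00:03Z", "web-3", "inventory", "ERROR", "db connection refused"))

errors = store.search(severity="ERROR")
```

Sorting the matching records by timestamp puts the `inventory` failure ahead of the `cart` timeout, the kind of cross-service ordering that is hard to see when each log stays on its own server.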

Centralized logging has a number of advantages in a distributed system. Having all relevant logs in one place greatly reduces the amount of time and energy developers must spend hunting down the root cause of an application issue. Because it organizes logs into meaningful data rather than just text, it allows for more refined, sophisticated queries and also provides a clearer perspective of system performance as a whole.

 
What is distributed logging?

Distributed logging is the practice of keeping log files decentralized. There are a few reasons why this might be preferable to centralized logging.

For one, shipping logs across a network to a central location can consume a lot of bandwidth. Depending on your network and the number and frequency of logs being generated, log shipping could end up competing with more critical applications and processes. Some log storage systems also work more reliably when they are closer to the device generating the log files.

Distributed logging may also be preferred for large-scale systems. Applications with many microservices by nature generate a lot of log messages, making centralized logging more burdensome and less cost-effective.

 

How does microservices logging work?

Microservices logging is guided by a set of best practices that address the loosely coupled, modular nature of microservice architecture. The goal is to bring coherence to the system for more efficient and accurate troubleshooting and debugging.

Microservices logging usually incorporates the following practices:

  • Correlating requests: Each service in a microservice system interacts with the others to fulfill a request. Tagging the initial request with a unique ID allows you to easily track it through the system, identify potential errors and reveal whether they were caused by the previous service request or the next one. A developer can enter that unique ID into the log aggregator search engine to pull up the logs from all services for analysis.
  • Logging information: More log information means more context to help the user understand a problem. The name of the service generating the log message, correlation ID, the IP address of the server and the client making the request, and the date and time the message was sent and received are just a few of the data points you should consider including.
  • Structuring log data: One of the advantages of a microservice architecture is the ability to use different technology stacks. However, the resulting variety of log formats often creates significant challenges for analysis. Structuring the data in a standard format, such as JavaScript Object Notation (JSON), makes the logs easier to parse and allows you to search them by a variety of fields from a central location.
  • Centralizing logs: Having to access and correlate logs from individual servers drains valuable time and energy, and the cost climbs steeply as the number of microservices grows. Centralized logging solves this problem. Also, if a server or container is terminated without warning, its logs disappear with it. With centralized logging, logs are sent to a central repository every few minutes, reducing the risk of irreparable loss.
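
Several of these practices can be combined in a few lines using Python's standard `logging` module. The sketch below emits one JSON object per log line and attaches a correlation ID to each message. The `JsonFormatter` class, the field names, the service names, and the `req-123` ID are assumptions for illustration, and an in-memory stream stands in for whatever ships logs to a central store.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            # Set through the `extra` argument of the logging call, if present.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        })


def make_logger(service_name, stream):
    """Build a logger named after the service that writes JSON lines to stream."""
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()  # stands in for stdout or a log shipper
orders = make_logger("orders", stream)
billing = make_logger("billing", stream)

correlation_id = "req-123"  # hypothetical ID assigned to the initial request
orders.info("order received", extra={"correlation_id": correlation_id})
billing.error("charge failed", extra={"correlation_id": correlation_id})

# Every line parses as JSON, so the entries can be filtered by any field.
entries = [json.loads(line) for line in stream.getvalue().splitlines()]
```

Because both services write the same field names, a search for `correlation_id` equal to `req-123` in a log aggregator would return the `orders` and `billing` entries together.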
 
What are the open distributed tracing standards (OpenTracing, OpenCensus, OpenTelemetry)?

OpenTracing and OpenCensus were competing open source distributed tracing projects that have since merged into a single project called OpenTelemetry.

Hosted by the Cloud Native Computing Foundation (CNCF), OpenTracing aimed to provide a standardized API for tracing, enabling developers to embed instrumentation in commonly used libraries or their own custom code without vendor lock-in. Though this provided much-desired flexibility, the API’s sole focus on tracing made it of limited use on its own and led to inconsistent implementations by developers and vendors.

OpenCensus was developed at Google and was based on its internal tracing platform. Once it was open sourced, Microsoft, along with other vendors and contributors, began directing the standard. OpenCensus is a set of multi-language libraries that collects metrics about application behavior, transferring that data to any backend analysis platform of the developer’s choosing. It can also trace messages, requests, and services from their source to their destinations. With no API available to embed OpenCensus into code, developers used community-built automatic instrumentation agents for the task.

OpenTelemetry, which is managed by CNCF, merges the code bases of OpenTracing and OpenCensus, relying on the strengths of each. In beta at the time of writing, OpenTelemetry offers “a single set of APIs, libraries, agents, and collector services” for capturing distributed traces and metrics from an application that can be analyzed using popular observability tools. Logging capability is planned as an addition to its data capture support.

 
What is Jaeger or Zipkin tracing?

Jaeger and Zipkin are two popular open-source request tracing tools, each with similar components: a collector, datastore, query API, and web user interface. Outgoing requests are traced along with the application. The collector then records and correlates the data between different traces and sends it to a database where it can be queried and analyzed through the UI.

Jaeger and Zipkin are differentiated by their architecture and programming language support: Jaeger is implemented in Go and Zipkin in Java. Zipkin supports virtually every programming language, with dedicated libraries for Java, JavaScript, C, C++, C#, Python, Go, Scala, and others. Jaeger’s supported-language list is shorter: C#, Java, Node.js, Python, and Go.

 
What is AWS X-Ray?

AWS X-Ray is the native distributed tracing tool for Amazon Web Services (AWS). As the world’s largest cloud service provider, Amazon was at the forefront of the movement from monolithic to microservice-driven applications, and as such, developed its own tracing tool.

As with similar tools, AWS X-Ray traces user requests through an application, collecting data that can help find the cause of latency issues, errors, and other problems. This trace data is formatted into a service map that developers can parse to locate and identify problems.

Naturally, AWS X-Ray works well with other Amazon services such as AWS Lambda, Amazon EC2 (Elastic Compute Cloud), Amazon Elastic Container Service (Amazon ECS), and AWS Elastic Beanstalk. It can be used while an app is being built and tested, as well as to service the app once it’s in production.

 
What is a log in Kafka?

Kafka is a distributed streaming platform, providing a high-throughput, low-latency platform for handling real-time data feeds, often used in microservice architectures. It’s used to process streams of records in real time, publish and subscribe to those record streams in a manner similar to a message queue, and store them in a “fault-tolerant durable way.”

Kafka uses “topics” — a category or feed name to which records are published — to abstract streams of records. For each topic, Kafka maintains a partitioned log, an ordered, continually appended sequence of records that can serve as an external commit log for a distributed system.
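
The idea of a partitioned, append-only log can be sketched with a toy model. This is not Kafka's client API or its actual partitioning algorithm: the `Topic` class, the CRC32-based key hashing, and the sample records are assumptions chosen to show how records with the same key stay in order within one partition.

```python
import zlib


class Topic:
    """Toy model of a topic backed by append-only partitions (not Kafka's API)."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        """Append a record and return its (partition, offset)."""
        # Records with the same key always hash to the same partition,
        # so their relative order is preserved.
        index = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index, len(self.partitions[index]) - 1

    def read(self, partition, offset=0):
        """Return records from a partition starting at the given offset."""
        return self.partitions[partition][offset:]


topic = Topic("page-views")
p1, o1 = topic.publish("user-42", "viewed /home")
p2, o2 = topic.publish("user-42", "viewed /cart")
```

A consumer that remembers the last offset it processed can call `read` with that offset to resume where it left off, which is the property that lets such a log act as a commit log for other systems.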

 

What are the best log aggregation & monitoring tools?

While there are several good log aggregation and monitoring tools on the market today, these are some of the most popular.

Elastic (formerly ELK: Elasticsearch, Logstash, Kibana): One of the most popular stacks for distributed systems, Elastic combines three essential tools. Logstash aggregates log files, Elasticsearch lets you index and search through the data, and Kibana provides a data visualization dashboard. The tools are open source and free, and you can implement the entire stack or use them individually.

Loggly: This cloud-hosted log manager and analyzer was built by and for DevOps folks. It was designed to handle huge volumes of log data via an easy-to-navigate interface and is primarily used for troubleshooting and customer support. It also comes with a RESTful API, allowing it to be integrated into other tools.

Papertrail: Papertrail doesn’t aggregate logs but rather gives you an easy way to comb through the ones you’re already collecting. It’s easy to install and has a clean interface that gives you a consolidated view of data from the browser, command line, or an API.

Graylog: Another open source log analyzer, Graylog was created expressly to help developers find and fix errors in their applications. It has a simple UI that’s built for speed, and it can manage a wide range of data formats.

 

The bottom line

Distributed tracing is essential for distributed apps

The advantages of microservices for building cloud-based applications are well documented and adoption shows no signs of slowing. As these systems grow more complex, distributed request tracing offers a huge advantage over the older, needle-in-a-haystack approach to tracking down the problems that could disrupt your services. If you’re responsible for a microservice-based system, equipping your enterprise with this powerful tool will transform how you do your job.