How does distributed tracing work?
To quickly grasp how distributed tracing works, it’s best to look at how it handles a single request. Tracing starts the moment an end user interacts with an application. When the user sends an initial request — an HTTP request, to use a common example — it is assigned a unique trace ID. As the request moves through the host system, every operation performed on it (called a “span” or a “child span”) is tagged with that first request’s trace ID, as well as its own unique ID, plus the ID of the operation that originally generated the current request (called the “parent span”).
Each span is a single step on the request’s journey and is encoded with important data relating to the microservice process that is performing that operation. These include:
- The service name and address of the process handling the request.
- Logs and events that provide context about the process’s activity.
- Tags to query and filter requests by session ID, database host, HTTP method, and other identifiers.
- Detailed stack traces and error messages in the event of a failure.
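The trace/span relationships described above can be sketched as a simple data structure. This is an illustrative model only; the field names are not any particular tracer's wire format.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

def new_id() -> str:
    """Generate a short random ID (illustrative; real tracers have their own formats)."""
    return uuid.uuid4().hex[:16]

@dataclass
class Span:
    operation: str
    trace_id: str                       # shared by every span in the request
    span_id: str = field(default_factory=new_id)
    parent_id: Optional[str] = None     # None for the root span
    tags: dict = field(default_factory=dict)

# Root span created when the initial HTTP request arrives.
root = Span("GET /checkout", trace_id=new_id())

# A downstream operation becomes a child span: same trace ID,
# its own span ID, plus the parent's span ID.
child = Span("db.query", trace_id=root.trace_id, parent_id=root.span_id,
             tags={"db.host": "orders-db"})
```

Because every span carries the same trace ID, a tracing backend can reassemble the full request tree from spans reported by many different services.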
A distributed tracing tool like Zipkin or Jaeger (both of which we will explore in more detail in a bit) can correlate the data from all the spans and format them into visualizations that are available on request through a web interface.
Now think of a popular online video game with millions of users, the epitome of a modern microservices-driven app. It must track each end user's location, each interaction with other players and the environment, every item the player acquires, and a host of other in-game data. Keeping the game running smoothly would be unthinkable with traditional tracing methods. But distributed request tracing makes it possible.
What are the benefits of distributed tracing solutions?
The primary benefit of distributed tracing is its ability to bring coherence to distributed systems, leading to a host of other benefits. These include:
- Increased productivity: The disjointed nature of microservice architectures makes performance monitoring functions — such as tracking down and fixing problems — time-consuming and expensive compared to monolithic applications. Additionally, the way failure data is delivered in microservices isn't always clear and often requires developers to decipher issues from error messages and arcane status codes. Distributed tracing provides a more holistic view of distributed systems, reducing the time developers spend diagnosing and debugging request failures. Locating and fixing sources of errors also becomes more efficient.
- Better cross-team collaboration: Each process in a microservice environment is developed by a specialized team for the technology used in that service, creating challenges when determining where an error occurred and who was responsible for correcting it. Distributed tracing helps eliminate these data silos and the productivity bottlenecks and other performance issues they create, while accelerating response time and enabling teams to work together more effectively.
- Flexible implementation: Distributed tracing tools work with a wide variety of applications and programming languages, so developers can incorporate them into virtually any microservices system and view data through one tracing application.
What are the different types of tracing tools?
- Code tracing: Code tracing refers to a programmer manually interpreting the results of each line of code in an application and recording its effects by hand, rather than using a debugger (which automates the process) to trace the program's execution. Manually tracing small blocks of code can be more efficient because the programmer doesn't need to run the entire program to identify the effects of small edits.
- Data tracing: Data tracing helps check the accuracy and data quality of critical data elements (CDEs), trace them back to their source systems, and monitor and manage them using statistical methods. Typically, the best way to perform accuracy checks is to trace operations to their origins and validate them with source data — although historically this hasn’t been cost-effective in large operational processes. Instead, statistical process control (SPC) can be used to prioritize, trace, monitor, and control CDEs.
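As a toy illustration of the SPC approach mentioned above, the sketch below flags values of a critical data element that fall outside three-sigma control limits. The numbers and thresholds are invented for the example.

```python
import statistics

def control_limits(history, sigmas=3.0):
    """Compute SPC-style control limits (mean +/- k * stdev) from historical values."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return mean - sigmas * stdev, mean + sigmas * stdev

def out_of_control(history, new_values, sigmas=3.0):
    """Return the new values that fall outside the control limits."""
    lo, hi = control_limits(history, sigmas)
    return [v for v in new_values if v < lo or v > hi]

# Daily totals for a hypothetical critical data element.
history = [100, 102, 99, 101, 100, 98, 103, 100, 101, 99]

# 250 is far outside the historical band and gets flagged for tracing
# back to its source system; 97 and 100 are within normal variation.
suspects = out_of_control(history, [100, 250, 97])
```

In practice the control limits would be maintained per CDE, and flagged values would trigger the trace-to-source validation the text describes.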
- Program trace (ptrace): A program trace is an index of the instructions executed and data referenced during the running of an application. The information displayed in a program trace includes the program name, language, and the source statement that was executed, among other data, and is used in the process of debugging an application.
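Python's standard library can produce a rudimentary program trace. The sketch below records the function name and line number of each executed line in a small function, a simplified version of the instruction-and-data index a full program trace reports.

```python
import sys

executed = []  # (function name, line number) for each line executed

def tracer(frame, event, arg):
    """Trace hook: record every 'line' event; return itself to keep tracing."""
    if event == "line":
        executed.append((frame.f_code.co_name, frame.f_lineno))
    return tracer

def demo(n):
    total = 0
    for i in range(n):
        total += i
    return total

sys.settrace(tracer)   # start tracing frames created from here on
result = demo(3)
sys.settrace(None)     # stop tracing
```

After running, `executed` holds one entry per line of `demo` that ran, including each loop iteration, which is the kind of record a debugger builds automatically.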
In this context, centralized logging refers to the aggregation of data from individual microservices in a central location for easier access and analysis.
One of the most tedious but critical jobs for developers is combing through an application’s log files to find errors that are causing or contributing to a problem. This can become particularly arduous in a microservices environment.
As mentioned earlier, traditional monitoring methods work well with monolithic applications because you are tracking a single codebase. It stands to reason that the same methods could be applied to a microservice architecture by treating each microservice as a small monolith and relying on its application and system log data to diagnose issues. The problem with this approach is that it only captures data for that individual service and lets you fix problems only with that particular process, hindering response time.
Centralized logging collects and aggregates logs from multiple services into a central location where they are indexed in a database. The log data can be searched, filtered, and grouped in the log management software by fields like status, host, severity, origin, and timestamp.
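Conceptually, a centralized log store indexes structured entries and answers field-based queries. The sketch below is a minimal in-memory version of that idea, not any particular log manager's API; the field names are illustrative.

```python
from datetime import datetime, timezone

class LogStore:
    """Toy centralized log index: ingest structured entries, filter by field."""
    def __init__(self):
        self.entries = []

    def ingest(self, **fields):
        # Stamp a timestamp if the shipper didn't provide one.
        fields.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
        self.entries.append(fields)

    def search(self, **criteria):
        """Return entries whose fields match all the given criteria."""
        return [e for e in self.entries
                if all(e.get(k) == v for k, v in criteria.items())]

store = LogStore()
store.ingest(service="checkout", host="web-1", severity="ERROR",
             message="payment gateway timeout")
store.ingest(service="inventory", host="web-2", severity="INFO",
             message="stock updated")

errors = store.search(severity="ERROR")
```

Real log management software adds full-text search, retention, and dashboards on top, but the core operation is this kind of filter over indexed fields like status, host, severity, origin, and timestamp.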
Centralized logging has a number of advantages in a distributed system. Having all relevant logs in one place greatly reduces the amount of time and energy developers must spend hunting down the root cause of an application issue. Because it organizes logs into meaningful data rather than just text, it allows for more refined, sophisticated queries and also provides a clearer perspective of system performance as a whole.
Distributed logging is the practice of keeping log files decentralized. There are a few reasons why this might be preferable to centralized logging.
For one, shipping logs across a network to a central location can consume a lot of bandwidth. Depending on your network and the number and frequency of logs being generated, that could cause centralizing logs to compete with more critical applications and processes. Some log storage systems also work more reliably when they are closer to the device generating the log files.
Distributed logging may also be preferred for large-scale systems. Applications with many microservices by nature generate a lot of log messages, making centralized logging more burdensome and less cost effective.
How does microservices logging work?
Microservices logging is guided by a set of best practices that address the loosely coupled, modular nature of microservice architecture. The goal is to bring coherence to the system for more efficient and accurate troubleshooting and debugging.
Microservices logging usually incorporates the following practices:
- Correlating requests: Each service in a microservice system interacts with the others to fulfill a request. Tagging the initial request with a unique ID allows you to easily track it through the system, identify potential errors and reveal whether they were caused by the previous service request or the next one. A developer can enter that unique ID into the log aggregator search engine to pull up the logs from all services for analysis.
- Logging information: More log information means more context to help the user understand a problem. The name of the service generating the log message, correlation ID, the IP address of the server and the client making the request, and the date and time the message was sent and received are just a few of the data points you should consider including.
- Centralizing logs: Having to access and correlate logs from individual servers drains valuable time and energy, a burden that grows as the number of microservices grows. Centralized logging solves this problem. Also, if a server or container is terminated without warning, its logs disappear with it. With centralized logging, logs are sent to a central repository every few minutes, reducing the risk of irreparable loss.
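The request-correlation practice above can be sketched with Python's standard `logging` and `contextvars` modules: a filter stamps every log record with the ID assigned to the initial request, so an aggregator can later pull up all lines for that ID. The ID format and field name are illustrative.

```python
import contextvars
import io
import logging

# Holds the current request's correlation ID for whatever code handles it.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Stamp every log record with the active correlation ID."""
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

# Write to a string buffer here; a real service would ship to the aggregator.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(correlation_id)s %(name)s %(message)s"))
handler.addFilter(CorrelationFilter())

log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Simulate two steps of handling the same request: both log lines
# carry the ID tagged onto the initial request.
correlation_id.set("req-42")
log.info("received order")
log.info("reserved inventory")

output = buffer.getvalue()
```

Searching the aggregated logs for `req-42` would then return every line any service emitted while handling that request.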
OpenTracing and OpenCensus competed as open source distributed tracing projects before recently being merged into a single tool called OpenTelemetry.
Hosted by the Cloud Native Computing Foundation (CNCF), OpenTracing provides a standardized API for tracing, enabling developers to embed instrumentation in commonly used libraries or their own custom code without vendor lock-in. Though this provided much-desired flexibility, the API's sole focus on tracing made it of limited use on its own and led to inconsistent implementations by developers and vendors.
OpenCensus was developed at Google and was based on its internal tracing platform. Once it was open sourced, Microsoft, along with other vendors and contributors, began directing the standard. OpenCensus is a set of multi-language libraries that collects metrics about application behavior, transferring that data to any backend analysis platform of the developer’s choosing. It can also trace messages, requests, and services from their source to their destinations. With no API available to embed OpenCensus into code, developers used community-built automatic instrumentation agents for the task.
OpenTelemetry, which is managed by the CNCF, merges the code bases of OpenTracing and OpenCensus, relying on the strengths of each. Currently in beta, OpenTelemetry offers “a single set of APIs, libraries, agents, and collector services” for capturing distributed traces and metrics from an application, which can then be analyzed using popular observability tools. In the near future, OpenTelemetry will add logging to its data capture support.
Jaeger and Zipkin are two popular open-source request tracing tools, each with similar components: a collector, datastore, query API, and web user interface. The application is instrumented to report trace data as requests flow through it. The collector then records and correlates the data between different traces and sends it to a database where it can be queried and analyzed through the UI.
AWS X-Ray is the native distributed tracing tool for Amazon Web Services (AWS). As the world’s largest cloud service provider, Amazon was at the forefront of the movement from monolithic to microservice-driven applications, and as such, developed its own tracing tool.
As with similar tools, AWS X-Ray traces user requests through an application, collecting data that can help find the cause of latency issues, errors, and other problems. This trace data is formatted into a service map that developers can parse to locate and identify problems.
Naturally, AWS X-Ray works well with other Amazon services such as AWS Lambda, Amazon EC2 (Elastic Compute Cloud), Amazon EC2 Container Service (Amazon ECS), and AWS Elastic Beanstalk. It can be used during an app's build and testing stages, as well as to monitor the app once it's in production.
Kafka is a distributed streaming platform, providing a high-throughput, low-latency platform for handling real-time data feeds, often used in microservice architectures. It’s used to process streams of records in real time, publish and subscribe to those record streams in a manner similar to a message queue, and store them in a “fault-tolerant durable way.”
Kafka uses “topics” — a category or feed name to which records are published — to abstract streams of records. For each topic, Kafka maintains a partitioned log, an ordered, continually appended sequence of records that can serve as an external commit log for a distributed system.
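A topic's partitioned log can be mimicked in a few lines. The sketch below is a toy in-memory model of append-only partitions and per-partition offsets, not the Kafka client API; the partition-by-key-hash rule mirrors Kafka's default behavior for keyed records.

```python
class Topic:
    """Toy model of a Kafka topic: records are appended to a partition
    chosen by key hash, and each record gets a monotonically increasing
    offset within its partition."""
    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        partition = hash(key) % len(self.partitions)
        log = self.partitions[partition]
        offset = len(log)              # next position in the append-only log
        log.append((offset, key, value))
        return partition, offset

    def read(self, partition, start_offset=0):
        """Replay a partition's records from a given offset onward."""
        return self.partitions[partition][start_offset:]

topic = Topic("orders")
p, _ = topic.publish("user-1", "order placed")
topic.publish("user-1", "order shipped")   # same key -> same partition, in order
records = topic.read(p)
```

Because records with the same key land in the same partition and offsets only grow, a consumer replaying the partition sees that key's events in publication order, which is what lets the log double as a commit log for a distributed system.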
What are the best log aggregation & monitoring tools?
While there are several good log aggregation and monitoring tools on the market today, these are some of the most popular.
Elastic (formerly ELK: ElasticSearch, Logstash, Kibana): One of the most popular stacks for distributed systems, Elastic combines three essential tools. Logstash aggregates log files, ElasticSearch lets you index and search through the data, and Kibana provides a data visualization dashboard. The tools are open source and free, and you can implement the entire stack or use them individually.
Loggly: This cloud-hosted log manager and analyzer was built by and for DevOps folks. It was designed to handle huge volumes of log data via an easy-to-navigate interface and is primarily used for troubleshooting and customer support. It also comes with a RESTful API, allowing it to be integrated into other tools.
PaperTrail: PaperTrail doesn’t aggregate logs but rather gives you an easy way to comb through the ones you’re already collecting. It’s easy to install and has a clean interface that gives you a consolidated view of your log data from the browser, command line, or an API.
Graylog: Another open source log analyzer, Graylog was created expressly to help developers find and fix errors in their applications. It has a simple UI that’s built for speed, and it can manage a wide range of data formats.