For IT Operations and Site Reliability Engineering (SRE) teams, logging is nothing new. In fact, collecting and analyzing logs is one of the oldest cornerstones of performance management. Logs have been part and parcel of APM workflows for decades.
Yet the logging strategies that worked in eras past often fall short today. That’s thanks to the advent of cloud-native computing, which has ushered in fundamental new challenges in the way teams aggregate, analyze, and manage logs.
Optimizing application reliability and performance in today’s world requires an overhaul of your team’s approach to logging. Not only do you need to manage logs on an unprecedented scale, but the way in which you interpret log data and pair it with other sources of observability must change, too, to keep pace with the unique demands of cloud-native environments.
This blog post explains how to handle this challenge and ensure that logging advances, rather than hinders, your team’s overall observability strategy. We’ll discuss:
- The special logging challenges that have arisen from cloud-native environments
- What makes modern logging different from conventional logging
- How best to approach log analytics and management for cloud-native apps
Monoliths and traditional logging
Traditionally, logging was one of the most straightforward parts of application performance management workflows, thanks to the relative simplicity of logs and the application environments that produced them.
When applications ran as monoliths and were deployed on bare-metal servers or virtual machines, there were relatively few logs to deal with. In most cases, teams just needed to collect and analyze application logs and operating system logs.
The fact that applications and operating systems typically wrote logs to reliable, persistent storage also simplified the process. Logs were stored on disks and were easy to retrieve from centralized locations, like /var/log. The “tail -f” command was a sysadmin’s best friend.
Logs were also comparatively simple in structure and format. Although different applications produced different types of log data, and sometimes structured it in different ways, it wasn’t usually difficult to transform log data, when necessary, in order to aggregate data from multiple sources into a consistent format. More savvy organizations even took time to think through semantic logging patterns and naming conventions so developers could make logs from their applications more valuable.
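As a rough illustration of that kind of transformation, the sketch below converts two differently shaped log lines into one common record. The two input formats, the field names (`time`, `level`, `msg`) and the `normalize` helper are assumptions made for this example, not formats that any particular application is known to emit.

```python
import json
import re
from datetime import datetime, timezone

# Hypothetical pattern for a classic "YYYY-MM-DD HH:MM:SS LEVEL message" line.
PLAIN_LINE = re.compile(r"^(?P<ts>\S+ \S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


def normalize(line: str, source: str) -> dict:
    """Convert a JSON or plain-text log line into one common record shape."""
    line = line.strip()
    if line.startswith("{"):
        # Assumed JSON layout: {"time": ISO 8601, "level": ..., "msg": ...}
        raw = json.loads(line)
        ts = datetime.fromisoformat(raw["time"])
        level, message = raw["level"], raw["msg"]
    else:
        match = PLAIN_LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized log line: {line!r}")
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
        level, message = match["level"], match["msg"]
    if ts.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return {
        "timestamp": ts.astimezone(timezone.utc).isoformat(),
        "level": level.upper(),
        "source": source,
        "message": message,
    }
```

Once every source passes through a function like this, records from different applications can be sorted, filtered and stored together.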
(Read our primer on log management.)
The challenges of cloud-native logging
In the cloud-native world, the nature of log management has changed drastically, for several reasons.
There are simply many more logs to manage. Instead of having to collect and analyze just one log per application and another for each server, teams must now aggregate and analyze data from dozens of distinct logs.
Indeed, in a cloud-native environment composed of dozens of microservices, hundreds of containers, and multiple layers of infrastructure, you could have hundreds or even thousands of logs to manage. Each microservice instance could produce its own log. Your orchestration tool, like Kubernetes, typically produces multiple logs for its various components. Each node in the cluster produces its own logs, too.
(Learn more about Kubernetes logging.)
Put simply, the sheer number of logs that teams must manage in a typical cloud-native environment has multiplied by a factor of ten, if not a hundred, compared to the days of monoliths. By extension, the cost of storing, indexing and processing logs has also increased.
Ephemeral log storage
In many cases, logs in cloud-native environments are stored by default in locations that are not persistent. Containers typically write logs to their internal file systems, and those logs will disappear permanently when each container shuts down unless you move them somewhere else first. Server logs stored on nodes that crash or go offline may also become unavailable unless they are aggregated to a more reliable storage location.
The ephemeral nature of log storage in cloud-native environments means that teams must continuously stream log data to persistent storage as it is written. Pulling logs every hour, or even every minute, is not enough to guarantee that all log data remains accessible when you need to analyze it, because a container can start, fail and disappear between two collection runs.
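To make the difference between periodic pulls and continuous collection concrete, here is a minimal sketch of a follower that forwards each line as soon as it is complete, in the spirit of “tail -f”. The `follow` and `ship` names, the polling interval and the `sink` callable are all hypothetical; a production agent would also handle file rotation, backpressure and retries.

```python
import time
from typing import Callable, Iterator, TextIO


def follow(
    stream: TextIO,
    poll_seconds: float = 0.2,
    should_stop: Callable[[], bool] = lambda: False,
) -> Iterator[str]:
    """Yield complete lines as they are appended to an open text stream."""
    buffer = ""
    while True:
        chunk = stream.readline()
        if chunk:
            buffer += chunk
            if buffer.endswith("\n"):
                # Only emit once the writer has finished the line.
                yield buffer.rstrip("\n")
                buffer = ""
            continue
        # Nothing new yet: stop if asked to, otherwise wait and poll again.
        if should_stop():
            return
        time.sleep(poll_seconds)


def ship(lines: Iterator[str], sink: Callable[[str], None]) -> int:
    """Forward every line to a sink (a stand-in for persistent storage)."""
    count = 0
    for line in lines:
        sink(line)
        count += 1
    return count
```

In use, `stream` would be an open log file and `sink` a function that writes to durable storage; with the default `should_stop`, the loop runs for as long as the process does.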
Diversity of logs
Cloud-native environments produce a dizzying array of different types of logs. Not only do applications and servers log data, but cloud services, APIs, orchestrators, and anything else running in your environment likely produces its own logs, too.
What’s more, these various components store logs in a variety of different locations. Instead of simply pulling data from a central location like /var/log, you have to collect logs from each individual container instance, node, API gateway and so on.
Proprietary log tooling
All of the major public clouds provide logging services. However, they are mostly proprietary solutions that work only within the specific platform that provides them.
If you rely solely on your cloud provider’s logging solutions to manage and analyze log data, you are at risk of being locked into that vendor’s platform. It’s also difficult to manage logs across a multi-cloud architecture, unless you are willing to juggle multiple cloud vendors’ logging tools.
Best practices for cloud-native logging
Faced with these challenges, IT and SRE teams must rethink their approach to logging. Cloud-native logging is so much more complex than traditional logging that simply scaling up existing logging workflows, or expecting cloud vendors’ log management tools to handle the complexity for you, doesn’t suffice.
To deliver effective insight into cloud-native environments, modern logging strategies must be founded on the following principles.
Logs should be only one component of your team’s overall observability strategy. Some types of data — such as information about why a specific request fails, or the real-time health status of your environment — can be better collected via other means.
That’s why data from distributed traces, metrics, and any other information you can collect about your cloud-native services and applications should be integrated and correlated with the insights you derive from logs, using tools designed for metric and trace analysis as well as log analysis. Only by contextualizing log data with other types of information in a single observability platform can you gain a full, end-to-end understanding of what is happening in your environment.
(Compare observability & monitoring.)
When you collect, transform and analyze log data, it’s best to rely on open standards and frameworks, or on vendor-specific tools that are based on those standards, rather than on tooling that takes a proprietary approach.
A logging strategy where you ingest, transform and analyze logs using open standards not only helps avoid lock-in, but also maximizes your ability to work with log data of most types and origins. Community-developed standards are more likely to support a broad set of log formats than are proprietary solutions that cater only to one vendor’s own ecosystem.
(See how OpenTelemetry is solving this issue.)
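As one illustration of what a standards-based record can look like, the sketch below reshapes a simple log record using top-level field names and severity numbers taken from my reading of the OpenTelemetry log data model. The input record layout and the `to_otel_shape` helper are assumptions for this example, and the result is a plain dictionary rather than output from an OpenTelemetry SDK, so check the specification before relying on the details.

```python
from datetime import datetime

# Lowest severity number of each range, as I understand the OpenTelemetry
# log data model (for example, INFO covers 9-12 and ERROR covers 17-20).
SEVERITY_NUMBERS = {"TRACE": 1, "DEBUG": 5, "INFO": 9, "WARN": 13, "ERROR": 17, "FATAL": 21}


def to_otel_shape(record: dict, service_name: str) -> dict:
    """Map a simple record onto OpenTelemetry-style log field names.

    Assumes record["timestamp"] is an ISO 8601 string with a UTC offset.
    """
    ts = datetime.fromisoformat(record["timestamp"])
    level = record["level"].upper()
    # Integer arithmetic avoids float rounding in the nanosecond value.
    nanos = int(ts.timestamp()) * 1_000_000_000 + ts.microsecond * 1_000
    return {
        "Timestamp": nanos,
        "SeverityText": level,
        "SeverityNumber": SEVERITY_NUMBERS.get(level, 0),  # 0 means unspecified
        "Body": record["message"],
        "Resource": {"service.name": service_name},
        "Attributes": dict(record.get("attributes", {})),
        "TraceId": record.get("trace_id"),
        "SpanId": record.get("span_id"),
    }
```

Keeping trace and span identifiers as first-class fields is what later lets a log line be matched to the request that produced it.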
You shouldn’t have to write complex queries or code to power custom visualizations every time you need to analyze a log. Instead, look for tools that offer no-code interfaces for exploring log data, and tools that make it easy to customize your analysis without having to write hundreds of lines of custom code.
As noted above, cloud-native environments produce a variety of different logs, each stored in a different place. It’s critical to break these logs out of their individual silos and analyze them collectively.
In other words, don’t analyze container logs separately from node logs and orchestrator logs, for example. You should integrate and correlate all of that data into a complete observability platform in order to gain full visibility. You may need data from one type of log to interpret an event that is recorded in a different type of log.
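A small sketch of what breaking logs out of their silos can mean in practice: the function below merges several already-sorted log streams into a single timeline, tagging each record with where it came from. The source names and record fields are hypothetical, and the approach assumes every stream uses the same sortable UTC timestamp format.

```python
import heapq
from typing import Dict, Iterable, Iterator


def _tag(name: str, records: Iterable[dict]) -> Iterator[dict]:
    """Attach the originating source name to each record."""
    for record in records:
        yield {**record, "source": name}


def merged_timeline(sources: Dict[str, Iterable[dict]]) -> Iterator[dict]:
    """Lazily merge per-source streams, each sorted by timestamp, into one."""
    tagged = [_tag(name, records) for name, records in sources.items()]
    # heapq.merge keeps only one pending record per stream in memory.
    return heapq.merge(*tagged, key=lambda record: record["timestamp"])
```

Read in one sequence, a node-level event such as memory pressure can sit directly next to the container restart it caused, which is hard to see when each log is inspected separately.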
Contextualize and shape log data
Again, log data on its own provides limited visibility into applications and environments. In order to use logs to maximum effect, teams should contextualize logs with other data — metrics, traces and any other sources of visibility that their environments expose — within a complete observability platform.
Contextualization helps to correlate data and pinpoint the source of problems more quickly. It also unlocks richer opportunities to filter and search large volumes of logs by using insights from other data sources to determine which queries to run and which fields to parse.
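For instance, trace data can decide which logs are worth reading at all. The sketch below picks out slow traces and returns only the log messages that share their trace IDs. The `trace_id`, `duration_ms` and `message` field names, along with the threshold, are assumptions for illustration rather than the schema of any particular platform.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def logs_for_slow_traces(
    spans: Iterable[dict], logs: Iterable[dict], threshold_ms: float
) -> Dict[str, List[str]]:
    """Group log messages by trace ID, keeping only traces slower than the threshold."""
    slow_ids = {span["trace_id"] for span in spans if span["duration_ms"] > threshold_ms}
    grouped: Dict[str, List[str]] = defaultdict(list)
    for record in logs:
        trace_id = record.get("trace_id")  # logs without trace context are skipped
        if trace_id in slow_ids:
            grouped[trace_id].append(record["message"])
    return dict(grouped)
```

The same idea works in the other direction: a spike in a metric can set the time window, and the trace IDs seen in that window can set the log filter.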
Use logs broadly
Don’t treat logs only as a way to audit your environments, or only as a debugging resource for developers. Use them to:
- Troubleshoot and optimize application performance
- Manage capacity planning
- Enhance security
- Help developers understand what is happening in production
- Correlate transaction payload data with operational data
In order to leverage logs for a variety of purposes, you must be able to work with logs in multiple ways. Not every log will need to be stored, transformed or indexed in the same manner. Your approach to handling each log will vary depending on your goals and use cases, and your tooling should reflect that.
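One way to picture handling logs differently by purpose is a small policy table that decides, per kind of log, whether to index it and how long to keep it. The kinds, retention periods and `policy_for` helper below are purely illustrative assumptions; real values depend on your compliance, cost and troubleshooting needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    index: bool          # make the log searchable right away
    retention_days: int  # how long to keep it


# Hypothetical policies: numbers are placeholders, not recommendations.
POLICIES = {
    "audit": Policy(index=True, retention_days=365),
    "application": Policy(index=True, retention_days=30),
    "debug": Policy(index=False, retention_days=3),
}
DEFAULT_POLICY = Policy(index=True, retention_days=14)


def policy_for(record: dict) -> Policy:
    """Pick a handling policy from the record's (assumed) 'kind' field."""
    return POLICIES.get(record.get("kind"), DEFAULT_POLICY)
```

Keeping the decision in data rather than scattered through pipeline code makes it easier to adjust as use cases change.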
Splunk solves observability for all
Application environments have changed, and so has the nature of the logs they produce. Today’s teams must manage and analyze logs on a scale and level of complexity that is unlike anything they experienced in the past. To thrive in the face of this challenge, businesses need flexible, standards-based logging tools that integrate log data of all types seamlessly into an end-to-end observability platform, where teams can maximize the value that logs provide — which is precisely what Splunk Log Observer offers.
This posting does not necessarily represent Splunk's position, strategies or opinion.