As we at Splunk accelerate our cloud journey, we’re often faced with the decision of when to use logs vs metrics — a decision many in IT face. On the surface, one can do a lot by just observing logs and events. In fact, in the early days of Splunk Cloud, this is exactly how we observed everything. As we continue to grow, however, we find ourselves using a combination of both.
This post lays out the main differences between logs and metrics and when each is best used. We hope that this analysis will help you create a better observability strategy for your own organization.
Almost all programs emit activities that occur within their program flow in the form of a log. These logs are generally files that can be:
Either way, the logs are written to a file that can then be consumed by a logs search engine. The search engine collects all the logs and then presents results of various searches to the user.
Logs are emitted from almost every program. A good logs search engine is able to handle any type of log. That makes logs the easiest and quickest data source to get visibility into the state of your system.
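As a rough illustration of that flow, here is a minimal sketch in Python: a program writes log lines to a file, and a very naive "search" reads them back. The logger name `checkout_demo`, the message format, and the `search` helper are all hypothetical and exist only for this example; a real logs search engine indexes data rather than scanning files line by line.

```python
import logging
import os
import tempfile

# Write logs to a temporary file (the path is an assumption for this sketch).
log_path = os.path.join(tempfile.mkdtemp(), "app.log")

logger = logging.getLogger("checkout_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output out of the root logger

handler = logging.FileHandler(log_path, encoding="utf-8")
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

# The program emits activities from its flow as log lines.
logger.info("order placed order_id=1001")
logger.warning("payment retry order_id=1002")
logger.info("order placed order_id=1003")

# Flush and release the file before reading it back.
logger.removeHandler(handler)
handler.close()


def search(path, term):
    """Naive stand-in for a logs search: return every line containing term."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if term in line]


matches = search(log_path, "order placed")
for line in matches:
    print(line)
```

Nothing about the log lines had to be decided in advance beyond writing them, which is why logs are such a quick way to get visibility.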
Within a mature observability strategy, logs are essential for unplanned research and unique situations. They are great for security use cases because many of these involve the unexpected or single event situations. For example, Splunk’s security organization utilizes logs to quickly detect and remediate significant vulnerabilities, including the log4j vulnerability disclosed in late-2021.
Logs are also great for iterative software delivery because they allow developers to establish patterns for new behaviors or functionality in production, which accelerates delivering value to customers.
(Read our log management introduction.)
It might be tempting to think that logs can solve every use case. As the amount of data grows, however, a logs-only solution will become costly and relatively slow for a small set of regular searches, usually connected to alerts. This is because the process by which logs must be categorized and batched takes much more time and is much more computationally intensive than the metrics process, which we will cover next.
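To make the cost difference concrete, here is a small sketch that answers the same question ("how many server errors occurred?") in two ways. The log line format, the `status=` field, and the helper names are assumptions made up for this example, not the format of any particular product.

```python
import re

# Hypothetical log lines; the format is an assumption for this sketch.
LOG_LINES = [
    "2024-01-01T00:00:01Z INFO request path=/home status=200",
    "2024-01-01T00:00:02Z ERROR request path=/cart status=500",
    "2024-01-01T00:00:03Z INFO request path=/home status=200",
    "2024-01-01T00:00:04Z ERROR request path=/pay status=503",
]

STATUS = re.compile(r"status=(\d{3})")


def count_errors_from_logs(lines):
    """Logs approach: scan and parse every line each time the question is asked."""
    errors = 0
    for line in lines:
        match = STATUS.search(line)
        if match and int(match.group(1)) >= 500:
            errors += 1
    return errors


class ErrorCounter:
    """Metrics approach: the program updates one number as each event happens."""

    def __init__(self):
        self.value = 0

    def observe(self, status):
        if status >= 500:
            self.value += 1


server_errors = ErrorCounter()
for status in (200, 500, 200, 503):  # the same four requests as above
    server_errors.observe(status)

print(count_errors_from_logs(LOG_LINES))  # work grows with the volume of logs
print(server_errors.value)                # a single read, however many requests
```

Both approaches arrive at the same count, but the log scan repeats its parsing work on every search, while the counter is read in constant time. For a regular search tied to an alert, that difference adds up quickly.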
A metric is a number, usually in the form of a counter or a gauge, that developers decide is important to the observability of their system. Most software programs have always emitted logs from the start, but only in the last decade have they also begun emitting metrics early in their development.
We are used to metrics in our everyday lives. We see them in our car: the speedometer (a gauge) and the odometer tracking how many miles the car has driven (a counter). The makers of the car decided that it was important for the driver to be aware of this information while driving.
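The car analogy can be sketched in a few lines of Python. The `Counter` and `Gauge` classes below are hypothetical, hand-rolled for illustration rather than taken from any metrics library; the point is only the difference in behavior between the two metric types.

```python
class Counter:
    """A value that only ever increases, like an odometer."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only go up")
        self.value += amount


class Gauge:
    """A value that can move up or down, like a speedometer."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


odometer_miles = Counter()
speed_mph = Gauge()

# Simulate three one-hour legs of a trip at different speeds.
for mph in (30.0, 65.0, 0.0):
    speed_mph.set(mph)             # the gauge reflects only the current state
    odometer_miles.inc(mph * 1.0)  # the counter accumulates miles driven

print(speed_mph.value)
print(odometer_miles.value)
```

After the trip the gauge reads zero because the car is parked, while the counter still holds the total distance covered.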
For developers, often the biggest challenge to incorporating metrics is twofold:
When done correctly, metrics are essential for planned scenarios and events. They deliver regular evaluations cheaply, quickly and reliably. This is because, unlike logs, metrics have a predictable structure and can therefore be saved into a time-series database, which is tuned for exactly this kind of data. Operators can then quickly see where to start when investigating a degraded state in their systems.
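A minimal sketch of why that predictable structure helps: every data point has the same shape (metric name, timestamp, value), so points can be kept in time order and a time range can be found with a binary search instead of a scan. The `TinyTimeSeriesStore` class and the `cpu_percent` metric are hypothetical and far simpler than a real time-series database.

```python
from bisect import bisect_left, bisect_right


class TinyTimeSeriesStore:
    """Toy in-memory store; assumes points for a metric arrive in time order."""

    def __init__(self):
        self._timestamps = {}  # metric name -> sorted list of timestamps
        self._values = {}      # metric name -> values aligned with timestamps

    def append(self, name, timestamp, value):
        self._timestamps.setdefault(name, []).append(timestamp)
        self._values.setdefault(name, []).append(value)

    def range(self, name, start, end):
        """Return values with start <= timestamp <= end via binary search."""
        timestamps = self._timestamps.get(name, [])
        lo = bisect_left(timestamps, start)
        hi = bisect_right(timestamps, end)
        return self._values.get(name, [])[lo:hi]


store = TinyTimeSeriesStore()
# One reading per minute; timestamps are seconds since the start of the demo.
for ts, cpu in ((0, 10.0), (60, 20.0), (120, 90.0), (180, 30.0)):
    store.append("cpu_percent", ts, cpu)

window = store.range("cpu_percent", 60, 120)
print(window)
print(max(window) > 80.0)  # the kind of regular check an alert might run
```

Because nothing needs to be parsed at query time, a check like this stays cheap even when it runs every few seconds.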
However, organizations must remember that reliability does not come from identifying every alert and degraded state with metrics alone. With modern, iterative software delivery, teams must be able to debug and investigate unplanned states and rapidly incorporate those observations into the product lifecycle. This is why metrics play a critical role, though not the only one, in delivering observable, reliable IT services.
Metrics show the big picture of what is happening. If I’m driving the car, I can see the temperature of the engine and whether the coolant warning light is on (the metrics). However, if the car starts behaving abnormally, a mechanic might need to ask some unpredictable questions and see the actual event log of the car itself.
This is where logs and metrics differ, and we can summarize as follows:
(Read more about machine data.)
So, when we’re asked to weigh in on the logs vs metrics debate, we say both! Logs and metrics together create a complementary observability foundation, upon which we operate the Splunk Cloud Platform. Once we establish that foundation, teams will also want to connect parts of their system together with a tracing solution.
With these three elements, we have everything essential to view and connect the system in both predictable and unpredictable ways. Splunk products provide a world-class implementation of these capabilities, which we ourselves use every day. We hope that our experience will help you on your journey to solve problems more easily and quickly with data.