A new year is a chance for a fresh start, and it’s a great opportunity to rethink the monitoring and observability platform you’re using for your applications. If you’ve been using a legacy monitoring system, you’ve probably heard about observability all over the internet and want to figure out whether it’s really something you need to care about.
In this post, I’ll briefly explain what observability is, what a system needs to actually provide you with true observability, and how you can start the observability journey yourself.
Observability is a mindset that lets you answer questions about your business — from the user’s experience, through the application itself, and beyond to the business metrics and processes that the application enables. It’s an evolution of monitoring that greatly expands the volume of ingested data and radically expands the number and type of questions you can answer. It’s not just “metrics, traces, and logs” – observability is really about instrumenting everything and using this data to make better decisions. I wrote more about this in a different post, Observability: It's Not What You Think, that I’d encourage you to check out for an observability deep-dive.
Before I came to work at Splunk, I was an SRE (well, a systems admin at one of my jobs, but I’m old). I know first-hand how important enterprise-grade observability is: I solved plenty of problems in the past that I wish I could have investigated with an observability system like the one we sell at Splunk. In the rest of this post, I’ll discuss five things that an observability system must do to be worth your investment, with examples from my experience in operations that show why each one is critical.
What Differentiates One Observability Product from Another?
Every vendor will tell you that by buying their product and installing it you instantly ‘get’ observability, and in every case, including buying the product from us, this isn’t true. What you get out of the box varies a lot, however. When you’re thinking about what an observability solution will get you, you need to think of a few things that aren’t necessarily going to be published on the website or discussed in reviews. In the next section, I’ll discuss what I’ve found to be the five key tenets for an observability system. These apply to any system – commercial or homegrown – and make a real difference in how you can get value from an observability migration.
The Five Key Tenets of Observability
When evaluating an observability system, look for these five key tenets: full-stack, end-to-end visibility; real-time answers; analytics-powered insight; enterprise-grade scale and features; and open standards. Let’s dive into each of these in more detail:
Full Stack and End-to-End
Adopting an observability platform that can’t give you 100% visibility into all your transactions, from the user’s browser, through your application, to the underlying business platform, is setting yourself up to miss something critical. This includes support for things like real user monitoring (RUM) to capture behavior in the user’s browser, and it also means avoiding sampling: read this post to learn why sampling is an antipattern in observability. In addition to the user’s experience, you’ll also need insight into backend performance, including things like database query performance or code profiling.
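To put a rough number on the sampling concern, here is a minimal sketch. It assumes uniform, independent random sampling of traces, which is a simplification, and the function name `retention_probability` is made up for this illustration. It computes the chance that at least one trace of the transaction you care about is still there when you go looking for it.

```python
def retention_probability(sample_rate: float, matching_traces: int = 1) -> float:
    """Chance that at least one of `matching_traces` traces survives sampling.

    Assumes each trace is kept independently with probability `sample_rate`
    (a simplifying assumption; real samplers can be smarter than this).
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if matching_traces < 0:
        raise ValueError("matching_traces must not be negative")
    # P(at least one kept) = 1 - P(all dropped)
    return 1.0 - (1.0 - sample_rate) ** matching_traces


if __name__ == "__main__":
    # One failing request from one important user, at different sample rates.
    for rate in (0.01, 0.10, 1.0):
        chance = retention_probability(rate, matching_traces=1)
        print(f"sample rate {rate:.0%}: {chance:.0%} chance the trace was kept")
```

Under these assumptions, a single failing request is kept only as often as the sample rate itself, so at 1% sampling the one trace behind a bug report is missing 99 times out of 100.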
I can’t count the number of issues I had to troubleshoot at LinkedIn brought on by someone important firing off a bug report to the sre@ email list – at that point, you simply have to figure out what happened and fix it. If our tools at LinkedIn hadn’t been able to see the end-to-end history for all our users, I may not have been able to fix those issues at all, or it would have taken much longer than necessary.
Real-Time Answers
A good observability platform must give you insights and data in real time. If you have to wait for a periodic alert rollup to find out about a problem, you’re likely to hear about it first from an angry tweet or an unhappy customer. Additionally, in a serverless world, the lifetime of a function can be in the hundreds of milliseconds (or less), so it’s critical that your platform is able to show you issues as quickly as possible.
In one of my early tech jobs, we found out about a problem via a phone call from the CTO before any of our alerting told us something was wrong. While he was explaining the issue, alerting started to fire, but by that point, the issue had already been happening for close to 15 minutes. We were unlucky with the timing of the problem, but this could easily happen to anyone.
Analytics-Powered Insight
The volume of data generated by an observability system is astronomical. There’s no way around it – you need something to help you make sense of this data and to suggest things that matter. An observability platform has to make problems easier to solve, not more difficult. Just instrumenting and adding tons of data into a system with no way for it to surface important things is going to make your problems worse.
Adding more information to an observability system can backfire on you without a way to analyze it. In one of my past jobs, nearly every service ran in a JVM, so of course, it made sense to collect JVM memory statistics and then alert on excessive memory usage, GC pause time, and things like that. What we didn’t anticipate when adding these metrics was how many events would be generated by small problems in one application. The alerting tool had no deduplication, and there were thousands of events to manually clear every time the workload changed enough to alter memory allocation patterns in one app. These patterns didn’t have any user impact; the app was simply behaving differently from what we were used to. A good analytics tool would have at least deduplicated these, and at best would have indicated that they weren’t impacting any customer-facing metrics and so weren’t worth a real-time investigation.
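As a rough illustration of the deduplication that tool was missing, here is a minimal sketch. The `AlertEvent` fields, the `deduplicate` function, and the five-minute window are all assumptions made up for this example, not the behavior of any particular product. It keeps the first event for each service and check, then suppresses repeats until the window has passed.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class AlertEvent:
    """Hypothetical alert event; field names are illustrative only."""
    service: str
    check: str
    timestamp: float  # seconds since some reference point


def deduplicate(events: Iterable[AlertEvent],
                window_seconds: float = 300.0) -> List[AlertEvent]:
    """Keep one event per (service, check) within each suppression window."""
    last_kept: Dict[Tuple[str, str], float] = {}
    kept: List[AlertEvent] = []
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.service, event.check)
        previous = last_kept.get(key)
        # Keep the event if this key is new or the window has elapsed.
        if previous is None or event.timestamp - previous >= window_seconds:
            kept.append(event)
            last_kept[key] = event.timestamp
    return kept


if __name__ == "__main__":
    noisy = [AlertEvent("checkout", "gc_pause", float(t)) for t in range(0, 600, 5)]
    quiet = deduplicate(noisy)
    print(f"{len(noisy)} raw events reduced to {len(quiet)}")
```

This only handles the "at least" case from the story. Deciding that an alert has no effect on customer-facing metrics needs correlation with those metrics, which is a much bigger job than a time window.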
Enterprise-Grade Scale and Features
Yes, I know that we’re dealing with buzzword city whenever anyone says “enterprise”, but a robust observability system has to do many things that go beyond simple monitoring. Your system will probably need to operate across multiple clouds eventually (and probably a few on-premises systems). You’ll start to rely on it, so it needs to keep running no matter how much you grow and no matter how many services you have. As you get even larger, true ‘enterprise’ features like RBAC, access tokens, and accounting will be needed. The worst outcome would be needing these features and finding they aren’t available, which forces a time-consuming and unnecessary switch of observability tools.
Open Standards
OpenTelemetry is the future of observability. This is primarily because instrumentation is challenging work. To get the benefits of observability, you have to instrument all of your applications, but ideally, you would want to instrument only once and then observe from anywhere. OpenTelemetry enables this. Without an open standard, time spent instrumenting your environment is time and effort spent on work that you’ll almost certainly have to do again at some point in the future. With OpenTelemetry, you can easily change observability platforms if the need arises. You also have full control over what data is sent where, for enhanced customer privacy and possibly enhanced performance of your observability system.
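The “instrument once, observe from anywhere” idea comes down to keeping application code unaware of where its telemetry goes. The sketch below illustrates that separation in plain Python. It is not the OpenTelemetry API: `Exporter`, `ConsoleExporter`, `InMemoryExporter`, `Tracer`, and `handle_checkout` are hypothetical names invented for this example, and the real OpenTelemetry SDKs have their own, much richer interfaces.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List, Protocol, Tuple


class Exporter(Protocol):
    """Hypothetical backend interface; any destination implements this."""
    def export(self, name: str, duration_ms: float,
               attributes: Dict[str, str]) -> None: ...


class ConsoleExporter:
    """Sends spans to stdout."""
    def export(self, name: str, duration_ms: float,
               attributes: Dict[str, str]) -> None:
        print(f"{name} took {duration_ms:.2f} ms {attributes}")


class InMemoryExporter:
    """Collects spans in a list, standing in for a second backend."""
    def __init__(self) -> None:
        self.spans: List[Tuple[str, float, Dict[str, str]]] = []

    def export(self, name: str, duration_ms: float,
               attributes: Dict[str, str]) -> None:
        self.spans.append((name, duration_ms, attributes))


class Tracer:
    """Times a block of code and hands the result to every exporter."""
    def __init__(self, exporters: List[Exporter]) -> None:
        self._exporters = list(exporters)

    @contextmanager
    def span(self, name: str, **attributes: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            duration_ms = (time.perf_counter() - start) * 1000.0
            for exporter in self._exporters:
                exporter.export(name, duration_ms, dict(attributes))


def handle_checkout(tracer: Tracer, user_id: str) -> str:
    # The application is instrumented once and never names a backend.
    with tracer.span("checkout", user_id=user_id):
        return "ok"


if __name__ == "__main__":
    memory = InMemoryExporter()
    # Changing destinations only touches this line, not handle_checkout.
    tracer = Tracer([ConsoleExporter(), memory])
    handle_checkout(tracer, user_id="u1")
```

In this sketch, switching or adding a destination means changing the list passed to `Tracer`, and the instrumented code in `handle_checkout` stays as it is. Choosing what is exported where is also how you would keep sensitive attributes away from a particular backend.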
What This Means for You
To start your observability journey, you want to make sure that whatever platform you’re choosing can deliver on these five key tenets. Splunk Observability Cloud is built to deliver on them. It also provides a single place to view your entire operation, from an on-premises monolith to a globally distributed Kubernetes world, along with observability-as-code support through Terraform, and more.
You can start a free trial with no credit card required and experience it for yourself, or check out a demo on the product overview page.