The Five Tenets of Observability
In this post, I’ll briefly explain what observability is, what a system needs to actually provide you with true observability, and how you can start the observability journey yourself.
Observability is a mindset that lets you answer questions about your business — from the user’s experience, through the application itself, and beyond to the business metrics and processes that the application enables. It’s an evolution of monitoring that greatly expands the volume of ingested data and radically expands the number and type of questions you can answer. It’s not just “metrics, traces, and logs” – observability is really about instrumenting everything and using this data to make better decisions. I wrote more about this in a different post, Observability: It's Not What You Think, that I’d encourage you to check out for an observability deep-dive.
Before I came to work at Splunk, I was an SRE (well, a systems admin at one of my jobs, but I’m old). I know first-hand how important enterprise-grade observability is: I solved plenty of problems in the past that I wish I could have investigated with an observability system like the one we sell at Splunk. In the rest of this post, I’ll discuss five things an observability system must do to be worth your investment, with examples from my own operations experience that show why each one is critical.
What Differentiates One Observability Product from Another?
Every vendor will tell you that by buying their product and installing it you instantly ‘get’ observability, and in every case, including buying the product from us, this isn’t true. What you get out of the box varies a lot, however. When you’re thinking about what an observability solution will get you, you need to think of a few things that aren’t necessarily going to be published on the website or discussed in reviews. In the next section, I’ll discuss what I’ve found to be the five key tenets for an observability system. These apply to any system – commercial or homegrown – and make a real difference in how you can get value from an observability migration.
The Five Key Tenets of Observability
When evaluating an observability system, look for these five key tenets: full-stack, end-to-end visibility; real-time answers; analytics-powered insight; enterprise-grade scale and features; and open standards. Let’s dive into each of these in more detail:
Full Stack and End-to-End
Adopting an observability platform that can’t give you 100% visibility into all your transactions, from the user’s browser, through your application, to the underlying business platform, is setting yourself up to miss something critical. This includes support for things like RUM (real user monitoring) to determine user browser behavior, but it also includes avoiding sampling; read this post to learn why sampling is an antipattern in observability. In addition to the user’s experience, you’ll also need insight into backend performance, including things like database query performance or code profiling.
I can’t count the number of issues I had to troubleshoot at LinkedIn that started with someone important firing off a bug report to the sre@ email list. At that point, you simply have to figure out what happened and fix it. If our tools at LinkedIn hadn’t been able to see the end-to-end history for all our users, I might not have been able to fix those issues at all, or it would have taken much longer than necessary.
Real Time
In one of my early tech jobs, we found out about a problem via a phone call from the CTO before any of our alerting told us there was a problem. While he was explaining the issue, alerting started to fire, but by that point the issue had already been happening for close to 15 minutes. We were unlucky with the timing of that particular problem, but this could easily happen to anyone.
Analytics-Powered
The volume of data generated by an observability system is astronomical. There’s no way around it – you need something to help you make sense of this data and to suggest things that matter. An observability platform has to make problems easier to solve, not more difficult. Just instrumenting and adding tons of data into a system with no way for it to surface important things is going to make your problems worse.
Adding more information to an observability system can backfire on you without a way to analyze it. In one of my past jobs, nearly every service ran in a JVM, so of course it made sense to collect JVM memory statistics and then alert on excessive memory usage, GC pause time, and things like that. What we didn’t anticipate when adding these metrics was how many events would be generated by small problems in one application. The alerting tool had no deduplication, so there were thousands of events to clear manually every time the workload changed enough to alter memory allocation patterns in one app. These patterns didn’t have any user impact; the app was just behaving differently from what we expected. A good analytics tool would have at least deduplicated these events, and at best would have indicated that they weren’t impacting any customer-facing metrics and so weren’t worth a real-time investigation.
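To make the deduplication idea concrete, here is a minimal Python sketch of what I wish that alerting tool had done. Everything in it is hypothetical and for illustration only: the Event fields, the five-minute window, and the deduplicate and worth_paging helpers are my assumptions, not the behavior of any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A raw alert event. The field names are assumptions for illustration."""
    service: str
    metric: str
    timestamp: float  # seconds since some reference point


def deduplicate(events, window_seconds=300.0):
    """Collapse repeats of the same (service, metric) pair into one alert.

    An event joins the open alert for its key if it arrives within
    window_seconds of that alert's first event; otherwise it opens a new one.
    """
    alerts = []
    open_alert = {}  # (service, metric) -> index of the open alert in alerts
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.service, event.metric)
        idx = open_alert.get(key)
        if idx is not None and event.timestamp - alerts[idx]["first_seen"] <= window_seconds:
            alerts[idx]["count"] += 1
            alerts[idx]["last_seen"] = event.timestamp
        else:
            open_alert[key] = len(alerts)
            alerts.append({
                "service": event.service,
                "metric": event.metric,
                "first_seen": event.timestamp,
                "last_seen": event.timestamp,
                "count": 1,
            })
    return alerts


def worth_paging(alerts, customer_impacted_services):
    """Keep only alerts for services whose customer-facing metrics are degraded.

    How you decide which services are customer-impacted is outside this
    sketch; here it is simply passed in as a set of service names.
    """
    return [a for a in alerts if a["service"] in customer_impacted_services]


if __name__ == "__main__":
    # A thousand GC pause events from one app collapse into a single alert.
    noisy = [Event("checkout", "jvm.gc.pause", i * 0.1) for i in range(1000)]
    for alert in deduplicate(noisy):
        print(alert["service"], alert["metric"], alert["count"])
```

In this sketch the thousands of JVM events from one application become a single alert with a count, and worth_paging drops it entirely unless that service also shows customer impact. A real analytics engine does far more than this, but even this much would have saved us a lot of manual clearing.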
Enterprise-Grade
Yes, I know that we’re dealing with buzzword city whenever anyone says “enterprise”, but a robust observability system has to do many things that go beyond simple monitoring. Your system will probably need to operate across multiple clouds eventually (and probably a few on-premises systems). You’ll start to rely on it, so it needs to keep running no matter how much you grow and no matter how many services you have. As you get even larger, true ‘enterprise’ features like RBAC, access tokens, and accounting will be needed. The worst outcome would be needing these features, finding that they aren’t available, and being forced into a time-consuming and avoidable switch of observability tools.
Open Standards
What This Means for You
To start your observability journey, make sure that whatever platform you’re choosing can deliver on these five key tenets. Splunk Observability Cloud is built to deliver on these, in addition to providing a single place to view your entire operation, from an on-premises monolith to a globally distributed Kubernetes world, observability-as-code support through Terraform, and more.
You can start a free trial with no credit card required and experience it for yourself, or check out a demo on the product overview page.