Systems going down because of an unforeseen incident? Got problems with your app or website? Is your audience missing out on products and services because your load times are too slow?
Then monitoring and observability (and telemetry) should be of interest to you!
In this long article, we’re covering everything! I’ll start with the concepts and how they work. Then I’ll move on to the real-world stuff that brings it all together: tools and examples so you can ensure the reliability of all the systems that power your business.
TLDR: monitoring vs observability vs telemetry
The quick summary:
- Monitoring is an action you take: you monitor a system, an app or a certain metric to pick up on anomalies that might indicate an issue. It answers the question “Is this system working correctly?”
- Observability is a property of your system, not an action, that helps you control complexity. It answers the question: “What is happening inside this app or across a system?” While observability incorporates monitoring activities, it goes far beyond mere monitoring.
- Telemetry is the supporting data. Logs, metrics and traces are what power observability, but these three items on their own do not make a system observable.
Keep reading for more in-depth, expert information.
What is monitoring?
A simple concept that’s sometimes harder in practice, IT monitoring includes any activity that ensures digital equipment and services are working properly. Monitoring helps IT professionals detect issues and possibly resolve them. From a systems standpoint, monitoring can help any time you might ask, “Is this [system, app, network, etc.] working correctly?”
IT monitoring is a catch-all phrase, but the monitoring activity gets more specific depending on the use case (area) you need to monitor. Overall, IT monitoring can play a part in all areas of digital and IT services:
- IT operations and IT service management, including availability/uptime and SRE functions
- Cybersecurity, including security orchestration, automation and response (SOAR) and security information and event management (SIEM)
- Software development and DevOps activities
- Overall operational intelligence
Though it will always depend on the area you’re monitoring, we can sum up monitoring as collecting and analyzing predefined data types (network bandwidth, CPU utilization rates, etc.) in order to detect abnormal behaviors that might indicate problems.
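To make that concrete, here’s a minimal Python sketch of the idea: take one predefined data type (CPU utilization) and flag readings that sit far outside the recent baseline. The function name, window size, thresholds and sample readings are all made up for illustration; real monitoring tools use far more robust detection than this.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, z_threshold=3.0, min_sigma=1.0):
    """Return (index, value) pairs that deviate sharply from the preceding window.

    All parameters are illustrative assumptions, not recommended defaults.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        baseline = mean(history)
        # Floor the spread so a perfectly flat baseline neither divides by zero
        # nor flags tiny fluctuations.
        spread = max(stdev(history), min_sigma)
        if abs(samples[i] - baseline) / spread > z_threshold:
            anomalies.append((i, samples[i]))
    return anomalies


# Hypothetical CPU utilization readings, in percent.
cpu_percent = [21.0, 23.0, 22.0, 24.0, 22.0, 23.0, 95.0, 24.0, 22.0]
anomalies = find_anomalies(cpu_percent)
print(anomalies)
```

With these readings, only the 95% spike at index 6 is flagged. Notice what the sketch can’t do: it tells you that something looks abnormal, not why.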
(Get all the details in our IT monitoring explainer.)
Types of monitoring
So, what sorts of IT areas can you monitor? Well, practically everything! See if your website is up, make sure your infrastructure has capacity for all its workloads, ensure APIs are responsive, identify security risks. You can monitor all these particular areas and probably a lot more:
- Availability monitoring (often associated with uptime) which can also cover network monitoring, server monitoring and infrastructure monitoring
- Application performance monitoring (APM)
- Web performance monitoring, including site reliability
- API monitoring
- Real user monitoring (RUM) and synthetic monitoring
- Security monitoring and network security monitoring
- Business activity monitoring
Tool types for IT monitoring
Within monitoring, we can group the tools into three main types:
- Observational tools can look at hardware, software and services and report on how or whether they are operational
- Analysis tools take the observations and might determine where—and why—problems are occurring. These can overlap with AIOps tools, which can predict problems proactively by looking at historic patterns.
- Engagement tools are the premium version of monitoring tools because they can act on information that the other tools only report on.
(Yes, Splunk has a variety of monitoring tools. Explore them now or read on to the observability section for a fuller understanding.)
Challenges with legacy monitoring
So, yes, monitoring has been around for decades and today it remains important. But with distributed systems (and distributed workers!), traditional monitoring does have clear limitations.
Today, most enterprises are using containers, microservices and Kubernetes in some capacity. These cloud-native technologies enable flexibility and agility and accelerate time-to-market, but they are too complicated for legacy monitoring approaches. There are a few reasons for this, as Spiros Xanthos describes:
- Data gaps. Traditional monitoring tools might only sample data, limiting visibility for both the users and any analytics algorithms running on that data. The result is lower visibility into customer-impacting issues, which translates to a longer time to resolution.
- Slow movement. Serverless functions inherent in cloud-native tech are invoked within seconds or less. Traditional monitoring tools can’t pick up action this quickly, contributing to more missing data.
- Missing intelligence. Most monitoring tools weren’t built to take in data at the rate we’re accustomed to today. And the tools aren’t equipped with built-in intelligence for the data they do take in, resulting in too many alerts and not enough actionable insight.
- Tool sprawl. As I showed above, monitoring can apply to practically any digital area—which means there are too many tools to overlay and integrate. This is another missed opportunity.
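The data-gap and slow-movement problems are easy to see in a toy simulation. The Python sketch below assumes a monitor that polls once every 60 seconds and a handful of short-lived serverless invocations. Every number and name here is hypothetical and chosen only to illustrate the effect.

```python
def seen_by_polling(invocations, poll_interval_s, horizon_s):
    """Return the invocations that are running at the instant of at least one poll."""
    poll_times = range(0, horizon_s + 1, poll_interval_s)
    return [
        (start, duration)
        for start, duration in invocations
        if any(start <= t < start + duration for t in poll_times)
    ]


# Hypothetical serverless invocations as (start_second, duration_seconds).
invocations = [(3.0, 0.2), (10.0, 0.5), (59.9, 0.3), (75.0, 1.5), (119.5, 0.4)]

seen = seen_by_polling(invocations, poll_interval_s=60, horizon_s=120)
missed = len(invocations) - len(seen)
print(f"polling saw {len(seen)} of {len(invocations)} invocations, missed {missed}")
```

In this made-up run, the poller catches one invocation out of five. The other four start and finish between polls, so they never show up in the data.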
Now, let’s turn to observability, which specifically aims to address these legacy challenges.
Observability explained: what being observable means
Where monitoring is an action you take, observability is an overall property of a system. The more you can observe a system, the more you can understand the complex ways it behaves. We no longer have to assume that various integrated services are a “black box” that we cannot see into.
But I encourage you to think more creatively about what this can mean. As Greg Leffler, head of observability practitioners, puts it:
“Observability is a mindset that enables you to answer any question about your entire business.”
How observability differs from monitoring
Monitoring contributes to a system’s overall observability. So, with monitoring, you might be asking “is an individual piece (network, website, application or other service) up and running as expected?” With observability, you’re asking a bigger question: “How well is everything working?”
Previously, monitoring might alert you that your server’s CPU is spiking...but it can’t tell you which pod or container to go to, let alone if the spike is even something you need to worry about. So, no longer do we have to say “this system is too complicated to understand”. With observability, we can know so much more.
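Here’s a small Python sketch of that CPU example. It assumes each metric sample carries dimensions (host and pod) alongside its value. The pod names, field names and numbers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical CPU samples; each one is tagged with the host and pod it came from.
samples = [
    {"host": "node-1", "pod": "checkout-7d9f", "cpu_millicores": 180},
    {"host": "node-1", "pod": "search-5c2a", "cpu_millicores": 220},
    {"host": "node-1", "pod": "inventory-9b1e", "cpu_millicores": 3400},
]


def total_by(samples, dimension):
    """Sum CPU usage grouped by the given dimension."""
    totals = defaultdict(int)
    for sample in samples:
        totals[sample[dimension]] += sample["cpu_millicores"]
    return dict(totals)


host_view = total_by(samples, "host")  # the host-level view: one big number
pod_view = total_by(samples, "pod")    # the drill-down view: one number per pod
hottest_pod = max(pod_view, key=pod_view.get)

print(host_view)
print(hottest_pod)
```

The host-level total only says that node-1 is busy. Grouping the same data by pod points straight at the one pod responsible.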
One real-world example shows exactly this difference: PUMA uses Splunk to do a lot more than simply know whether its sites are up or down. After all, uptime is only the starting point. Uptime alone doesn’t make a website or business succeed.
“Before using Splunk, PUMA’s basic monitoring capabilities could only indicate whether its e-commerce sites were up or down. This meant DevOps and business teams couldn’t detect critical issues that caused failed orders, such as unresponsive inventory systems or declined credit cards. The result was a significant number of missed sales opportunities.”
Observability relies on a system’s external outputs. Yes, it does rely in part on your monitoring practices and chosen metrics, but it can also surface “unknown unknowns” that monitoring cannot see. Let’s see how this came about.
Brief history of monitoring & observability
Like monitoring, the concept of observability has been around a long time. Observability dates to academic research from the 1960s, but only much more recently has it entered the wide world of IT. We can point to two key drivers of the “sudden” interest in observability over the last decade or so:
- The exponential growth and adoption of distributed systems, including microservices, containers and serverless functions, which are inherently more complicated to monitor.
- The emerging capability to orchestrate technology across silos and integrate their insights for faster business operations.
Ultimately, observability can help system administrators understand unpredictable situations, which are most common in the distributed systems that enterprises support today.
How observability works
For a system to be observable it requires two things: plenty (!!) of data as well as the tools necessary to aggregate and operate on that data.
Observability relies on three types of telemetry data: metrics, logs and traces. With this information, teams can see deeply into complex systems and investigate the root cause of many issues that monitoring alone wouldn’t point to. When a system is truly observable, teams can…
- Monitor modern systems in more effective ways.
- Connect knock-on effects in complex chains, tracing them back to the root cause.
- Enable visibility into an entire architecture, breaking down silos.
What is telemetry data?
You’ll often hear the word “telemetry” associated with observability and monitoring. This is not a separate concept, but a supporting one: telemetry data is what enables a system to be truly observable. Telemetry data refers to the logs, metrics and traces in observability — what is sometimes called “the three pillars of observability”.
- Logs are text records of events that include a timestamp (when the event happened) and a payload that offers context. Often logs are simple plain text, though they can be structured or binary, too.
- Metrics are values measured over time. They’re structured by default.
- Traces indicate the end-to-end journey of a request across a distributed system. That request shows every operation performed on it. This is also known as distributed tracing.
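To show how the three differ in shape, here’s a rough Python sketch that models each one as a small data class. The class and field names are my own simplification for illustration, not the schema of any particular tool or standard.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LogRecord:
    """A timestamped event with a payload that offers context."""
    timestamp: float
    message: str
    attributes: dict = field(default_factory=dict)


@dataclass
class MetricPoint:
    """One value of a named measurement at a point in time."""
    name: str
    timestamp: float
    value: float
    dimensions: dict = field(default_factory=dict)


@dataclass
class Span:
    """One operation performed on a request; a trace is a tree of spans."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


# Hypothetical examples of each type, all describing the same checkout request.
span = Span(trace_id="t1", span_id="s1", parent_id=None,
            name="POST /checkout", start=100.0, end=100.25)
log = LogRecord(timestamp=100.2, message="payment declined",
                attributes={"trace_id": "t1", "span_id": "s1"})
metric = MetricPoint(name="checkout.duration_seconds", timestamp=100.25,
                     value=span.duration, dimensions={"service": "checkout"})

print(span.duration)
```

The detail worth noticing is the shared trace_id: it’s the kind of common key that lets a log line be tied back to the request that produced it.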
It’s important to understand that telemetry data enables a system to be observable — but these three items alone do not add up to observability. For that, we want to look at additional features we can layer in.
Features and tools for observability
When moving from monitoring towards observability, you don’t have to tear down everything and start from scratch. You could decide to take what you already have and complement it with in-house or open-source software to bring it to an observable state. Of course, you can also look into an end-to-end observability solution (more on that later). So, what goes into a truly observable system?
Typically, five components are required to implement true observability:
- Instrumentation tools collect telemetry data from various components, including the host, the application, the service, the container and more. Likely these tools utilize a framework like OpenTelemetry to expose telemetry data.
- The ability to process and correlate the telemetry data, which offers context, enables automation and supports custom data curation and visualization. The correlation is vital here: if you’re analyzing data in isolation, you’re already limiting the insights you can possibly glean. Only when linking and looking collectively at data will you get full context.
- Root cause analysis enables observability to trace issues to their root cause. Here’s where distributed tracing is vital—without it, you’re only finding anomalies, as in traditional monitoring.
- Automation, including incident response actions that support incident management and ongoing incident automation. These actions aim to inform the right people — who’s available, on-call and has the right skills to support.
- Machine learning and AI operations that can automatically correlate and prioritize incident data. This enables the system to filter out alert noise (a notorious monitoring problem), so you can accelerate incident response for the incidents that most require it.
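Correlation and root cause analysis are the easiest of these to sketch in code. The Python example below assumes spans and logs share trace and span IDs, then uses a simple heuristic: the likely root cause is the failed span that has no failed child of its own. The services, IDs, messages and the heuristic itself are illustrative assumptions, not how any specific product does it.

```python
# Hypothetical spans from one failed request, flattened from a trace tree.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "frontend", "error": True},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "checkout", "error": True},
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "service": "inventory", "error": True},
    {"trace_id": "t1", "span_id": "d", "parent_id": "b", "service": "payments", "error": False},
]

# Hypothetical log lines that carry the same IDs as the spans they belong to.
logs = [
    {"trace_id": "t1", "span_id": "a", "message": "HTTP 500 returned to client"},
    {"trace_id": "t1", "span_id": "c", "message": "connection pool exhausted"},
]


def find_root_cause(spans):
    """Return the failed span with no failed child: the deepest failure in the chain."""
    failed = [s for s in spans if s["error"]]
    parents_of_failures = {s["parent_id"] for s in failed}
    deepest = [s for s in failed if s["span_id"] not in parents_of_failures]
    return deepest[0] if deepest else None


def logs_for_span(logs, span):
    """Correlate logs to a span through their shared trace and span IDs."""
    return [
        entry["message"]
        for entry in logs
        if entry["trace_id"] == span["trace_id"] and entry["span_id"] == span["span_id"]
    ]


root = find_root_cause(spans)
print(root["service"], logs_for_span(logs, root))
```

Looked at in isolation, the frontend’s HTTP 500 is just an anomaly. Linked through the trace, it leads back to the inventory service and the log line that explains the failure.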
Benefits of observability
If there’s one sentence to sum up all the benefits of observability, it’s this: Cloud complexity is easier to handle when you have true observability. Organizations today have hybrid architectures across the multicloud, plus hundreds of microservice-based apps. Complexity with little visibility — talk about burnout for every single one of your IT workers.
We conduct annual research into the global state of observability. Our most recent research from 2022 indicates that companies that lead at observability see benefits like:
- Better MTTR
- Lower downtime costs
- More product or feature launches
Maturing in these areas also has knock-on effects like achieving true digital transformation, building resilience and attracting and retaining top talent.
Observability use cases
Observability isn’t limited to one single improvement area—nor is it limited to helping a certain set of stakeholders. When you’ve matured to a truly observable IT organization, you can see benefits in all sorts of areas, including:
- Solving application performance issues faster. With smarter, integrated monitoring, you can improve uptime and performance, minimize MTTR and optimize resource utilization.
- Automating more processes for more teams. Common beneficiaries here are operational teams, dev teams, and the overlapping areas.
- Building security and resilience. The output of observable architecture (observable data) supports cybersecurity, DevSecOps and SRE practices.
- Delivering a fantastic end-user experience. Good websites, useful apps that work and quick issue resolution all result in a better business brand.
Today’s top observability tools and solutions
Observability products are designed to help developers, IT teams and other stakeholders monitor and manage complex systems, apps and infrastructure.
These companies today offer the most well-known observability solutions, all with their own features and capabilities — and inherent limitations. For example, some solutions focus solely on cloud-native environments, and others offer only distributed tracing or log analytics. Not all offer real-time streaming, either. Common observability tools on the market today include:
- Splunk (That’s us! More in the next section)
- New Relic
Depending on the specific needs and requirements of a particular organization, one or more of these observability products may be useful for improving visibility and managing software systems.
With Splunk Observability, you’ll solve problems in seconds. Our observability solution is the only solution available today that’s full-stack, analytics-powered and OpenTelemetry-native.
Splunk Observability has all the must-haves for observability: instrumentation, data correlation, root cause analysis, automation and machine learning. It also offers some features that a lot of others do not have:
Real-time streaming. Today, the difference between minutes of latency and seconds can mean a lot. Splunk Observability is built on real-time streaming architectures, enabling you to detect and alert on critical patterns in mere seconds, no matter the data format or data structure.
Massively scalable. For large organizations and global enterprises, scalability is essential. Splunk Observability meets your needs no matter how large or how complex those needs are. How much scalability, you ask? Petabytes of daily log ingest and millions of metrics and traces per second, with no performance or response decreases.
With Splunk observability solutions, you can:
- Get insight into cloud-native, microservice and monolithic applications with NoSample distributed tracing and code-level visibility.
- Improve hybrid cloud performance with instant visibility and real-time alerts.
- Ensure service performance with full visibility, AIOps and incident intelligence.
- Start investigating application and infrastructure logs in minutes for the "Why?" behind software behavior.
- Find and fix customer-facing issues across web and mobile with full visibility into the end-user experience.
- Proactively spot and resolve performance issues across user flows, business transactions and APIs.
- Make on-call less frustrating and improve business outcomes with automated incident response.
- Connect on-call DevOps teams to the actionable data they need to diagnose, remediate and restore services faster.
Splunk utilizes and supports OpenTelemetry
We’ve touched briefly on the OpenTelemetry framework. Because no commercial vendor has a single platform for collecting data from every one of your applications, OpenTelemetry was developed to solve this problem. This framework standardizes the way telemetry data is collected and moved to data platforms, like Splunk.
In addition to solving this major problem, OpenTelemetry has some knock-on benefits, too:
- Engineers no longer need to refactor code or install proprietary agents every time a backend change comes along.
- OpenTelemetry will continue to work as new technologies emerge. In contrast, commercial tools mean vendors have to first build new integrations for interoperability.
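The first benefit comes down to a familiar design idea: application code talks to one stable interface, and the backend behind it can be swapped. Here’s a toy Python sketch of that idea using only the standard library. To be clear, this is not the OpenTelemetry API. Every class and function name below is hypothetical.

```python
import io
import json
from typing import Protocol


class Exporter(Protocol):
    """Hypothetical stand-in for a vendor-neutral export interface."""

    def export(self, record: dict) -> None: ...


class InMemoryExporter:
    """One imaginary backend: keeps records in a list."""

    def __init__(self):
        self.records = []

    def export(self, record: dict) -> None:
        self.records.append(record)


class JsonLinesExporter:
    """Another imaginary backend: writes one JSON document per line to a stream."""

    def __init__(self, stream):
        self.stream = stream

    def export(self, record: dict) -> None:
        self.stream.write(json.dumps(record, sort_keys=True) + "\n")


def handle_order(order_id: str, exporter: Exporter) -> None:
    """Application code: it knows only the Exporter interface, not the backend."""
    exporter.export({"event": "order.received", "order_id": order_id})


# The same application code runs unchanged against either backend.
memory_backend = InMemoryExporter()
handle_order("A-100", memory_backend)

buffer = io.StringIO()
handle_order("A-100", JsonLinesExporter(buffer))
print(buffer.getvalue().strip())
```

Swapping the backend means changing which exporter gets passed in. The handle_order function itself never has to be touched.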
(BTW, we’re proud to have donated the OpenTelemetry eBPF collector.)
A real-world example: From monitoring to observability
To illustrate how observability moves far beyond monitoring, let’s look at Rappi, who successfully maximized observability. With the global pandemic, Rappi saw a 300% surge in on-demand orders across 250+ cities in Latin America. Today, they service 7.5 million active users per week.
So how do they ensure their mobile app, infrastructure and backend services stay available and reliable for their customers? They turned to Splunk Observability Suite to:
- Fix issues more than 90% quicker than before.
- Improve overall uptime and performance, which means fewer issues.
- Free up incident-handling time so app developers can release new features and versions biweekly, a major win for sustainable growth.
As you can see, moving to observability is a journey that results in overall business growth and resilience.
For self-service support with Splunk monitoring and observability, check out these resources:
- Splunk Lantern, where you can self-serve your way to achieving business use cases with Splunk products.
- Splunk Docs, where you’ll find all the technical specs for our products.
- Splunk Training & Certification, where you can take a variety of courses or follow learning paths towards Splunk expertise.
- Splunk Community, where you can ask questions and find answers.
- Splunkbase, where you can download free apps that plug more environments into Splunk.
This posting does not necessarily represent Splunk's position, strategies or opinion.