When meeting with a current or prospective Splunk customer, one question we are often asked is “Why do I need Splunk when I can just use Amazon CloudWatch, Azure Monitor, or Google Cloud’s operations suite (formerly known as Stackdriver) for my cloud monitoring needs?” And what a great question it is!
Cloud provider monitoring is often perceived as “cheaper”, “easier” or “more integrated”, so there’s usually an inclination, particularly early on in your cloud journey, to rely heavily on those tools. We’ve seen this approach attempted over and over again, and trust us, it’s a big mistake.
In this blog, we’ll help you understand why Splunk is necessary. We’ll provide the “big picture” answer to this question and back it up with five insights, each with solid examples, that you’ll never get using your cloud provider alone!
The “Big Picture” Answer
OK, so why do you need Splunk, and not just your cloud provider’s monitoring solution?
The simplest answer I can provide is this: cloud provider monitoring solutions often just feed yet another data silo, where data gets locked away, unable to be consumed by the masses or correlated with other data sources. In addition, the features and functionality in each cloud provider’s monitoring solution are limited, and it would seem that Amazon, Microsoft and Google strive for “just good enough” when it comes to functionality and usability.
This isn’t a new problem either; for years Splunk has been the de facto standard solution for liberating and correlating data. Take VMware, for example. Sure, VMware administrators may be happy and comfortable using vSphere to monitor, administer and troubleshoot the VMware platform. But what about the application owner whose application runs on a VMware virtual machine? Chances are they aren’t logging into vSphere when something is wrong with their application. However, it’s quite possible their application problems occurred because of an unexpected vMotion event or a noisy neighbor VM! The same can be said for databases, networking, storage and every other technology. Without the ability to pull data from multiple systems into one place, you’ll always struggle to correlate, investigate and get to the true root cause when incidents occur.
If that big picture answer hasn’t convinced you yet, let me back it up with five solid examples of insights you’ll struggle to get using your Cloud platform monitoring.
Insight 1: Visibility Across Multiple Cloud Regions and Accounts
If you’re like most organizations, you manage multiple cloud accounts and deploy resources across multiple regions, and sadly you only get “full visibility” of your infrastructure on your billing statements, if you catch my drift. Moving account by account, region by region within the cloud provider console to gain insights about the infrastructure is a painstaking process. However, using Splunk, you’ll be able to easily answer critical infrastructure questions across accounts and across regions, such as:
- Which accounts are deploying large amounts of infrastructure, or particularly costly infrastructure?
- Are we consistently and effectively tagging infrastructure with service, team, owner and other information that helps us understand and manage cost?
- Have we inadvertently deployed infrastructure in a region where we should not be conducting business or storing data?
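To make those three questions concrete, here is a minimal Python sketch of the checks you would run once inventory data from every account and region sits in one place. It is only an illustration of the idea, not how Splunk implements it: the records, the required tags and the approved region list are all made-up assumptions.

```python
from collections import Counter

# Hypothetical, hand-written inventory records. In practice these would come
# from whatever collects resource metadata across your accounts and regions.
INVENTORY = [
    {"account": "prod", "region": "us-east-1", "id": "i-01", "tags": {"team": "web", "owner": "ana"}},
    {"account": "prod", "region": "eu-west-1", "id": "i-02", "tags": {"team": "web"}},
    {"account": "dev", "region": "us-east-1", "id": "i-03", "tags": {}},
    {"account": "dev", "region": "ap-south-1", "id": "i-04", "tags": {"team": "data", "owner": "li"}},
]

REQUIRED_TAGS = {"team", "owner"}               # assumed tagging policy
APPROVED_REGIONS = {"us-east-1", "eu-west-1"}   # assumed list of approved regions


def count_by_account(records):
    """Which accounts are deploying the most resources?"""
    return Counter(r["account"] for r in records)


def missing_tags(records, required=REQUIRED_TAGS):
    """Map each resource id to the sorted list of required tags it lacks."""
    result = {}
    for r in records:
        missing = sorted(required - set(r["tags"]))
        if missing:
            result[r["id"]] = missing
    return result


def outside_approved_regions(records, approved=APPROVED_REGIONS):
    """Resource ids deployed in regions that are not on the approved list."""
    return [r["id"] for r in records if r["region"] not in approved]


if __name__ == "__main__":
    print(count_by_account(INVENTORY))
    print(missing_tags(INVENTORY))
    print(outside_approved_regions(INVENTORY))
```

None of these checks is hard on its own. The hard part is having every account and region in the same data set to begin with.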
Figure 1: Inventory of all Azure VMs across accounts grouped by region
Figure 2: A Splunk AWS EC2 costs analysis dashboard
Insight 2: End to End Troubleshooting of Hybrid Applications
If you’re like most organizations, you’re migrating workloads to the cloud, but these workloads still have on-prem and external dependencies. In fact, the workloads that remain on-prem are usually among the most sacred, business-critical systems a company depends on. Migrations to the cloud, meanwhile, are often phased, with lower-risk components going first, and in most cases the migrated workloads still call back to external dependencies to fulfill their responsibilities.
Imagine trying to troubleshoot an application like that. Oh wait, you’ve probably already had to! And let me guess how it went: the cloud operations team was using a particular cloud provider monitoring solution, which reported “all normal”, while the IT operations teams used their own monitoring tools, which also reported “all normal”.
From what we’ve seen at Splunk, when organizations are able to solve this complex problem, it’s because they’ve centralized their monitoring and investigations into a single platform, where Splunk searches, dashboards and visualizations become the “universal language” that teams use to triage, isolate and investigate incidents.
Figure 3: Multi-Cloud service health scores shown in a Splunk Observability Executive Glass Table
Figure 4: A Splunk dashboard showing private cloud hosts metrics and AWS instance metrics
Insight 3: Monitoring and Alerting at Per-Second Granularity
Cloud providers (understandably) impose limitations on the type and granularity of monitoring data you have access to. For instance, Amazon EC2 sends instance metrics to CloudWatch once every 5 minutes by default. At best, with detailed monitoring enabled, you can increase that to once per minute, but no more often for those built-in metrics. Imagine your cloud provider as a babysitter tasked with watching your kids. Would you be comfortable with them constantly on their cell phone, glancing up once every 5 minutes to “check on the kids”? No thanks! A lot of chaos can happen in 5 minutes.
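To see how much a coarse interval can hide, here is a small Python sketch. It assumes the coarse data point is an average over the window, and the CPU readings are invented for the illustration: a steady 20% load with a 30-second burst at 100%.

```python
from statistics import mean


def downsample(samples, window):
    """Average consecutive, non-overlapping windows of per-second samples."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]


# Ten minutes of hypothetical per-second CPU readings: steady 20%,
# with a 30-second burst at 100% starting at second 120.
cpu = [20.0] * 600
cpu[120:150] = [100.0] * 30

per_second_peak = max(cpu)                 # what per-second data shows
five_minute_points = downsample(cpu, 300)  # what 5-minute averages show

if __name__ == "__main__":
    print(per_second_peak)       # 100.0
    print(five_minute_points)    # [28.0, 20.0]
```

An alert set to fire at 90% CPU would trigger on the per-second data, but would never see anything above 28% in the 5-minute averages.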
In addition to granularity, there’s likely more telemetry about a device that you’ll want to collect but that your cloud provider’s metrics don’t support. I’ll pick on AWS just a little bit more here: EBS storage is priced by total disk size allocated, regardless of how much space is actually being used. So, let’s save on our cloud costs by using CloudWatch to visualize and monitor the percentage of storage in use for each EBS volume, right? Wrong! While you can easily see how large the volume is, there isn’t a built-in metric that tells you how much of it is in use. Ouch!
Figure 5: Graphic showing Splunk OpenTelemetry (OTel) metrics in comparison to AWS CloudWatch metrics
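The reason for that gap is that “space in use” is something only the operating system inside the instance knows, so it has to be measured from there. As a rough sketch of what that in-guest measurement looks like, here it is with nothing but Python’s standard library. The `percent_used` helper is a hypothetical name for this example, not part of any Splunk or AWS tooling.

```python
import shutil


def percent_used(path="/"):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    print(f"{percent_used('/'):.1f}% of the root filesystem is in use")
```

A collection agent running on the host does essentially this on a schedule and ships the result as a metric, which is what makes an “EBS volumes under 20% full” report possible.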
Insight 4: Advanced Investigations and Analytics with Powerful Search Capabilities
Because Splunk absolutely dominates this area, I’m going to come right out and say it... Searching and analyzing logs in your cloud provider’s monitoring tool will quickly leave you disappointed and needing more. Basic collection, basic searching, and basic insights are possible, but in-depth investigations and environment-wide coverage require advanced capabilities for data collection, parsing and searching only found in Splunk. Let’s look at a few specific examples.
Suppose you’d like to determine if a recent code deployment is causing new issues in the environment. One powerful analytical technique is to search for newly seen stack traces, but this can be tricky. How do you determine when one stack trace is the same as another? With some clever SPL, we can group matching stack traces together and easily detect the ones we haven’t seen before.
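The SPL itself isn’t reproduced in this post, but the underlying idea can be sketched in a few lines of Python. This is one possible approach under stated assumptions, not the search shown in the figures: it assumes Java-style traces, keeps only the exception type and the `at ...` frames, strips line numbers, and hashes what is left. The `fingerprint` and `newly_seen` helpers are hypothetical names.

```python
import hashlib
import re


def fingerprint(stack_trace: str) -> str:
    """Return a short, stable id for a Java-style stack trace."""
    lines = [line.strip() for line in stack_trace.strip().splitlines()]
    # Keep the exception type but drop its message, which often carries
    # variable data such as ids or timestamps.
    exception_type = lines[0].split(":", 1)[0] if lines else ""
    parts = [exception_type]
    for line in lines[1:]:
        if line.startswith("at "):
            # Drop line numbers such as ":42)" so small code shifts
            # do not turn a known trace into a "new" one.
            parts.append(re.sub(r":\d+\)", ")", line))
    digest = hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()
    return digest[:12]


def newly_seen(baseline_traces, recent_traces):
    """Fingerprints present after a deployment but not before it."""
    known = {fingerprint(t) for t in baseline_traces}
    return {fingerprint(t) for t in recent_traces} - known
```

With this normalization, two traces that differ only in the order id in the message or in a shifted line number collapse to the same fingerprint, while a different exception type or call path produces a new one.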
Suppose you’d like to determine if any S3 bucket data has been inadvertently configured as publicly available. Without Splunk, this can be tough! But with Splunk’s powerful search language, we can collect S3 access logs, enrich them with geo-location lookup, and plot them on a map by S3 bucket name. Wow!
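As a rough illustration of the first half of that search, the Python sketch below scans S3 server access log lines for successful reads made by unauthenticated requesters, which the log records as `-` in the requester field. The sample lines are invented and use documentation IP ranges, and the pattern only covers the leading fields of the log format. An anonymous hit is a strong hint that an object is publicly readable, not proof of how the bucket is configured.

```python
import re
from collections import defaultdict

# Leading fields of an S3 server access log line; later fields are ignored.
LOG_PATTERN = re.compile(
    r'^(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<remote_ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'(?P<request_uri>"[^"]*"|-) (?P<status>\S+)'
)

# Hypothetical sample lines for illustration only.
SAMPLE_LINES = [
    'ownerid reports-bucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 - REQ1 '
    'REST.GET.OBJECT q1.csv "GET /reports-bucket/q1.csv HTTP/1.1" 200 - 113 113 7 6 "-" "curl/7.0" -',
    'ownerid reports-bucket [06/Feb/2019:00:01:02 +0000] 198.51.100.7 '
    'arn:aws:iam::111122223333:user/ana REQ2 '
    'REST.GET.OBJECT q1.csv "GET /reports-bucket/q1.csv HTTP/1.1" 200 - 113 113 7 6 "-" "aws-cli" -',
    'ownerid private-bucket [06/Feb/2019:00:02:10 +0000] 192.0.2.9 - REQ3 '
    'REST.GET.OBJECT secret.txt "GET /private-bucket/secret.txt HTTP/1.1" 403 AccessDenied 243 - 4 - "-" "curl/7.0" -',
]


def anonymous_reads(lines):
    """Map bucket name -> set of client IPs that read objects anonymously."""
    hits = defaultdict(set)
    for line in lines:
        m = LOG_PATTERN.match(line)
        if not m:
            continue  # skip lines that do not look like access log entries
        if (m["requester"] == "-"
                and m["operation"].startswith("REST.GET")
                and m["status"] == "200"):
            hits[m["bucket"]].add(m["remote_ip"])
    return dict(hits)


if __name__ == "__main__":
    print(anonymous_reads(SAMPLE_LINES))
```

The client IPs collected per bucket are exactly what a geolocation lookup would then enrich before plotting on a map.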
Figure 6: AWS S3 geo-location lookup and plotting shown in a Splunk dashboard
Figure 7: Stacktrace search detail in Splunk
Figure 8: Detect and trend stacktraces in Splunk
Insight 5: Faster Time-to-Value with Pre-Built Content
Finally, it should come as no surprise that getting the data is only the first step. How you use the data is really where the value is. While cloud providers make data collection from their services fairly easy and often provide domain-specific dashboards, correlating and viewing data across different services is very much a manual process. Additionally, every application and its underlying architecture is unique, which means that creating meaningful visualizations from many different sources of data can be an arduous task.
Splunk Infrastructure Monitoring (SIM) and Splunk Application Performance Monitoring (APM) provide out-of-the-box workflows for your critical data regardless of where it originated. This enables immediate usage with context-based drill-downs, speeding up ROI and enabling you to get the most value from your data. The out-of-the-box dashboards can also be used as templates, streamlining the process of going from a domain-specific dashboard to a highly correlated end-to-end view of your mission-critical services. Charts and other visualizations can easily be copied from disparate dashboards and pasted into persona-based, highly actionable dashboards. The dashboards can then be shared with others and included in team-based collections, increasing collaboration.
Figure 9: AWS built-in dashboard groups in Splunk Observability Cloud
Figure 10: Google Cloud Platform built-in dashboard groups in Splunk Observability Cloud
Figure 11: Azure built-in dashboard groups in Splunk Observability Cloud
The primary objective of cloud providers is to deliver infrastructure, platforms, and software as a service. Delivering effective monitoring of that infrastructure is, at best, a secondary objective. We’ve seen this pattern for years as new technologies emerge, and there’s no reason to believe cloud providers are different or special. At Splunk, our primary objective is to effectively collect, search, analyze and monitor every type of machine data so that you can turn data into doing.
To learn more about how Splunk can help your IT, CloudOps and DevOps teams proactively observe and analyze your evolving cloud estate, check out our Splunk Observability Cloud page.