
Face the Unexpected with the Stability and Resiliency of Splunk Cloud Platform

Stability and resiliency of cloud services are top of mind for organizations today. Whether you are rising to the challenge of a surge in pandemic-driven demand or firefighting an unexpected outage, you still have to support your own customers. With the Splunk Cloud Platform service, you have a dependable partner focused on stability and resiliency that can help you quickly investigate, troubleshoot and resolve impacts caused by massive industry-wide outages, internal security vulnerabilities, or user error.

The ongoing pandemic is accelerating an already fast-paced move to the cloud, and growing complexities in the security landscape continue to push stability, resilience and recovery to top of mind. At Splunk, we’re laser-focused on helping customers mitigate the risk that future incidents hold. It’s in our DNA to prioritize the stability and reliability of the service in order to help customers investigate and solve problems fast.

Splunk Cloud Platform: Reliable, Available and Scalable

Splunk Cloud Platform has an “always on,” high availability commitment. From infrastructure management to data compliance, Splunk Cloud Platform is built to scale to your data analytics needs, ranging from GBs to PBs and beyond. Designed to accommodate sudden bursts in data volume, Splunk Cloud Platform allows you to incrementally upgrade capacity while retaining security by design. We provide a dedicated cloud environment for each customer, available in AWS and GCP, as well as encryption in transit and optional encryption at rest. We are continuously evaluating and adding support for new international standards.

How?

Splunk Cloud Platform offers resilience, high availability and disaster recovery. It is built to be ready when things go wrong – and to help fix them as fast as possible. The product team at Splunk has built in innovations to provide business continuity for our customers.

Stability and Resiliency for Our Customers

Customers expect reliable, highly available service, and that is what Splunk provides. Splunk Cloud Platform is designed for:

1. Reliable data-in-transit by using multiple queuing strategies including:

  • Separation of ingest and index (persistent queueing) in the Splunk Cloud Platform boundary, as part of the reimagined Splunk architecture in Victoria Experience
  • Forwarder queueing to prevent data loss by persistently queuing data at its source and retrying if the indexer is down or there are network issues.

2. Reliable data-at-rest and availability through several key strategies:

  • Replication across availability zones (AZs) helps to prevent data loss by reducing the possibility of a single point of failure during ingest 
  • Load balancer indexer randomization helps to prevent high impact data loss scenarios in case one of many indexers goes down. The load balancer also helps to decrease indexer overload, facilitates resilient randomization, and improves ingest scalability, as part of the reimagined Splunk architecture in Victoria Experience
  • Triple data replication for redundancy in the indexer layer.
     

3. High search availability through:

  • Auto-duplication of indexers and replacement in case of failure, reducing the opportunity for a single point of failure
  • Load-balanced access to search tier via Search Head Cluster
  • Nightly configuration backups.

4. Prioritized availability for mission and business critical needs through:

  • Scalable, flexible indexing providing high resiliency to spikes in ingest and search patterns, helping to ensure that high priority, business critical searches are not skipped and do not fail, as part of the reimagined Splunk architecture in Victoria Experience 
  • Replication factors in indexing designed to produce high data availability and prevent skipped searches
  • Search head clustering at the platform layer to prioritize search availability in case a search head goes down.
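
As a rough illustration of two ideas from the list above (persistent queueing at the source with retries, and randomized selection among indexers), here is a minimal Python sketch. It is not Splunk's implementation: the `PersistentQueue` class, the `flush` function, the `send` callback and the one-JSON-event-per-line file format are all hypothetical names and choices made for this example.

```python
import json
import os
import random


class PersistentQueue:
    """Hypothetical on-disk queue: one JSON event per line."""

    def __init__(self, path):
        self.path = path
        if not os.path.exists(path):
            open(path, "w").close()

    def put(self, event):
        # Append to disk so the event survives a process restart.
        with open(self.path, "a") as f:
            f.write(json.dumps(event) + "\n")

    def pending(self):
        with open(self.path) as f:
            return [json.loads(line) for line in f if line.strip()]

    def replace(self, events):
        # Rewrite the file with only the events still waiting for delivery.
        with open(self.path, "w") as f:
            for event in events:
                f.write(json.dumps(event) + "\n")


def flush(queue, indexers, send, max_attempts=3, rng=random):
    """Try to deliver every queued event; keep undelivered ones on disk.

    `send(target, event)` is an assumed callback that raises
    ConnectionError when the chosen indexer cannot be reached.
    Returns the number of events still queued.
    """
    remaining = []
    for event in queue.pending():
        delivered = False
        for _ in range(max_attempts):
            target = rng.choice(indexers)  # randomized indexer selection
            try:
                send(target, event)
                delivered = True
                break
            except ConnectionError:
                continue  # retry, possibly against a different indexer
        if not delivered:
            remaining.append(event)
    queue.replace(remaining)
    return len(remaining)
```

Calling `flush` again later retries whatever is still on disk, so a temporary indexer or network outage delays delivery rather than dropping events. Picking a random target on each attempt spreads load and means a single failed indexer does not block every event.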
     

Use Splunk to be Proactive About Downtime

Detect problems before they happen, in real time.
With Splunk Cloud Platform, stream, analyze, monitor and search any kind of data in real time to detect and prevent issues before they happen. Plus, respond anytime and anywhere with Splunk’s mobile apps and augmented reality capabilities. 

Get to the root of the issue – FAST.
With unified access to all your data sources in the Splunk Cloud Platform, you can investigate the root cause of issues across all your data and uncover previously inaccessible business insights. 

Problem solve in a jiffy.
Splunk Cloud Platform allows you to maximize your team’s efficiency by getting the most value from limited resources. Go live in as few as two days and minimize delays in change management processes for upgrades. When you’re ready, expand your Splunk deployment quickly: multiple TBs of incremental capacity are typically available within two days. Let Splunk take care of the infrastructure management and administration.

At Splunk – We Use Splunk

We trust the operational excellence of Splunk and use it to detect problems before they happen, in real time. We currently use Splunk Cloud Platform, IT Service Intelligence Cloud, Splunk On-Call and an in-house integration with our internal communication channels to make sure the right teams are ready to tackle incident response and management. We learn fast through iteration, reviewing data to ensure things are running smoothly.

“Here at the Splunk NOC, we currently use Splunk on Splunk to track, maintain, and troubleshoot Splunk SaaS logins, scheduled and ad hoc search success, data ingestion and index success, and API function and availability - all to deliver the best possible experience to our Splunk customers.”
Brenden Reeves, Splunk NOC

Here are some ways we currently use Splunk Cloud Platform:

  • To track complete, valid Splunk SaaS logins. We use Splunk to monitor Splunk Cloud Platform logins and authentication success rates and investigate when things go wrong. For example, we have alerts for any unusual geography or multiple failed attempts.
  • To monitor scheduled or ad hoc searches. We use Splunk to monitor search success rates and do deep-dive investigations when failures exceed a set threshold. We proactively monitor whether any of a variety of Service Level Indicators (SLIs) drops below a threshold.
  • To monitor data ingestion and indexing. We monitor indexers to track whether they’re in the desired customer state. We typically alert customers only in outlier scenarios, using machine learning to proactively identify unusual spikes and to avoid inundating customers with unnecessary alerts. If a customer requests support, we’re ready to dive into the performance data and resolve the problem quickly.
  • To track availability and functioning of APIs. We monitor API services to help make sure they remain available to customers and are functioning properly. We monitor the availability of the index tier to ingest (for example, HTTP Event Collector’s sourced ingest and internal Splunk-to-Splunk 9997 ports) and the availability of the search tier (for example, availability of the login page, Hybrid Search API’s ability to search cloud indexers, or the availability of the search service itself via compute-negligible test searches).
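
The threshold-style checks described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how the Splunk NOC implements its alerts: the function names, the SLI names and the threshold values are all hypothetical.

```python
from collections import Counter


def search_success_rate(results):
    """Fraction of searches that completed; `results` is a list of booleans."""
    results = list(results)
    if not results:
        return 1.0  # assumption: no searches in the window means no failures
    return sum(results) / len(results)


def sli_breaches(slis, thresholds):
    """Return the names of SLIs whose value has dropped below its threshold."""
    return sorted(
        name for name, value in slis.items()
        if value < thresholds.get(name, 0.0)
    )


def repeated_login_failures(events, max_failures=3):
    """Return users with at least `max_failures` failed login attempts.

    `events` is an iterable of (user, succeeded) pairs.
    """
    failures = Counter(user for user, succeeded in events if not succeeded)
    return sorted(user for user, n in failures.items() if n >= max_failures)
```

For instance, feeding `sli_breaches` a search success rate computed from recent search results, together with a per-SLI threshold table, yields the list of indicators that warrant a deep-dive investigation.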
     

The Splunk NOC monitors for suspicious or unexpected activity in any of these four areas, allowing Splunk to proactively reach out to customers when a potential issue is raised. The Splunk Dashboard Studio provides the visualizations that bring this all together for our NOC team, allowing multiple team members to identify and quickly communicate potential issues.

“The stack overview dashboards we have in our Splunk NOC allow us to get a fast overview of the entire cluster of servers and services per customer, so that we can quickly identify and work to resolve any customer problems”
Brenden Reeves, Splunk NOC

So What?

Outages happen, security incidents happen. Splunk capabilities can help you thrive amidst uncertainty. The Splunk Cloud Platform is critical to helping our customers drive stability across their ecosystems from a security, infrastructure and application perspective. Here at Splunk, we depend on Splunk Cloud Platform’s availability and resiliency as the bedrock of our own NOC. Splunk is dedicated to helping customers deliver business resilience and mitigate future risks. Our Splunk DNA drives us to innovate so that our service remains stable and reliable, enabling customers to investigate and solve problems fast.

Posted by Garth Fort

Garth Fort is the Senior Vice President and Chief Product Officer for Splunk. With over 25 years of product management experience, Garth is responsible for evolving Splunk’s market-leading product portfolio of software and cloud services. He has a true passion for driving product roadmaps across both established and emerging categories while successfully guiding software teams through high growth and transitions to the cloud. Prior to Splunk, Garth served as a general manager for Amazon Web Services (AWS) and led innovation for customers, independent software vendors and channel partners. He also held several leadership positions with Microsoft over 20 years and oversaw the worldwide ecosystem strategy and execution for its cloud and enterprise division, including Microsoft Azure, Windows Server, SQL Server and a broad portfolio of products for developers and IT professionals. He holds an A.B. from the University of North Carolina at Chapel Hill.
