As a tech company, Splunk regularly gets exposure to modern development practices, along with the ability to implement them with our own technology. We want to share that with you. In this post, part of our dogfooding series, I sat down with Ram Jothikumar, Head of Cloud Infrastructure & Operations for Observability at Splunk. Ram and I talk about how we support our SignalFx offering at scale, efficiently, with resilience and reliability baked in.
Chris: Hi Ram, first of all please introduce yourself and your team.
Ram: My name is Ram Jothikumar and I head Cloud Infra & Operations for Splunk’s Observability Product Group. My team focuses on four essential areas: Core Infrastructure, Site Reliability Engineering (SRE), Dev Productivity, and Quality Engineering. It goes without saying that we heavily embrace DevOps principles and practices.
Chris: And can you describe the product and infrastructure a little bit?
Ram: Sure. Our observability offering includes Infrastructure Monitoring and Application Performance Management (APM). Together these technologies enable customers to get real-time insights into the health of their infrastructure and applications. One unique aspect of the platform is that we do not sample data. This means we have to ingest data into the system at massive scale. We do that by building the platform on a microservices architecture. The platform was essentially “born in the cloud,” using cloud native technologies and deployed in public clouds right from the start. Both the platform and infrastructure are multi-cloud and architected to be agnostic to any specific public cloud provider. We currently support AWS & GCP and are deployed in 5 geographical regions across 3 continents. Being multi-cloud is very important because it gives our customers flexibility and choice, while at the same time giving us competitive pricing with public cloud providers as well as acting as a competitive advantage.
It’s exciting and challenging to be able to support this type of architecture at scale.
Methodology: “Must Start with the Right Principles”
Chris: Wow. It gives me anxiety thinking of how you would even begin to support it.
Ram: Well, it must start with the right principles, the right tools, and the right talent.
On principles, the first key principle or paradigm we embraced was Infrastructure-as-Code and its best practices. Its influence can be seen in how we deploy and modify infrastructure, in configuration management, and in change management.
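The core idea behind Infrastructure-as-Code can be sketched as declaring infrastructure as data and computing a change plan against the current state. This is a toy illustration in Python; real tooling such as Terraform does this declaratively, and every name and count below is hypothetical:

```python
# Toy Infrastructure-as-Code sketch: infrastructure is declared as data,
# and a reconciler computes the change set instead of an operator mutating
# servers by hand. All resource names and counts are hypothetical.
desired = {
    "kafka-broker": {"count": 9, "instance_type": "m5.2xlarge"},
    "cassandra":    {"count": 12, "instance_type": "i3.4xlarge"},
}

current = {
    "kafka-broker": {"count": 6, "instance_type": "m5.2xlarge"},
    "cassandra":    {"count": 12, "instance_type": "i3.4xlarge"},
}

def plan(desired, current):
    """Return the changes needed to move `current` to `desired`."""
    changes = []
    for name, spec in desired.items():
        if current.get(name) != spec:
            changes.append((name, current.get(name), spec))
    return changes

for name, old, new in plan(desired, current):
    print(f"~ {name}: {old} -> {new}")
```

Because the declaration is just text, it can be versioned, diffed, and code-reviewed like any other change.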
The second principle is that we carefully instrument SLAs at the system level as well as at the service level. We need to ensure stability and reliability for the entire pipeline at a macro level as well as at a detailed service level. Our team's job does not stop at the macro production metrics; we have to support the consumption and delivery of metrics across all services.
And finally, the third principle is that the person who wrote the code is best equipped to resolve an incident with that service. To put that into action, we have distributed on-call at the service level. While development teams are on-call for the Java and Go services they write, the SRE & Infra teams are on-call for OSS services like Elasticsearch, Cassandra, Kafka, and ZooKeeper, and infrastructure services like Kubernetes and Terraform.
From a tooling standpoint, we use the best monitoring tool in the industry (our own product) to monitor our production deployments. What I mean is that we leverage SignalFx as our observability tool as well. We have a dedicated instance of SignalFx to observe our production applications and infrastructure. This instance monitors thousands of nodes running hundreds of microservices using rich, high-fidelity metrics.
Having the right talent to support such a complex platform and infrastructure is important as well. The SRE team has engineers experienced in operating large web-scale platforms, and both the SRE & Infrastructure teams consist of folks with software engineering backgrounds, which is instrumental in building well-engineered solutions for challenges in these areas.
Metrics: “Service Level, and Contract Metrics Become the Foundation”
Chris: Yeah, I’m glad you mentioned that – it has been clear to me that high-performing engineering teams have a methodology in place first that guides how they approach automation and support applications. That framework allows them to handle the ever-increasing complexity that comes with modern architectures. But now, how do you approach monitoring and supporting these services?
Ram: Our team's expertise is aligned with operating systems at scale and having subject matter expertise in key areas. As mentioned earlier, SREs own the OSS services that are critical to our platform and fully operate and monitor them. For the services the development teams build, SREs have focus areas among them, and they work closely with the service owners to determine, first, what quality and success metrics they care about, and second, what metrics dependent services care about, which is essentially a contract. These are usually 2-3 key metrics (which we call Service Level Objectives) that are closely monitored at the service level. Both the service-level and contract metrics become the foundation of what we measure when the services are in production.
Chris: Are they usually unique?
Ram: Surprisingly, yes, they often tend to be unique at the service level. That is because each service offers a specific subset of functionality, and the quality of that functionality has different criteria. It could be error-, duration-, latency-, or performance-based. The bottom line is to identify the high-quality metrics that denote the health of the service as a whole and the metrics that denote contract requirements of dependent services.
New Code: “SREs and Devs Work Closely Together”
Chris: As you said, we have a lot of services out there, but what happens if we add a new one?
Ram: Because we work from DevOps principles, we are very process driven. As a new service comes online, we have production readiness criteria that the service is measured against. It does not go into production until it meets them. This is the juncture where our SREs and Devs work very closely together.
Chris: Once a new service is packaged and ready to go, what gives the confidence that it won’t break something that is already out there?
Ram: SLOs (Service Level Objectives) and SLAs (contracts to dependent services) come into play here. As long as these are monitored for existing services and new services add to them, the impact of adding new services is significantly reduced. A healthy CI/CD system with quality gates helps with this as well.
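A quality gate of the kind Ram mentions could, for instance, compare a candidate build's measured metrics against the contracts its dependents rely on. This is a sketch with illustrative metric names and thresholds, not the actual pipeline:

```python
# Hypothetical CI/CD quality gate: block a deployment when a candidate
# build's measured metrics violate the contracts (SLAs) that dependent
# services rely on. Metric names and limits are illustrative only.
contracts = {
    "p99_latency_ms": ("max", 250),   # dependents expect p99 under 250 ms
    "error_rate":     ("max", 0.001), # and an error rate under 0.1%
}

def gate(measured):
    """Return (ok, violations) for a candidate's measured metrics."""
    violations = [
        name for name, (kind, limit) in contracts.items()
        if kind == "max" and measured.get(name, float("inf")) > limit
    ]
    return (len(violations) == 0, violations)

# A build that regresses latency is held back before it reaches production.
ok, why = gate({"p99_latency_ms": 310, "error_rate": 0.0004})
```

Treating a missing metric as a failure (the `float("inf")` default) means an uninstrumented service cannot slip through the gate.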
Chris: How do you maintain the quality of the alerting mechanisms that use the SLOs and SLAs?
Ram: By using Infrastructure-as-Code practices. For example, in our platform we have detectors. They are the set of rules that watch for an anomalous event and fire a notification upon detection. While it’s possible for teams to experiment with detector logic and settings in the UI, before these can end up in our production monitoring instance they have to go through the same review process that any code would. We call this “detectors as code”: detector logic is scripted, versioned, tested, and reviewed before it is ready to monitor the platform.
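As a rough illustration of the “detectors as code” idea (the class, metric name, and thresholds below are hypothetical, not SignalFx's actual detector API), a rule can live in a versioned source file where it is reviewed and unit-tested like any other code:

```python
# Hypothetical "detector as code": the alert rule lives in a versioned
# source file so it can be code-reviewed and unit-tested before it ever
# reaches the production monitoring instance. Names are illustrative.
from dataclasses import dataclass

@dataclass
class Detector:
    metric: str
    threshold: float
    duration: int  # consecutive datapoints over threshold before firing

    def fires(self, datapoints):
        """True when the threshold is breached for `duration` points in a row."""
        streak = 0
        for value in datapoints:
            streak = streak + 1 if value > self.threshold else 0
            if streak >= self.duration:
                return True
        return False

# A brief spike does not page anyone; a sustained breach does.
cpu_high = Detector(metric="cpu.utilization", threshold=90.0, duration=3)
assert not cpu_high.fires([95, 40, 96, 50])   # isolated spikes
assert cpu_high.fires([85, 91, 93, 97, 60])   # three points over 90
```

Because the firing logic is plain code, a review can catch a noisy threshold before it pages an on-call engineer at 3 a.m.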
Collaboration: “Be Very Deliberate About How You Operate”
Chris: Even with all that automation, it seems like with so many services, each with their own lifecycle and dev team, collaboration must be a big challenge.
Ram: Not really. When it comes to working with other teams, you need to be very deliberate about how you operate. Our processes and automation help a lot, but planning is key. The first thing is that we have quarterly product planning sessions (PIs). These let everyone in the engineering organization know what is coming. For our team, this is when we get a better understanding of what is being developed and on the roadmap for release. This is also when we actively engage with a service team to start crafting a path to supporting the new service in production. Because service owners are going to be on-call, that process is as important to them as it is to us. The PIs are a great way to get alignment. By having automation and processes, we get to focus our attention on collaborating and keeping on top of what is in production.
Chris: Speaking of production. Once services are live, how do you visualize and share what happens? It seems like it could be a flood of metrics and dashboards. Does everyone consume all of these?
Ram: That is a good point; any given service might have 100+ metrics that are collected. But the consumer of the service does not care about all of that. The consumer cares about the service doing well enough to handle their use case. So the broadly shared metrics are the key boundaries for that service. At the service level we allow each team to build their own dashboards. It sounds like it could just be dashboard sprawl, but we implement standards and best practices for those dashboards so that anyone looking at them can get a high-level understanding of those boundaries. This includes having standard metrics at the top of the dashboard that give an at-a-glance understanding of its status: red, yellow, or green. Dashboards also list their owners and include text explaining their use case.
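The red/yellow/green convention could be standardized as a tiny shared function so that every dashboard derives its status light the same way. This is an illustrative sketch; the target and margin are hypothetical:

```python
# Hypothetical shared standard for the top-of-dashboard status light:
# collapse a service's key availability metric into red / yellow / green
# so anyone can read any team's dashboard at a glance. Thresholds are
# illustrative only.
def status(availability, slo_target=0.999, warn_margin=0.0005):
    """Green with headroom over the SLO, yellow near it, red below it."""
    if availability >= slo_target + warn_margin:
        return "green"
    if availability >= slo_target:
        return "yellow"
    return "red"
```

Centralizing the thresholds is the point: if every team hand-rolled its own notion of "yellow," the at-a-glance guarantee would be lost.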
The macro-level dashboards are front and center for everyone, and owned by the SRE team. The SRE team helps drive which metrics, across all services, are key to determining status at a glance, and each component is wired up to a detector so that at any point someone can dive in and know where the data is derived from.
Incident Response: “Our On-call Strategy is Key to Reducing MTTR”
Chris: So you said before that service owners support their code. Does that mean they are on-call for their code in production?
Ram: That is correct. Service owners are part of the on-call rotation for their service. The obvious reason is that they are best suited to support it, but it also makes all the independent teams operate efficiently. Our on-call schedules are grouped first by closely related service groups. It’s important to make sure that we have 24x7x365 coverage, but we all know that being on-call is not fun, so the well-being of our engineers is also very important. To help with this, we strive to maintain an on-call frequency of one week per quarter and make every effort to put engineers on-call for service groups they work on.
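The cadence arithmetic is easy to sanity-check: with weekly shifts and roughly 13 weeks in a quarter, one on-call week per engineer per quarter implies rotations of about 13 engineers. A back-of-the-envelope sketch:

```python
# Back-of-the-envelope check on the stated on-call cadence: with weekly
# shifts, one on-call week per engineer per quarter requires roughly 13
# engineers in each rotation (a quarter is about 13 weeks).
WEEKS_PER_QUARTER = 13

def shifts_per_quarter(rotation_size, weeks=WEEKS_PER_QUARTER):
    """Average on-call weeks each engineer serves per quarter."""
    return weeks / rotation_size

assert shifts_per_quarter(13) == 1.0   # 13 engineers -> 1 week each
assert shifts_per_quarter(6) > 2       # smaller rotations page more often
```

This is why grouping closely related services into one rotation matters: it pools enough qualified engineers to keep the per-person frequency low.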
Our on-call strategy is key to reducing mean-time-to-recovery (MTTR) and making improvements to incident response.
Chris: Ram, your team covers a lot. Bottom line, they are ensuring quality, stability, and resilience across both the application and the delivery chain. Are there other aspects of supporting the application that your team thinks about?
Ram: Well, a big point, as you said, is resilience. We continuously have to think about what we do if/when there is an incident, and how to prevent incidents in the future. So the SREs also create standards for addressing incidents. These could be incident response standards, runbooks, and procedures. Or they could be playbooks for specific incident types, and even automated systems as the first step to resolution. We are constantly augmenting how we address incidents with automation.
Chris: Ram, this is great. Thank you for your time. I know you and your team are on-call as we speak, but I'm excited to be able to share with the world our DevOps approach and some of our strategies for managing a cloud-native application at scale.
Ram: Thank you for chatting with me.