Join us as we pursue our disruptive new vision to make machine data accessible, usable and valuable to everyone. We are a company filled with people who are passionate about solving problems using data and seek to deliver the best experience for customers. At Splunk, we’re committed to our work, our customers, having fun, and most importantly to each other’s success.
We are looking for a Site Reliability Engineer focusing on the SignalFx and APM product lines. Site Reliability Engineers at Splunk are hybrid software/systems engineers whose overarching goal is to ensure that Production Services are always up and running reliably. They are also responsible for improving Operational Efficiency, Utilization and System Resiliency of the Platform. They own Critical Open Source Software that our platform relies on, and are core participants in every significant engineering effort underway in the platform.
- Automate & operationalize engineering tasks on Backend Services: data migrations, performance tuning, capacity changes, etc.
- Monitor Capacity & Utilization and work closely with the Infrastructure team to orchestrate scale-up/down of Backend Services.
- Own & operate critical back-end Open Source Services like Cassandra, Kafka, Zookeeper, Elasticsearch, Druid, etc.
- Build tools and design processes that help improve observability and system resiliency of the SignalFx Platform.
- Triage Site Availability Incidents and proactively work towards reducing MTTR for customer-impacting incidents.
- Partner with Service owners to implement Service Level Metrics & Service Level Objectives that act as service level health indicators.
- Establish design patterns for monitoring, benchmarking and deploying new features for the backend services.
- BS degree in Computer Science or a related technical field, or equivalent practical experience.
- 5+ years of experience as a Site Reliability Engineer, Production Engineer or Backend Software Engineer for web-scale or similar platforms.
- Coding experience in one or more of Python, Bash, Go or Java.
- Experience building or operating high-performance distributed systems.
- Experience with one or more OSS technologies like Kafka, Cassandra, Zookeeper or Elasticsearch.
- Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way.