Chapter 2 | Build a Resilient Operating System Faster

Build a Resilient Operating System Faster

Trust is the foundation upon which we reach just a little higher and stretch a little further. Without trust, there are no risks taken, which means no exploration, experimentation, or advancement of the system (or society for that matter).

The advancement of the VictorOps service is largely based on trust: trust and confidence in the process of building, deploying, and operating software and services. Trust in the development process. Trust in the way software and services are deployed to customer-facing production environments. Trust that even when something goes wrong, we can recover extraordinarily quickly.

We must constantly explore new ways to maximize reliability and meet expectations while simultaneously innovating and improving our service. We are a data-driven rocket ship, constantly swapping out components and experimenting with processes and tools, iteratively learning and exploring more about the system’s “knowable” properties, all while the ship is in flight.

 

Four Key Assignments for Establishing a Culture of Reliability

The status quo for building and operating systems has long been for developers to hand off code to release engineers or operations teams to deploy and manage. Monitoring and alerting were afterthoughts, bolted on only in the production environment, if at all. Operations engineers and system administrators were paged for problems at any time, day or night. Taking a reactive approach to reliability no longer met our needs.

More modern methods and approaches to increasing reliability are available today, and they are better suited to how software and infrastructure are designed, built, and operated. Shifting our posture from reactive to proactive was the first change we needed to make.

 

In April of 2017, VictorOps kicked off official SRE exploration and documentation of our internal efforts and discussions regarding both reliability and scalability. Our documentation of this process would serve as both a historical record for VictorOps and a resource for customers, prospects, and the greater IT community.

For VictorOps, SRE and the associated efforts are ongoing. This text includes four key assignments or exercises that helped VictorOps establish footing and move forward with confidence on our own journey towards building a highly available, resilient, and reliable system and service that is constantly improving and bringing more value to end users.

 

Four Key Assignments for Establishing a Culture of Reliability:

• Identify “What Keeps You Up At Night?”

• Determine Value to Effort for Observing What Keeps You Up At Night

• Establish Blackbox Metrics and Service Level Expectations

• Make the Case For Chaos or a Game Day Exercise
