Chapter 6 | The Phases of an Incident: A Case Study

Outage: A Case Study Examining the Unique Phases of an Incident
 
Day One: Incident Detection

Around 4 p.m., Gary, a member of the support team at a growing company, begins receiving notifications from Twitter that the company is being mentioned more than usual. After wrapping up a few support cases, Gary logs into Twitter and sees that several users are complaining that they cannot access the service’s login page.

Gary then reaches out to Cathy, who happens to be the first engineer he sees online and logged into the company chat tool. She says she’ll take a look and reach out to others on the team if she can’t figure out what’s going on and fix it. Gary then files a ticket in the customer support system for follow-up and reporting.

Note that Gary was the first to know of the problem internally, but external users and customers detected the disruption first, and a sense of urgency did not set in until the Twitter notifications became alarmingly frequent. Even then, responding to support cases took higher priority for Gary in that moment, which extended the elapsed time of the detection phase of the incident.

Day One: Response

Cathy attempts to verify the complaint by accessing the login page herself. Sure enough, it’s throwing an error. She then sets out to figure out which systems are affected and how to get access to them. After several minutes of searching her inbox, she locates a Google Doc explaining how to connect to the server hosting the site, and is finally able to make progress.

As Cathy attempts to investigate what’s happening, she is met with delays while trying to access the system. The documentation appears to be correct, but it isn’t easily accessible. Too much time is wasted searching for critical information. How effective is documentation if it is difficult to find or out of date?

What small improvements can be made to shorten the time it takes to investigate, identify, and triage the problem? The response phase offers many opportunities to learn and improve.

Day One: Remediation

Upon logging in to the server, Cathy’s first action is to view all running processes on the host. From her terminal, she types:

cathy$ top

to display the running processes and how much CPU and memory each is using. Right away she spots a service she isn’t familiar with consuming 92% of the CPU. Because she doesn’t recognize the process, she’s hesitant to terminate it.
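
If a responder wants a snapshot that can be pasted into chat rather than an interactive view, a one-shot listing works as well. The commands below are a generic sketch, not part of the case study, and the exact output columns vary slightly by distribution:

# One-shot list of the heaviest CPU consumers, highest first
ps aux --sort=-%cpu | head -n 10

# Or run top in batch (non-interactive) mode for a single iteration
top -b -n 1 | head -n 20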

Cathy decides to reach out to Greg, who deals with this kind of thing far more often and may be able to shed light on what that process is and what the next best steps are. She pulls up her chat only to see that Greg isn’t available on Slack; he must be out of the office. Cathy then starts looking through her phone for a number to call. After a few moments of searching her contacts, she realizes she doesn’t have his mobile number. So she pulls up her email and searches for Greg’s name. At last, she finds his number in a shared Google Doc someone had sent out nearly a year ago.

Greg joins the investigation process and asks Cathy to update the status page. Not sure how to accomplish this, she nevertheless replies “Will do,” not wanting to burden Greg with explaining how this is done while he digs into higher-priority issues. Cathy is able to get in touch with Gary from support, and he walks her through updating the status page. Gary mentions that he’s now received 10 support requests and seen a few additional tweets regarding the site being down.

Cathy reaches out to Greg now that he’s signed on to chat to let him know the status page has been updated and to ask if there’s anything else she can do. Greg responds that he’s figured out the problem and everything seems to be clearing up now. Cathy confirms from her own web browser that she is able to get to the login page now, and Gary chimes in to say he’ll take care of updating the status page again.

When Cathy asks Greg in chat what he found and what he did, he says that he found an unknown service running on the host and killed it. Following that, he gracefully stopped and started the web services and archived a few logs so he can take a closer look at them later. He and his wife are celebrating their daughter’s birthday, and guests are beginning to arrive, so once service was restored he needed to get back to the celebration. Cathy offers to send out a calendar invite for them to discuss what went wrong and report the findings to management.

In total, the service was offline for approximately 20 minutes, but Gary informs the others that he has now received 50 support requests and people are still talking about the company on Twitter.

What would you have done had you discovered an unknown service running on the host? Would you kill it immediately or hesitate like Cathy?

Thankfully Greg was able to help restore service, but at what expense? A more severe problem might have forced Greg to miss his daughter’s birthday party entirely. How humane are your on-call and response expectations?

How mindful were the participants of the external users or customers throughout this phase? Often, customers are constantly refreshing status pages and Twitter feeds for an update. Were transparent and frequent updates to the end users made a priority?

This is a great opportunity for a discussion around the question “What does it look like when this goes well?” as first suggested in Chapter 1.

Day Two: Analysis

Cathy, Greg, Gary, and several additional members from the engineering and support teams huddle around a conference table at 10 a.m., with a number of managers hovering near the door on their way to the next meeting.

Greg begins by asking Cathy to describe what happened. Stepping the group through exactly what transpired from her own perspective, Cathy mentions how she was first alerted to the problem by Gary in support. She then goes on to explain how it took a while for her to figure out how to access the right server. Her first step after accessing the system was to check the running processes. Upon doing so she discovered an unknown service running, but was afraid to kill it as a remediation step. She wanted a second opinion, but explains that again it took her some time to track down the phone number she needed to get in touch with Greg.

Several engineers chime in with their opinions on what the service was and whether it was safe to stop. Greg then adds that those were his exact first steps as well, and that he didn’t hesitate to kill a process he wasn’t familiar with. Cathy asks, “Did you run top to see which process to kill?” Greg responds, “I like htop a little better. It’s easier to read.” “Hm, I haven’t heard of that tool. I’ll install it for the future,” Cathy says.
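
For readers who, like Cathy, haven’t used htop, it is typically available from the standard package repositories. The commands below are a generic sketch, assuming an apt- or yum-based host:

sudo apt-get install -y htop   # Debian/Ubuntu
sudo yum install -y htop       # Red Hat/CentOS; may require the EPEL repository
htop                           # launch the interactive process viewer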

Gary adds that he took over updating customers and other stakeholders through the status page and the support Twitter account, allowing Greg and Cathy to focus solely on restoring service. He says he also reached out to Debby, the social media manager, to ask her to help share the updates from their support account in hopes of easing the escalating chatter on Twitter.

Notice how Cathy describes the timeline in a matter-of-fact manner, rather than attempting to defend any part of it. Along the way, she points out friction in the process: hard-to-find documentation and contact information slowed down the recovery of service.

Cathy was transparent about what she saw, what she did, and, maybe most importantly, what she felt. Not killing the unknown service was a judgment call. It’s important to understand how she came to that decision.

While the timeline is being thoroughly discussed, a list is generated in a shared Google Doc displayed on the flat-screen TV at one end of the conference room. It includes summaries of communication interactions as well as anything the participants have learned about the system. The list of findings, or learnings, looks like this:

Learnings

1.) We didn’t detect this on our own. Customers detected the outage.

2.) We don’t have a clear path to responding to incidents. Support contacted Cathy as a result of chance, not process.

3.) It’s not common knowledge how to connect to the critical systems behind the service we provide.

4.) Access to systems for the first responder was clumsy and confusing.

5.) We aren’t sure who is responsible for updating stakeholders and/or the status page.

6.) A yet-to-be-identified process was found running on a critical server.

7.) Pulling in other team members was difficult without instant access to their contact information.

8.) We don’t have a dedicated area for the conversations that are related to the remediation efforts. Some conversations were held over the phone and some took place in Slack.

9.) Someone other than Greg should have been next on the escalation path so he could enjoy time with his family.

Armed with an extensive list of things learned about the system, the team then begins to discuss actionable tasks and next steps.

Several suggestions are made and captured in the Google Doc:

Action Items

1.) Add additional monitoring of the host to detect potential or imminent problems (a minimal example follows this list).

2.) Set up an on-call rotation so everyone knows who to contact if something like this happens again.

3.) Build and make widely available documentation on how to get access to systems to begin investigating.

4.) Ensure that all responders have the necessary access and privileges to make an impact during remediation.

5.) Establish responsibility and process surrounding who is to maintain the status page.

6.) Define escalation policies and alerting methods for engineers.

7.) Build and make widely available contact information for engineers who may be called in to assist during remediation efforts.

8.) Establish a specific communication client and channel for all conversations related to remediation efforts and try to be explicit and verbose about what you are seeing and doing. Attempt to “think out loud.”

9.) Come up with a way for engineers to communicate their availability to assist in remediation efforts to the rest of the team.
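
To make the first action item concrete, here is a minimal sketch of the kind of host-level check the team might start with. The script name, the 85% threshold, the oncall@example.com address, and the use of mail for alerting are all illustrative assumptions, not details from the case study; a real implementation would hook into whatever paging or chat tooling the team adopts.

#!/bin/bash
# check_cpu.sh -- minimal, illustrative CPU watchdog for a single host.
# Intended to run from cron every few minutes; replace the mail command
# with the team's paging or chat integration once one is chosen.

THRESHOLD=85   # alert when any single process exceeds this %CPU

# Heaviest CPU consumer as "<pcpu> <pid> <command>" (first row after the header)
read -r pcpu pid cmd <<< "$(ps -eo pcpu,pid,comm --sort=-pcpu | awk 'NR==2')"

# Compare the integer part of the CPU percentage against the threshold
if [ "${pcpu%.*}" -ge "$THRESHOLD" ]; then
  echo "High CPU: ${cmd} (pid ${pid}) at ${pcpu}% on $(hostname)" \
    | mail -s "CPU alert on $(hostname)" oncall@example.com   # placeholder address
fi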

As each action item is documented, someone in attendance takes ownership of it and offers to file a ticket so that time to work on it can be prioritized.

Bill, a member of the product team, acknowledges that these new action items are a high priority and suggests they be scheduled immediately, before something like this happens again. “We have a few SLAs in place, and I’m worried we are already causing concern for some important customers,” he says. Then he adds, “Wait! No one ever said anything about the root cause of this. Was it that unknown service that was running?”

Greg responds, “Yes and no. That service shouldn’t have been pegged like that, but there’s definitely more to the story. I Googled it last night after my daughter’s party and it has something to do with a backend caching service. I’m manually checking it every couple of minutes for now and I’m going to do some more research on it. The caching service touches a lot of things, including many services I’m not that familiar with, and until I can get a better understanding of how all of those services are interacting with each other, we need to be able to spot a similar problem quicker. Besides, we have scheduled work to replace that caching service pretty soon.” Bill mentions he thinks that work should be prioritized if it’s going to help. “That would be awesome,” Greg says. “I’ll file a ticket and prioritize this first while the others tackle the other tasks.”

Cathy offers to summarize the timeline of events, findings, and action items from the meeting and share the Google Doc with the entire company. She’s already had several people come by her desk asking about what happened last night, including some from the C-level. “I’ll also create something we can post to our status page to let our customers know what happened and what we are doing to make our service more reliable.” “Good call,” Gary says. “I can help you with that if you want.” The group decides that further investigation is needed to understand how the caching service played a role in the problem. Until then, attention will be focused on replacing that service and implementing countermeasures and enhancements that make the system as a whole much more available.

Recap

This example illustrated a fairly brief Sev2 incident in which the time to resolve was relatively low. Lengthier outages may result in exercises that go on longer. The total time of this post-incident exercise was just under 30 minutes, and it produced nine concrete tasks that should make a significant impact on the uptime of the site and the overall reliability of the service the business provides.
