Root cause analysis, or RCA, is the process of identifying the cause of a problem so measures can be taken to prevent that problem from happening again. RCA assumes it’s more effective to resolve problems by addressing the underlying cause rather than just the symptoms.
For a real-world illustration, imagine you notice that your car has consistently low engine oil. You can respond by just adding more oil whenever the levels dip, which will keep your engine lubricated and prevent wear from friction and heat. But you would just be treating the symptom — spending a lot of time and money in the process to keep your oil levels topped off — because the oil would inevitably run low again. Alternatively, you could take the car to a mechanic who could investigate many possible issues — a leak from a bad gasket, high oil consumption due to worn engine components, etc. — to identify the root cause. In this case, getting to the root cause of the problem fixes your engine so you won’t run low on oil again.
Every industry can use RCA, but it’s especially helpful in IT. RCA provides a systematic analysis process to identify problems within complex modern infrastructures accurately and quickly. It can also help with risk management and significantly reduce costs by helping teams identify the root of a problem before it has a domino effect on the rest of the system. RCA is so effective that it is mandated in many industries.
In the following sections, we’ll look at how to conduct a root cause analysis, outline principles and best practices to follow, and tell you how to get started with RCA in your IT environment.
How to identify a “root cause”
There is no single way to identify a problem’s root cause, and the process will vary across industries and organizations. In the context of software projects, RCA is usually conducted by a dedicated RCA team composed of personnel who are familiar with the problem and led by an RCA manager. This function is also sometimes called “incident response” and root cause analyses are then conducted as part of a post-incident review.
A basic framework includes the following steps:
- Identify the problem: The first step upon recognizing an issue is to define a problem statement and the symptoms (e.g., a machinery malfunction, a failed or faulty process, or human error). Once that’s done, it’s important to isolate any suspected contributing factors to contain the problem while you try to uncover the root cause.
- Collect data: Once the problem is identified, compile as much data as possible, including incident reports, evidence in the form of screenshots and logs, and interviews with anyone involved with the issue. Using this data, you can determine the sequence of events (especially any adverse events that led to the problem), the systems that were involved, how long the problem lasted and its overall impact.
- Determine root cause: The RCA team conducts a brainstorming session using techniques such as Fishbone diagrams, Pareto charts and other tools to ascertain the root cause. The RCA manager moderates the meeting, which should be collaborative and blameless.
- Implement the solution: The root cause may point to one or more solutions, and the RCA team has to determine which fix is best and when it should be delivered. Once the solution is implemented, it must be monitored to ensure it’s effective. This process is more formally called Root Cause Corrective Action.
- Document actions: A critical part of RCA is preventing the problem in question from reoccurring. Documenting the problem and its resolution so teams can reference it in the future is essential. The RCA team can also include recommendations for physical or process improvements as well as preventative actions in the documentation.
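The data collection step above can be sketched in code. The following minimal Python example assumes a hypothetical list of timestamped events gathered from logs, alerts and interviews; the field names and sample values are illustrative and not taken from a real incident. It sorts the events into a timeline and computes the time between the first and last recorded event.

```python
from datetime import datetime

# Hypothetical events collected from logs, alerts and interviews (illustrative data).
events = [
    {"time": "2024-03-01T10:15:00", "source": "alerting", "note": "Error rate alert fired"},
    {"time": "2024-03-01T10:02:00", "source": "deploy log", "note": "Config change deployed"},
    {"time": "2024-03-01T10:47:00", "source": "on-call", "note": "Config change rolled back"},
    {"time": "2024-03-01T10:05:00", "source": "app log", "note": "First timeout errors logged"},
]

def build_timeline(raw_events):
    """Return the events sorted chronologically, with parsed timestamps."""
    parsed = [{**e, "time": datetime.fromisoformat(e["time"])} for e in raw_events]
    return sorted(parsed, key=lambda e: e["time"])

timeline = build_timeline(events)
# Time between the first and last recorded event.
span = timeline[-1]["time"] - timeline[0]["time"]

for event in timeline:
    print(event["time"].strftime("%H:%M"), event["source"], "-", event["note"])
print("Minutes between first and last event:", span.total_seconds() / 60)
```

Ordering the evidence this way makes it easier to see which events preceded the first symptom, which is usually where the search for contributing factors begins.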
Three key steps for root cause analysis
The three key steps of root cause analysis come from the Six Sigma approach to quality management.
Six Sigma is a popular methodology for making business processes more effective and efficient, aiming to improve quality by finding defects, determining their cause, and improving processes to minimize the variability and increase overall consistency.
Six Sigma uses data-driven analysis methods and systematic approaches to meet improvement goals. One of these is a framework called DMAIC, used to improve existing business processes. Each letter stands for a step in the framework:
- Define the problem and the project goals.
- Measure in detail the aspects of the current process and its performance.
- Analyze the data to identify factors that impact process performance and determine the root causes of any problems.
- Improve the process by developing and testing solutions.
- Control how the process is done in the future until it’s stable.
In the “analyze” phase, Six Sigma employs five specific types of analyses to promote project goals: source, process, data, resource and communication analysis. Of these, source analysis attempts to find defects using a three-step RCA process:
- The open step: In the first phase, the project team brainstorms all possible explanations for the problem using techniques such as the cause-and-effect Fishbone diagram.
- The narrow step: During this phase, the project team narrows the list of possible explanations.
- The close step: During this phase, the project team validates the narrowed list of explanations for the problem.
Six Sigma can be used to improve ITOps and software development processes. Its tools and techniques can help identify the reasons for system failures, high defect rates, missed deadlines or any other problems that impact product quality, system performance and customer satisfaction.
Core principles of root cause analysis
Effective root cause analysis is guided by several core principles, most of which are reflected in the process steps outlined earlier, including the following:
- The primary goal of RCA is to identify the underlying cause of a problem so that teams can determine and take corrective action to eliminate it. Fixing the root cause can prevent the problem from recurring.
- Although RCA should be focused on correcting the problem’s root cause rather than just treating symptoms of the problem, it should not altogether ignore symptoms if addressing them can provide significant short-term relief.
- Incident investigation is a critical part of RCA, requiring a systematic approach and appropriate procedures that will yield accurate results.
- There is usually more than one root cause for any single problem.
- To achieve the most accurate understanding of a problem’s root cause, the analysis technique must establish a relationship between the identified problem, the root cause and contributing factors via a timeline or sequence of events.
- RCA should be blameless, focusing on how and why the problem occurred and on its primary cause. Participants should understand that RCA is not concerned with identifying who is responsible or laying blame, which in turn allows team members to participate fully in the process without fear of being blamed for mistakes.
- Conclusions about a root cause must be supported by factual evidence, not opinions, hunches or guesses.
- A real root cause may indicate multiple solutions.
- When considering all possible solutions to a problem, the goal is to prevent recurrence in the most efficient way and at the lowest possible cost.
RCA is a holistic approach to problem solving that should strive not just to discover the root cause, but provide enough factual context to suggest effective corrective action.
Using the cause-and-effect Fishbone diagram
The Fishbone diagram is a cause-and-effect diagram that visualizes the potential reasons behind a problem to help determine its root cause. Created in the 1960s by University of Tokyo professor Kaoru Ishikawa, the model is also known as the Ishikawa diagram, and it is considered one of the seven basic quality tools, per the American Society for Quality.
As its name suggests, the diagram depicts a fish skeleton lying on its side. The head, positioned on the right, represents the problem while the ribs extending off its spine represent categories of contributing factors. Bones extending from each of the ribs denote possible causes or causal factors within that category.
The Fishbone diagram follows a four-step process:
- On the head of the fish, write down the problem on which you’re conducting RCA.
- Identify as many categories or contributing factors to the problem as you can — four to six is usually a good number to start with, and you can add more as needed. Toyota popularized a classification system called the “6 Ms” — Man (or People), Machine, Material, Method, Measurement and Milieu (or Environment). These are still good categories to start with for many problems.
- Brainstorm possible causes for the problem and place them under the appropriate categories.
- Decide which causes to address first. Select one to three whose fixes are feasible to implement and likely to succeed.
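The four steps above map naturally onto a simple data structure. The sketch below is one possible representation, not a standard tool: a dictionary keyed by the “6 Ms” categories, with a hypothetical problem statement and invented causes. The helper names `add_cause` and `render` are assumptions made for this illustration.

```python
# Hypothetical problem and causes; the categories follow the "6 Ms" described above.
problem = "Checkout page times out"

fishbone = {
    "People": ["On-call engineer unfamiliar with the service"],
    "Machine": ["Database host low on memory"],
    "Material": [],
    "Method": ["No load test before release", "Manual rollback procedure"],
    "Measurement": ["No alert on query latency"],
    "Environment": ["Traffic spike during a sale"],
}

def add_cause(diagram, category, cause):
    """Place a brainstormed cause under a category, creating the category if needed."""
    diagram.setdefault(category, []).append(cause)

def render(problem, diagram):
    """Return a plain-text outline: the problem first, then each category and its causes."""
    lines = [f"Problem: {problem}"]
    for category, causes in diagram.items():
        lines.append(f"  {category}")
        lines.extend(f"    - {cause}" for cause in causes)
    return "\n".join(lines)

add_cause(fishbone, "Material", "Outdated third-party payment library")
print(render(problem, fishbone))
```

Because `add_cause` creates missing categories on demand, the team can add or split categories during the session without restructuring anything.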
For a Fishbone diagram to be effective, follow these best practices:
- Focus on causes, not solutions: Because the ultimate goal of RCA is to fix a problem, it will be tempting for teams to brainstorm solutions rather than causes. The facilitator can remind teams of the difference. “Add support staff,” for example, is a solution; “the support team is understaffed” is a cause.
- Don’t get hung up on choosing the “right” category: It’s common for people to become fixated on choosing the right category for each cause, which can slow down the process and stifle the free flow of ideas. The facilitator should emphasize that the cause is more important than how it’s categorized and that some causes may fall into multiple categories. It’s also not uncommon for one category to have a disproportionate number of causes. In these cases, you can split that category into sub-categories.
- Don’t stop until you’re out of ideas: There is no prescribed number of causes you have to identify to complete the Fishbone diagram. Only when the discussion dies down should you consider whether you have enough to wrap it up and move on to the solution phase of the RCA.
Root cause analysis tools & techniques
In addition to the Fishbone diagram, there are a variety of other tools you can use to conduct root cause analysis. Each tool has specific benefits that make it more or less suited to a particular situation. Some of the more popular include:
The 5 Whys: One of the most commonly used tools for conducting an RCA is the 5 Whys method. As the name suggests, it borrows the inquisitive approach of young children by encouraging you to repeatedly ask “Why?” after each answer to get to the root cause of a problem. It’s called “5 Whys” because it typically takes about five rounds of “Why?” to correctly identify the root of a problem, although it can take more or fewer depending on the issue. This tool is best used for problems with a single root cause.
To use the 5 Whys technique, follow these steps:
- Describe in writing the specific problem that needs to be fixed.
- Ask “Why?” the problem happened and write the answer below the problem description.
- If that did not find the root cause, ask “Why?” again and write that answer down.
- Continue in this way until the whole team agrees that you’ve uncovered the root cause of the problem.
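As an illustration of the steps above, the following Python sketch records a 5 Whys session as a chain of questions and answers. The problem and answers are hypothetical; in practice they come from the team’s discussion, and the last answer is only a candidate root cause until the team agrees on it. The function name `five_whys` is an assumption for this example.

```python
def five_whys(problem, answers):
    """Pair each successive "Why?" with its answer; the last answer is the candidate root cause."""
    chain = []
    current = problem
    for answer in answers:
        chain.append((f"Why? ({current})", answer))
        current = answer  # the next "Why?" is asked about this answer
    return chain, current

# Hypothetical example; the answers would come from the team's discussion.
problem = "The website was down for 30 minutes"
answers = [
    "The web server ran out of disk space",
    "Log files filled the disk",
    "Log rotation was not running",
    "The rotation job was removed during a server migration",
    "The migration checklist does not cover scheduled jobs",
]

chain, root_cause = five_whys(problem, answers)
for question, answer in chain:
    print(question, "->", answer)
print("Candidate root cause:", root_cause)
```

Writing the chain down this way keeps each answer tied to the question that produced it, which makes the reasoning easy to review later.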
Pareto charts: A Pareto chart is a combined bar and line chart, good for identifying the most significant factors when a problem has multiple causes. Factors are displayed as bars arranged in descending order and a line graph plots cumulative totals of each factor from left to right. In quality control, a Pareto chart is commonly used to identify the most common sources of defects or the most commonly occurring type of defect.
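The numbers behind a Pareto chart are straightforward to compute. The sketch below assumes a hypothetical defect log with one label per observed defect; it produces the descending counts (the bars) and the cumulative percentages (the line). The function name `pareto_table` and the sample data are illustrative.

```python
from collections import Counter

# Hypothetical defect log: one entry per observed defect, labeled by cause.
defects = (
    ["timeout"] * 42 + ["bad input"] * 27 + ["config error"] * 18
    + ["disk full"] * 8 + ["other"] * 5
)

def pareto_table(observations):
    """Return (cause, count, cumulative_percent) rows sorted by descending count."""
    counts = Counter(observations).most_common()  # already sorted, largest first
    total = sum(count for _, count in counts)
    rows, running = [], 0
    for cause, count in counts:
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows

for cause, count, cumulative in pareto_table(defects):
    print(f"{cause:<14}{count:>4}{cumulative:>8}%")
```

In this invented data set the first three causes account for 87% of all defects, which is the kind of concentration a Pareto chart is meant to expose.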
Scatter diagram: A scatter diagram, also called a scatter plot, plots pairs of numerical data, often alongside regression analysis, to reveal relationships between variables. It’s often used to graphically depict and test multiple potential causes uncovered through Fishbone diagrams or the 5 Whys method to see which ones have an impact on the problem.
To make a scatter diagram, you choose an independent variable (the potential cause) and a dependent variable (the problem). Then you observe the process to gather the measurement data that will be used to generate the diagram. When you have your data table, you plot the independent variable on the x-axis and the dependent variable on the y-axis. If the points form a clear line or curve, the two variables are correlated: an upward trend indicates a positive correlation and a downward trend a negative one. If the points form no clear pattern, the data show no apparent correlation between that cause and the problem you’re trying to solve. Keep in mind that correlation alone does not prove causation, so a strong pattern is a lead to investigate rather than a conclusion.
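A correlation coefficient can put a number on what the scatter diagram shows. The minimal sketch below computes the Pearson coefficient, which measures only linear association, for a hypothetical data set of concurrent users (the potential cause) against response time (the problem). The variable names and measurements are invented for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical measurements: concurrent users (x) vs. response time in ms (y).
users = [10, 20, 30, 40, 50, 60]
response_ms = [110, 135, 150, 190, 205, 240]

r = pearson_r(users, response_ms)
print(f"r = {r:.3f}")
```

A value near 1 or -1 suggests a strong linear relationship and a value near 0 suggests little or none. A curved pattern can be strongly related and still score low, so the coefficient complements the plot rather than replacing it.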
Common RCA best practices
Some best practices for root cause analysis include:
- Don’t make assumptions: An accurate root cause analysis relies on evidence-based decisions. Everyone involved should approach the problem without assumptions, focusing on using data and other factual evidence to support their hypotheses.
- Cast a wide net: A single problem can have multiple causes and contributing factors, and even a single root cause can kick off a complex sequence of events. It’s important to look at as many factors over as wide a time frame as possible to give yourself the best chance of uncovering the true cause of the issue you’re investigating.
- Keep RCA teams diverse: RCA teams should be composed of personnel from across roles to approach the problem from different perspectives and generate varied solutions.
- Keep teams small: Brainstorming sessions are most productive when they include about 10 people. Any fewer will reduce the number and diversity of ideas, and any more can impede flow and let a few dominant voices take over.
- Drill down: With each new piece of evidence, your analysis should drill down with increasing granularity. Like peeling an onion, each layer of inquiry should bring you closer to the root cause of the problem and increase your likelihood of fixing it.
- Create a safe environment: Many issues are ultimately caused by human error. Make sure that people are aware that RCAs aren’t a blame exercise or a way to find people to punish; ideally, remind everyone at each meeting that the problem is what needs to be fixed, not the people responsible.
- Implement preventive action: After completing a root cause analysis, the final step involves determining what documents should be updated, which processes need to be modified, who needs new training or retraining, and other considerations. Many of these will be determined by the RCA. The goal is preventive action that will ensure the resolved problem never recurs.
What’s after root cause analysis?
After completing a root cause analysis, the final step is to implement preventive action. This involves determining what documents should be updated, which processes need to be modified, who needs new training or retraining, and other considerations. Many of these will be determined by the RCA. The goal is preventive action that will ensure the resolved problem never recurs.
Root cause analysis is essentially a form of problem solving, so to get started you first have to know that there’s a problem. Fortunately, developers and ITOps teams already have a few ways of surfacing issues in place:
- Observability: Application Performance Monitoring tools keep you informed about app behavior, and the metrics they capture can alert you to slowdowns and other code performance issues. Logs are also a primary source of discovery when something isn’t right. With logs, it’s sometimes difficult to determine if an issue indicates a problem with infrastructure or code, but root cause analysis can help you get to the bottom of things.
- Dissatisfied customers: Customers complaining about a bad experience with an app or service may be the clearest indication that something is wrong. Complaints are often the first step in detecting a bug, slowdown or other performance issue and are a frequent catalyst for root cause analysis.
- Real user monitoring (RUM): Real user monitoring captures how actual users experience your app, allowing you to notice and fix problems before customers become frustrated and complain.
Each of these can alert you to infrastructure issues and provide the data you need to perform a systematic root cause analysis. To take advantage of that data, you’ll need a tool that can provide real-time visibility into your network, capture the data and present it in a form you can act on. Many monitoring and observability tools use machine learning to interpret and correlate events from the different device logs and reports produced by your infrastructure. Using these insights as part of root cause analysis can help you develop more effective solutions in less time.
The Bottom Line: RCA turns data about your infrastructure into insights
Root cause analysis is an essential process for uncovering why something went wrong — and even why something worked well — in your infrastructure. Establishing an effective RCA process takes time and effort, but it will pay off in more accurate and lasting problem resolution and create the conditions needed for your infrastructure to perform its best.