Automated IT event correlation is a powerful tool in any engineer's toolkit. It provides a way for the engineer to explain why something is happening in the environment and fix it quickly.
Below, we will explain some important best practices for correlating events in an IT environment.
Event correlation overview
Event correlation is the practice of identifying a connection between two or more events within an environment. For example, if a network engineer makes a change and the dependent application fails, the change most likely correlates with the failure.
An experienced technologist should be able to predict and find these correlations, but ideally, you want your best engineers to be working on planned projects rather than troubleshooting incidents. This is where automated event correlation comes into play – it enables Level 1 operators to find the root cause of issues and frees your best engineers to work on other tasks.
Best practices for IT event correlation
Now let’s turn to today’s best practices for correlating IT events.
Use a data source that is a source of truth
A configuration management database (CMDB) is an excellent source of truth for your data. It’s used to store configuration item information and attributes concerning the configuration item (CI), such as:
- Assets (which can include the operating system)
- Asset types
- Hardware information
These asset attributes are critical for understanding why a particular server or device may be impacting the environment. They’re typically tied to changes and incident tickets within your ticketing system.
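As a rough illustration, a configuration item can be modeled as a small record that other data sources refer to by hostname. The sketch below is written in Python; the field names, the `find_ci` helper, and the sample values are assumptions made for this example, not the schema of any particular CMDB product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConfigurationItem:
    """One CMDB record. Field names are illustrative, not a real CMDB schema."""
    ci_id: str
    hostname: str
    asset_type: str          # e.g. "server" or "switch"
    operating_system: str
    hardware_model: str


def find_ci(cmdb: list[ConfigurationItem], hostname: str) -> Optional[ConfigurationItem]:
    """Return the CI whose hostname matches (case-insensitive), or None."""
    wanted = hostname.strip().lower()
    for ci in cmdb:
        if ci.hostname.lower() == wanted:
            return ci
    return None


# Hypothetical sample data
cmdb = [
    ConfigurationItem("CI-001", "app-01", "server", "Linux", "hypothetical-model-a"),
    ConfigurationItem("CI-002", "core-sw-01", "switch", "vendor OS", "hypothetical-model-b"),
]

affected = find_ci(cmdb, "APP-01")
```

Matching on a normalized hostname is only one possible join key. In practice, whatever identifier your ticketing and monitoring tools share with the CMDB is the one to use.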
Leverage your ticketing data
Ticketing data can represent any changes or incidents that are entered into the ticketing system. It’s common to have a dedicated change management team that both:
- Reviews all upcoming changes within the ticketing system.
- Creates detailed plans prior to the execution of those changes.
Changes in the environment are one of the biggest contributing factors when it comes to major incidents. Ideally, the change owner would identify potential risks to dependent applications, but that doesn’t always happen. When a problem occurs and an experienced engineer is paged to a bridge call, they normally check for recent changes in the environment and attempt to rule those out as the root cause before they look any further.
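The "check for recent changes first" habit is easy to express in code. Here is a minimal sketch, assuming each change ticket has already been exported as a dictionary with a `completed_at` timestamp. The record shape, the `recent_changes` name, and the four-hour lookback are all hypothetical choices for illustration.

```python
from datetime import datetime, timedelta


def recent_changes(changes, incident_time, lookback=timedelta(hours=4)):
    """Return changes completed within `lookback` before the incident, newest first."""
    window_start = incident_time - lookback
    hits = [c for c in changes if window_start <= c["completed_at"] <= incident_time]
    return sorted(hits, key=lambda c: c["completed_at"], reverse=True)


# Hypothetical ticket export
changes = [
    {"id": "CHG-1", "ci": "app-01", "completed_at": datetime(2024, 1, 15, 3, 0)},
    {"id": "CHG-2", "ci": "db-01", "completed_at": datetime(2024, 1, 12, 22, 0)},
]
incident_time = datetime(2024, 1, 15, 4, 0)

suspects = recent_changes(changes, incident_time)
```

With this sample data only CHG-1 falls inside the window, so it is the first change to rule in or out.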
Onboard your repositories
Version control is a must-have for anyone who relies on software to run their business. This can include:
- Modifying configuration files on your logging tool
- Making changes to a primary application that your company relies on
These repositories identify when a change was merged and when the code was updated. It is critical to understand any changes or updates that have been implemented within the environment when trying to identify the root cause of a problem.
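One way to get merge times out of a Git repository is to export them in a delimited format and parse the result. The sketch below assumes the text was produced by something like `git log --merges --format='%h|%cI|%s'` (abbreviated hash, committer date in strict ISO 8601, subject). The sample lines are made up, and the committer date of a merge commit only approximates when a change reached production. Squash or rebase merges would not show up under `--merges` and need a different export.

```python
from datetime import datetime


def parse_merge_log(log_text):
    """Parse lines of 'hash|ISO-8601 date|subject' into a list of dicts."""
    merges = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two pipes only, so subjects may contain "|".
        commit, merged_at, subject = line.split("|", 2)
        merges.append({
            "commit": commit,
            "merged_at": datetime.fromisoformat(merged_at),
            "subject": subject,
        })
    return merges


# Hypothetical sample output
sample = """\
a1b2c3d|2024-01-15T03:00:00+00:00|Merge pull request: tune worker pool
d4e5f6a|2024-01-10T14:30:00+00:00|Merge pull request: update logging config
"""

merges = parse_merge_log(sample)
```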
Monitor your performance metrics
To meet the minimum standard for any modern monitoring strategy, you must understand and monitor performance metrics.
Monitoring alone won’t necessarily give you the best indication of the health of your system, as it’s not uncommon for some metrics to spike for a period and then return to normal levels. It’s critical to understand how this affects your application, though.
Not all performance problems make a significant impact, but you can still use this information collectively to explain why something impacted the environment in a certain way.
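To separate a brief spike from a sustained problem, one simple approach is to flag only the runs where a metric stays above a threshold for several consecutive samples. The following is a sketch under assumed values. The threshold of 90, the minimum run length of 3, and the sample CPU readings are illustrative and would need tuning for a real system.

```python
def sustained_breaches(samples, threshold, min_consecutive=3):
    """Return (start_index, end_index) pairs for runs where values stay above
    `threshold` for at least `min_consecutive` samples in a row."""
    runs = []
    start = None
    for i, value in enumerate(samples):
        if value > threshold:
            if start is None:
                start = i          # a new run begins here
        else:
            if start is not None and i - start >= min_consecutive:
                runs.append((start, i - 1))
            start = None
    # Close a run that is still open at the end of the series.
    if start is not None and len(samples) - start >= min_consecutive:
        runs.append((start, len(samples) - 1))
    return runs


# Hypothetical CPU percentages, one reading per interval
cpu = [35, 40, 95, 38, 42, 91, 93, 96, 94, 41]
runs = sustained_breaches(cpu, threshold=90)
```

Here the single reading of 95 is ignored as a transient spike, while the four consecutive readings above 90 are reported as one sustained breach.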
(Learn about observability, the ability to measure the internal states of a system.)
Stitch it all together
Now that you have gathered your CMDB assets, ticketing data, code repositories and performance metrics, the next question is “Where do I put them?”
One solution is Splunk, the unified observability and security platform that enables you to get maximum actionable insight from your IT systems and events. Splunk also provides a wide range of out-of-the-box functionality in the form of technical add-ons through Splunkbase.
Once you onboard these data streams to Splunk, you can stitch them together. It’s important to first standardize and aggregate your datasets into a usable form, then correlate them based on time.
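Outside of any specific platform, the "standardize, then correlate on time" idea can be sketched in a few lines of Python. The example below is not how Splunk works internally. It only illustrates mapping differently shaped records onto one event shape, converting timestamps to UTC, and sorting them into a single timeline for one asset. All field names and sample records are hypothetical, and treating timestamps without an offset as UTC is an assumption.

```python
from datetime import datetime, timezone


def normalize(source, record, time_field, ci_field, summary_field):
    """Map a source-specific record onto a shared event shape."""
    ts = datetime.fromisoformat(record[time_field])
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive timestamps are UTC
    return {
        "time": ts.astimezone(timezone.utc),
        "source": source,
        "ci": record[ci_field].lower(),
        "summary": record[summary_field],
    }


def build_timeline(events, ci):
    """Return all events for one configuration item, oldest first."""
    return sorted((e for e in events if e["ci"] == ci), key=lambda e: e["time"])


# Hypothetical records, each with its own field names and time format
ticket = {"closed": "2024-01-15T03:10:00", "host": "APP-01", "title": "CHG-1 completed"}
merge = {"merged_at": "2024-01-14T22:05:00-05:00", "service_host": "app-01",
         "subject": "Merge pull request"}
metric = {"ts": "2024-01-15T04:00:00+00:00", "hostname": "app-01", "msg": "CPU above 90%"}

events = [
    normalize("metrics", metric, "ts", "hostname", "msg"),
    normalize("ticketing", ticket, "closed", "host", "title"),
    normalize("repository", merge, "merged_at", "service_host", "subject"),
]

timeline = build_timeline(events, "app-01")
for event in timeline:
    print(event["time"].isoformat(), event["source"], event["summary"])
```

Normalizing time zones and hostnames before sorting matters here. Without it, the merge recorded at 22:05 in a UTC-5 time zone would appear to precede the other events by several hours instead of landing five minutes before the ticket was closed.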
For example, consider the following scenario: A developer planned and approved a change, then scheduled it for Monday at 3 AM. The developer merged the pull request and marked the change as complete. An hour later, you get paged about high CPU usage, and you’re notified that a critical application isn’t working as expected anymore.
In this case, you could simply take a look at Splunk and easily correlate the recent change with a particular asset in the CMDB. You would see that the CPU spiked right when the pull request was merged, so you could just roll back the changes and restore everything to working order.
Centralization is key to correlation
Manually correlating events from a diverse set of tools can be challenging, to say the least. That’s why you need a centralized logging tool that can do it for you.
This article was written by Steve Koelpin. Steve is a former Splunk professional services consultant and 5x Splunk Trust MVP. He specializes in Splunk IT Service Intelligence, Splunk Machine Learning Toolkit, and general Splunk development. While not behind the keyboard, he is best known as dad.
This posting does not necessarily represent Splunk's position, strategies or opinion.