Splunk Insights for Infrastructure is a new monitoring product from Splunk that combines metric- and log-based monitoring with ease of installation and use, in an inexpensive package that includes a free tier able to monitor about 50 servers. Take a quick video tour of the features, then download Splunk Insights for Infrastructure at http://splunk.com/insights-for-infrastructure
Splunk Insights for Infrastructure is an analytics-driven IT operations tool that unifies metrics and logs for troubleshooting and monitoring entities both on premises and in the cloud. Setup takes minutes, and in a short time you'll be collecting vital system metrics and logs from entities across your hybrid infrastructure.
The first step on the journey to infrastructure insight is getting data in. And with Splunk Insights for Infrastructure, that problem has been reduced to a couple of quick copy-and-paste commands for both metrics and logs.
From the Add Data page, you can configure which metrics and logs you collect from your virtual and physical infrastructure. As you can see, you have your choice of metrics, as well as the ability to add specific custom logs.
Dimensions allow you to specify any key-value pairs that help you identify and group your entities. You can think of these as the owner of the server, the environment in which the server runs, or the actual physical location of the entity. When you're ready to start sending your data to the server, you can just copy and paste this command, and collectd and the Splunk universal forwarder will be installed and configured for you. You'll start seeing data flowing in within five minutes.
Getting data in from AWS is even easier. Just enter your account information and select your data sources, and within a few minutes you'll be gathering data. Now that we have data coming in, let's look at how you can use metrics and logs to troubleshoot some common problems. We have some entities where we gather data using collectd, and some AWS EC2 instances along with their associated EBS volumes. We've had some reports of high CPU usage on some of our bastion entities, so the first thing we're going to do is create a group for those entities.
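The grouping idea above can be sketched in a few lines: dimensions are just key-value pairs attached to each entity, and a group is a filter over those pairs. This is an illustrative sketch only; the entity names and dimension keys (role, env) are hypothetical, not part of the product.

```python
# Each entity carries arbitrary key-value dimensions; a group is simply
# the set of entities whose dimensions match a filter. All names here
# are hypothetical examples, not real Splunk identifiers.
entities = [
    {"name": "bastion-01", "dims": {"role": "bastion", "env": "prod"}},
    {"name": "bastion-02", "dims": {"role": "bastion", "env": "prod"}},
    {"name": "web-01",     "dims": {"role": "web",     "env": "prod"}},
]

def in_group(entity, **wanted):
    """An entity belongs to a group if it matches every requested dimension."""
    return all(entity["dims"].get(k) == v for k, v in wanted.items())

bastions = [e["name"] for e in entities if in_group(e, role="bastion")]
print(bastions)  # ['bastion-01', 'bastion-02']
```

Because dimensions are free-form, the same mechanism covers grouping by owner, environment, or physical location without any schema changes.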
Using the Group Analysis Workspace, we're going to find our system user metric and look for spikes in max CPU utilization. And it looks like we have a couple of spots where CPU is spiking. Let's see if there are any associated log events.
So it looks like we also have log events that correlate roughly with the time of those CPU spikes. However, we can do more than just look at log counts during that period; we can go one level deeper and see the raw log events, so we can see what was happening on this host at that time. And it looks like it was a couple of authorized users logging in, so really nothing to be concerned about at this time.
But let's look at something a little more complicated. Going back to my Groups view, I have a typical three-tier web stack, and I've put it all together here in my website group. It contains all of my entities for the API, web server, and database tiers. I've put all these related entities into a single group in order to do a post-mortem on an outage that occurred several hours ago. There was a problem with the database, so let's start with the MySQL logs this time.
Some of these logs are complaining about a full disk, so let's take a look at what was actually happening on these hosts during this period of time. I'm going to go to my df metric and check df free. From here, I can split this out by host, so I can see the individual contributors to this metric. And if I hover over the MySQL host, I can see a spot where disk free definitely dropped below 25%.
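For readers who want to reproduce the same "df free" measurement outside the product, here is a minimal stand-alone sketch using only the Python standard library. The 25% threshold mirrors the walkthrough; the path and function name are illustrative.

```python
# Compute the percentage of free space on a filesystem, the same quantity
# the "df free" metric tracks, and flag it when it drops below 25%.
import shutil

def percent_free(path="/"):
    usage = shutil.disk_usage(path)  # (total, used, free) in bytes
    return 100.0 * usage.free / usage.total

pct = percent_free("/")
print(f"df free: {pct:.1f}%")
if pct < 25.0:
    print("disk free dropped below 25% -- investigate this host")
```

Splitting by host, as in the walkthrough, is just this same measurement taken on each entity and charted side by side.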
So let's go ahead and isolate and look only at that specific host. Yes, there was definitely a disk free problem here around this time, and this was also the host that had errors in the log file we were looking at just a moment ago. We can also add other logs to see what the impact was on the middleware and web tiers during this period of time.
And we definitely had a significant drop in traffic to our site during that period of time. So when I look at all of these factors together, there's a pretty clear correlation between this MySQL server running out of disk space and the drop in performance and overall activity on our website. We can now save this condition as an alert, so I can catch it in the future.
When I hit Create Alert, I get this modal. From here, I can set up my thresholds. Since I'm looking at disk free, I want to use the less-than threshold: anything at 50% disk free or less is a medium alert, but anything below 25% is a critical alert, and I need to know about that immediately. From here, I can also set contact information so that an email is sent to me whenever that alert condition is triggered. Speaking of alerts, let's see if we have any.
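The two-level threshold just described maps a metric value to a severity. A minimal sketch of that logic, assuming the threshold values from the walkthrough (at or below 50% is medium, below 25% is critical); the function name and severity labels are illustrative, not the product's API:

```python
# Map a disk-free percentage to an alert severity using "less than"
# thresholds: <= 50% is medium, < 25% is critical, anything else is OK.
def severity(disk_free_pct):
    if disk_free_pct < 25.0:
        return "critical"
    if disk_free_pct <= 50.0:
        return "medium"
    return "ok"

for pct in (80, 50, 24):
    print(pct, severity(pct))  # prints: 80 ok / 50 medium / 24 critical
```

Note the order of the checks matters: the critical test must come first, since any value below 25% also satisfies the medium condition.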
Going to the alerts feed, I can see all of my alerts grouped by entity. Now, since I just created my alert for my MySQL server and my disk free metric, I don't expect to see it right here. But it looks like my DHCP server has had 19 different alerts in the last hour. When I click on this particular entry, I get a little bit more information about what was going on.
So it looks like my system CPU was spiking multiple times over the last hour. Currently, though, it's in a stable state, as indicated by the current severity, which is green, meaning everything is OK based on the thresholds I have set up. However, I can investigate what was going on during this period of time to try to find a root cause.
Once I click Investigate, I'm taken immediately into the Analysis Workspace. From here, I can see an alert panel, which shows all triggered instances of the alert condition over time. I can also see my thresholds, and if I wanted to, I could edit them from here as well. What I really want to do is see if there is anything within my logs that gives me some insight into the root cause of what was happening on the server during this period of time.
Now, it looks like I have a cron log; let's see if that gives me any insight. There were a couple of log entries during this period, so let's go ahead and take a look at the raw events and see if we can find anything. And it looks like there's a load test script running, which is probably what's spiking my CPU usage. So I'm going to log into the server, shut down any running processes, and take a look at this cron script to see if there's anything causing it to hang.
There are multiple ways to monitor and troubleshoot your entities with Splunk Insights for Infrastructure. One of the other ways is through the Infrastructure Overview. So going back to Investigate, I choose the Infrastructure Overview icon.
The Infrastructure Overview gives you a bird's-eye view of your current entities. Here I can pick a specific metric; let's use that df free metric. Then I can look for entities that are crossing different boundaries. I'm going to set my minimum and maximum thresholds, and then set the threshold I'm looking for to 25%.
And since this is a disk free metric, I want to look for anything below that number, and I want it marked red. It looks like we don't have anything with less than 25% disk free. However, I can increment the threshold forward until I find one. Yep, there it is: just below 30%.
When I click through from this tile view into the Analysis Workspace, I can take a look at my df free metric. And since I'm on a single entity, I can split this out by device, so I can see which individual devices are most at risk of running out of space in the near term. With Splunk Insights for Infrastructure, you can monitor and troubleshoot entities with both logs and metrics, collected and made usable in minutes, giving you immediate insight into your hybrid infrastructure. These tools enable you and your team to root-cause problems without needing to learn any advanced search languages or go through a long and cumbersome setup process.