Have you ever received a call from one of your users saying, “My application is running too slow. I want a real server, not a virtual one”? Or how about a call saying, “All my virtual machines just got powered off!”? As the administrator for your VMware environment, your job is a tough one. You know your environment is really stable and VMware is saving you lots of cash … and yet your users tend to turn around and blame virtualization for their problems. Some of these problems are real, and many of them are imaginary. Wouldn’t it be great to know about problems in your hypervisor before they happen? How about reporting on historical data so people can see how your environment is really performing?
Well, the solution to your problems is right around the corner. Before I tell you what that is, a quick note about how Splunk can already solve at least some of your current problems. Splunk eats ALL IT data, and our customers tell us the number 1 thing they like about it is that it can correlate problems across many different layers in their IT environment. That is also the number 1 thing that makes Splunk great today for virtual environments: it can scalably eat data from every layer of the virtual environment. So you can keep tabs on how the applications inside your virtual machines are performing and correlate their problems with incidents in the ESX layer or at the server, storage, or network hardware level. How? Just let your applications log data to Splunk and configure your ESX servers to send their logs to Splunk over syslog. Splunk can also eat Windows/Linux virtual machine logs and configs, as well as logs from network devices (routers, switches, firewalls).
Okay, so now you have all this data in Splunk. What can you find in ESX logs that you wouldn’t have found otherwise? Here’s an example: you get that call saying, “My application is running too slow!” You look at the aggregate log data from ESX and it shows a bunch of “SCSI aborts” from the ESX server your application is running on. Translated, “SCSI aborts” means that your virtual machines are facing contention while accessing shared storage and the hypervisor is giving up. If you had a scheduled alert search for this across your ESX environments, you would have been notified even before you got the call!
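To make the idea behind that alert concrete, here is a rough Python sketch of the logic: count the log lines that mention SCSI aborts per host and flag any host that crosses a threshold. In Splunk you would do this with a scheduled search rather than a script. The sample log lines, the host names, the function names and the threshold of 3 are all made up for illustration, and real ESX syslog messages will look different.

```python
from collections import Counter

# Hypothetical sample lines in classic syslog layout (month, day, time, host, message).
# The message text is invented; real ESX log lines will differ.
SAMPLE_LINES = [
    "Jun  1 10:00:01 esx01 vmkernel: SCSI abort on shared LUN (made-up message)",
    "Jun  1 10:00:04 esx01 vmkernel: SCSI abort on shared LUN (made-up message)",
    "Jun  1 10:00:09 esx01 vmkernel: SCSI abort on shared LUN (made-up message)",
    "Jun  1 10:00:10 esx02 vmkernel: SCSI abort on shared LUN (made-up message)",
    "Jun  1 10:00:12 esx02 vmkernel: routine message (made-up message)",
]


def count_scsi_aborts(lines, marker="scsi abort"):
    """Count lines containing the marker text, grouped by the syslog host field."""
    counts = Counter()
    for line in lines:
        if marker in line.lower():
            fields = line.split()
            # Assumes the host name is the fourth whitespace-separated field.
            if len(fields) >= 4:
                counts[fields[3]] += 1
    return counts


def hosts_over_threshold(counts, threshold=3):
    """Return the hosts whose abort count is at or above the threshold, sorted by name."""
    return sorted(host for host, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    for host in hosts_over_threshold(count_scsi_aborts(SAMPLE_LINES)):
        print(f"ALERT: {host} is reporting repeated SCSI aborts")
```

The threshold is the knob to tune: set it too low and you get paged for one-off aborts, too high and you only hear about contention after your users do.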
How about another example? Applications running on a particular host mysteriously get powered off. You look at your aggregate ESX logs and notice that syslog entries from that host’s IP address went missing for more than 12 seconds but less than 15 seconds. This is an isolation response issue with VMware HA, and you’ve managed to figure it out in seconds! In fact, you can create an alert to detect this particular situation in advance and save yourself and your users a whole lot of pain and suffering.
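The gap check itself is simple enough to sketch in a few lines of Python. This is only an illustration of the logic, not how Splunk does it internally: it assumes you already have (host, timestamp) pairs extracted from the syslog stream, and the function name, IP addresses and timestamps are hypothetical. The 12 and 15 second bounds come straight from the example above.

```python
from datetime import datetime, timedelta


def find_isolation_gaps(events, low=12.0, high=15.0):
    """Return (host, gap_start, gap_seconds) for every silence longer than
    `low` seconds and shorter than `high` seconds between consecutive events."""
    by_host = {}
    for host, stamp in events:
        by_host.setdefault(host, []).append(stamp)

    gaps = []
    for host, stamps in sorted(by_host.items()):
        stamps.sort()
        # Compare each event with the next one from the same host.
        for earlier, later in zip(stamps, stamps[1:]):
            seconds = (later - earlier).total_seconds()
            if low < seconds < high:
                gaps.append((host, earlier, seconds))
    return gaps


if __name__ == "__main__":
    start = datetime(2009, 6, 1, 10, 0, 0)  # arbitrary illustrative start time
    sample = [
        ("10.0.0.5", start),
        ("10.0.0.5", start + timedelta(seconds=5)),
        ("10.0.0.5", start + timedelta(seconds=18)),  # 13 second silence before this
        ("10.0.0.6", start),
        ("10.0.0.6", start + timedelta(seconds=4)),
    ]
    for host, gap_start, seconds in find_isolation_gaps(sample):
        print(f"{host}: no syslog for {seconds:.0f}s starting at {gap_start}")
```

Note that this only works if each host normally logs more often than every 12 seconds; on a quieter host a gap of that size is not a signal on its own.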
So what is around the corner for us? Well, as a VMware admin, troubleshooting is likely your middle name, and you need metrics for it. You need something to show you %CPU ready and %CPU wait times that indicate when your CPU is overcommitted, memory sharing metrics that show when you’re running short, and the other things you would use esxtop for. Well, we plan to bring this data into Splunk using the VMware APIs (similar to our *nix and Splunk for Windows apps) and make it available for you to look at, report on, trend over time, save for posterity, or throw away, as you please!
If you want to participate in the beta for this, drop me a note at ljoshi AT splunk DOT com! Happy VMware splunking!
P.S. Many folks have emailed me about this: to splunk your VMware VirtualCenter (VC) logs, just have Splunk directly index:
C:\Documents and Settings\All users\Application Data\VMware\VMware VirtualCenter\Logs
A couple of our leading customers are presenting at VMworld on how to use Splunk (and other tools) for issue resolution in your VMware environment! Please vote for their session, “Issue Resolution with Pinpoint Precision.”