Last week, the Splunk team was at VMworld Copenhagen on a very exciting mission: monitoring LabCloud, VMware's private cloud environment that runs all the labs at VMworld. The Splunk team had taken on the task of providing end-to-end visibility across a subset of VMware's LabCloud environment. The technology tiers Splunk was monitoring included LabCloud (a custom app), VMware vCloud Director, VMware vSphere, VMware vCenter Server, NetApp storage, and the networking devices in the environment. The environment was enormously dynamic, with virtual machines being spun up and torn down and host configurations changing frequently. Needless to say, Splunk passed with flying colors! Some of our key wins:
- In a cloud environment, much of the provisioning and de-provisioning is delegated to the user through vCloud Director (vCD), so monitoring the health of vCD becomes critical for finding issues in the completion of provisioning. And for doing this before the user calls you! The screenshot below shows some of the troubleshooting we did for vCD (you may need to click on the image to see it more clearly).
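For readers curious what this kind of vCD troubleshooting looks like in practice, a Splunk search roughly along these lines will chart provisioning errors in the vCD cell logs over time. The sourcetype and search terms here are illustrative placeholders, not necessarily the exact ones we used:

```
sourcetype=vcd_cell_log (ERROR OR FATAL) provision*
| timechart count AS provisioning_errors
```

A spike on a chart like this tells you a provisioning problem is brewing, usually well before the first user call.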
- Another thing we kept a close eye on was ESX host performance: any spikes or slowdowns in CPU, memory, storage I/O, etc. We pulled the metrics directly from the hosts, bypassing VC, which meant we could pull them at a much finer granularity and store them for as long as we wanted without worrying about impacting VC performance! And, of course, it meant we could correlate host and VM metrics with everything else in the environment. A portion of the dashboard we used is below:
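As a rough sketch, charting host CPU from those directly collected metrics can look like the search below. The sourcetype and field name are assumptions for illustration; yours will depend on how you index the host data:

```
sourcetype=esx_perf
| timechart span=1m avg(cpu_used_pct) by host
```

Because the data lives in Splunk rather than in VC, the one-minute granularity and the retention window are entirely up to you.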
- We also kept a close watch on the multiple VCs in the environment for successful and failed logins, processes taking too long, and the overall health of the VC servers. (I had to blank out the login names, for obvious reasons.)
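Here is a hedged example of the kind of search we ran against the VC logs for failed logins; the sourcetype, match string, and field names are illustrative, since the exact values depend on your VC version and field extractions:

```
sourcetype=vc_log "Failed login"
| stats count by user, src_ip
| sort - count
```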
- The chart below is possibly my favorite picture of all: as VMs moved from host to host, we charted the values of their key stats split by which host they were on. Talk about deep visibility into the health of VMs!
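To sketch how such a chart can be built: for a single VM, split a key stat by the host it was running on at the time. The VM name, sourcetype, and field names below are made up for illustration:

```
sourcetype=esx_perf vm_name="lab-vm-042"
| timechart span=5m avg(cpu_used_pct) by esx_host
```

Each host then shows up as its own series, so a vMotion appears as one line ending and another beginning.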
- We also scoured logs from every tier for problems: vprob errors, vMotions that didn't succeed, and so on from ESX hosts, VC servers, network devices, and the NetApp storage. A couple of examples from ESX/VC are below:
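A search in the spirit of what we ran, scanning ESX and VC logs for known trouble signatures and charting them by tier; the sourcetypes and error terms are illustrative assumptions rather than our exact search:

```
(sourcetype=esx_log OR sourcetype=vc_log) (vprob OR "migration failed" OR ERROR)
| timechart count by sourcetype
```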
- Since we were collecting network device and storage logs as well, we watched closely for errors and issues related to the hardware, too. Virtualization abstracts, and often exacerbates, hardware issues, and the example view below helped us keep track of those devices as well.
So, lessons learned? When things get cloudy, you need Splunk! Not only do you need it to watch what's going on right now, you need the ability to go back in time and see what changed and what patterns caused things to fail, so you can avoid those in the future. Very few technologies provide this level of flexibility (any data, any format, any scale), but it is increasingly a necessity for cloud environments.
As usual, if you'd like details on how we accomplished all this in just a couple of days, email me at ljoshi AT splunk.com. We're happy to share, so you too can monitor your VMware-based environments!