TIPS & TRICKS

Identifying Zombie, Chatty and Orphan VMs using Splunk App for VMware

Virtualization is difficult to manage given the complex moving parts, from storage to networking to hardware. When you have a dynamic VMware environment with Distributed Resource Scheduler (DRS) and High Availability (HA) enabled, virtual machines (VMs) in the environment can transition through multiple hosts and clusters and can potentially become unregistered. This can cause a VMware administrator to lose visibility into these VMs. In addition, each VM in a datacenter could cost from a couple hundred dollars into the thousands (http://roitco.vmware.com), depending on your environment and infrastructure costs.

In this blog post I will cover three types of VMs that can exist in your VMware infrastructure and require additional attention. The definitions of these VMs vary, but I’m sure you will be able to recognize them regardless of the name I give them.

Zombie VM : A virtual machine that uses less than a certain amount of CPU for a period of time. (Example: a VM using less than 5% CPU over a thirty-day period.) Since Zombie VMs run with very low CPU usage, they could be repurposed to run other applications when needed.

Chatty VM (opposite of Zombie) : A virtual machine that uses more than a certain amount of CPU for a period of time. (Example: a VM using more than 80% CPU over a week.) Chatty VMs are probably the ones being moved from one ESXi host to another by vMotion based on utilization.

Orphan VM : There are multiple definitions for this type of VM. Here are just some examples of what an Orphan VM can look like:

  • Virtual Machine that was unregistered from vCenter Server but still running within the environment unmanaged.
  • Virtual Machine that exists in the vCenter database but is no longer present on the ESXi host.
  • Virtual Machine that exists on a different ESXi host than expected by the vCenter Server.
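The three cases above boil down to comparing what vCenter Server believes with what the ESXi hosts actually report. Here is a minimal sketch of that comparison in Python. The function name, the dictionary layout (VM ID mapped to host name) and the sample IDs are all hypothetical and are not part of the Splunk App for VMware; in practice both views would come from your inventory data.

```python
def find_orphans(vcenter_inventory, host_inventory):
    """Compare vCenter's view of VM placement with what the hosts report.

    Both arguments are dicts mapping a VM ID to an ESXi host name
    (a hypothetical layout chosen for this illustration).
    """
    vcenter_vms = set(vcenter_inventory)
    host_vms = set(host_inventory)
    return {
        # Running on a host, but vCenter does not know about it.
        "unmanaged": sorted(host_vms - vcenter_vms),
        # Known to vCenter, but not present on any host.
        "missing_on_host": sorted(vcenter_vms - host_vms),
        # Present in both views, but on a different host than expected.
        "wrong_host": sorted(
            vm for vm in vcenter_vms & host_vms
            if vcenter_inventory[vm] != host_inventory[vm]
        ),
    }


# Made-up sample data, one VM per orphan case plus one healthy VM.
vcenter = {"vm-101": "esxi-01", "vm-102": "esxi-01", "vm-103": "esxi-02"}
hosts = {"vm-101": "esxi-01", "vm-103": "esxi-03", "vm-104": "esxi-02"}

orphans = find_orphans(vcenter, hosts)
for category, vms in orphans.items():
    print(category, vms)
```

Each key in the result lines up with one of the bullets above, so the same VM never lands in more than one category.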

On many occasions, actively running Orphan VMs are a security concern since they are not visible to vCenter Server, so the VMware administrator is unaware of them as well. These VMs will not be patched and can go undetected in compliance and operational audits.

Orphan VMs can occur for reasons such as the following:

  • After a vMotion or VMware DRS migration event.
  • After a VMware HA host failure occurs, or after the ESX host comes out of maintenance mode.
  • A virtual machine is deleted outside of vCenter Server.
  • vCenter Server is restarted while a migration is in progress.
  • Too many virtual machines are scheduled to be relocated at the same time.
  • Attempting to delete virtual machines when an ESX/ESXi host local disk (particularly the root partition) has become full.
  • Rebooting the host within 1 hour of moving or powering on virtual machines.
  • A .vmx file contains special characters or incomplete line item entries.

In order to gather information from a complex environment like VMware, we will need to collect performance, log and configuration data from vCenter Server and ESXi hosts.

Splunk App for VMware provides deep operational visibility into granular performance metrics, logs, tasks, events and topology from hosts, virtual machines and virtual centers.

Splunk App for VMware provides:

  • Proactive monitoring of your virtual infrastructure.
  • A visual interactive topology map of your virtual environment, highlighting problems and statistical comparisons based on predefined customizable thresholds.
  • Views that provide insight into how your environment performs, with details on performance, availability, security, capacity and change tracking.
  • Capacity Planning and Capacity Forecasting dashboards.
  • Correlation of VMware virtualization data with NetApp NFS datastores.
  • Views that show the operational health of your environment, identifying underperforming or distressed hosts, virtual machines, and datastores.
  • A security view that provides visibility into potential security breaches and non-compliant usage patterns.
  • The collection of granular performance metrics and log data all in one place, directly collected from VMware vCenter Servers and ESXi and vCenter logs (collected via syslog).
  • The ability to explore very large data volumes, at speed, with access to fast queries on performance data.
  • Track changes with visibility into VMware vCenter Server tasks and events in the context of your virtual environment.

 

[Screenshot: Splunk App for VMware dashboard]

Going back to the basics of core Splunk, we can create our own searches, reports, alerts and dashboards on top of any Splunk app. With these additional dashboards we can identify, validate and repurpose the VMs described above.

Let’s go ahead and identify Zombie, Chatty and Orphan VMs with a custom search.

(sourcetype=vmware:perf:cpu source=VMPerf:VirtualMachine) OR (sourcetype=vmware:inv:vm changeSet.name=*) | eval detect = if(p_average_cpu_usage_percent < 5.00, "zombie", if(p_average_cpu_usage_percent > 80.00, "chatty", "normal")) | stats first(detect) as "CPU Status" by moid
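To make the thresholding easier to reason about outside of Splunk, here is a small Python sketch of the same idea. It is an illustration only, not what the search runs: the function name, the constants and the sample data are assumptions. It also averages all samples in the window, as in the Zombie and Chatty definitions, whereas the search labels each VM from a single event via first().

```python
from statistics import mean

# Thresholds taken from the example definitions (percent CPU).
ZOMBIE_MAX_PCT = 5.0
CHATTY_MIN_PCT = 80.0


def classify_vm(cpu_samples):
    """Label a VM from its CPU usage samples (percent) over a time window."""
    if not cpu_samples:
        return "unknown"  # no performance data collected for this VM
    average = mean(cpu_samples)
    if average < ZOMBIE_MAX_PCT:
        return "zombie"
    if average > CHATTY_MIN_PCT:
        return "chatty"
    return "normal"


# Made-up samples keyed by a hypothetical VM ID.
samples_by_vm = {
    "vm-101": [1.2, 0.8, 2.5],
    "vm-102": [85.0, 92.5, 88.0],
    "vm-103": [30.0, 45.0, 20.0],
}

status_by_vm = {vm: classify_vm(s) for vm, s in samples_by_vm.items()}
for vm, status in status_by_vm.items():
    print(vm, status)
```

A VM sitting exactly at 5% or 80% is treated as normal here, matching the strict comparisons in the search.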

 


We can put together a very cool dashboard to show all the Zombie, Orphan and Chatty VMs.

 


Since the zombie and/or orphan VMs could be repurposed for other uses, we can calculate the total cost that removing or repurposing these troubled VMs would recover.

This could help you show your management how much real money you saved the business!

(sourcetype=vmware:perf:cpu source=VMPerf:VirtualMachine) OR (sourcetype=vmware:inv:vm changeSet.name=*) | stats first(detect) as "CPU Status" first(changeSet.name) as "VM Name" first(p_average_cpu_usage_percent) as "Avg CPU Usage" by moid | stats count(moid) as moid, count("VM Name") as vms | eval cost = (moid + vms)*$price$ | table cost
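The arithmetic behind such an estimate is simply the number of reclaimable VMs multiplied by a per-VM price. Here is a rough Python sketch under that assumption. The function name, the status labels and the price are hypothetical; price_per_vm plays the role of the $price$ dashboard token and should be replaced with a figure from your own environment.

```python
def estimate_savings(status_by_vm, price_per_vm):
    """Return (reclaimable VM count, estimated savings).

    A VM counts as reclaimable when its status is "zombie" or "orphan".
    price_per_vm is an assumed flat cost per VM, in whatever currency you use.
    """
    reclaimable = [
        vm for vm, status in status_by_vm.items()
        if status in ("zombie", "orphan")
    ]
    return len(reclaimable), len(reclaimable) * price_per_vm


# Made-up statuses keyed by a hypothetical VM ID.
vm_statuses = {
    "vm-101": "zombie",
    "vm-102": "chatty",
    "vm-103": "normal",
    "vm-104": "orphan",
    "vm-105": "zombie",
}

count, savings = estimate_savings(vm_statuses, price_per_vm=200)
print(f"{count} reclaimable VMs, estimated savings: {savings}")
```

Chatty VMs are left out of the total on purpose, since they are candidates for resizing or relocation rather than removal.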

 


Splunk can help your organization repurpose zombie and orphan VMs so that you get full value from your virtualization effort and keep it secure. Splunk can also help identify chatty VMs so you can move them to properly sized ESXi hosts.

Happy Splunking.

This blog post was jointly written by Tolga Tohumcu and Kam Amir…

----------------------------------------------------
Thanks!
Tolga Tohumcu
