This is a guest post contributed by Frank Larkin
Principal Engineer II Comcast, NETO.
For the last 2 years, Comcast “Video-On-Demand” has been transitioning from over 100 separate Video On Demand (VOD) systems (which we call snowflakes, because no matter how hard you try they will all be different) to a datacenter-based enterprise solution we call the Next Generation Video Control Plane (VCP). Along the way, our VCP solution vendor chose Splunk to assist in reporting and troubleshooting, and to provide visibility into a complex set of distributed systems so that we can deliver video content reliably to our customers. Unfortunately, Splunk did not have a good handle on working in an enterprise that makes use of Global Server Load Balancers, topology routing and multi-datacenter redundancy. This article discusses the path we took to making Splunk a true partner in our enterprise.
To be a part of the Comcast enterprise meant that the solution had to be built to survive a datacenter failure while maintaining access to all data previously collected. When paying customers are involved, we need to be able to determine root cause of an issue as quickly as possible. In the past, this has meant going to many places to “see” what happened. Now when something happens, Splunk is the first place we turn. Splunk now gives us the ability to provide data redundancy through re-indexing across datacenters. This allows us to know what happened even if we lose our primary indexers in a dead datacenter.
Our current version of Splunk is version 4.3.3 because, as I mentioned earlier, we are tightly coupled with our vendor. This means we have to coordinate our updates with them. We expect to get an update soon that includes Splunk 5.0. As I was writing this, I was concerned that this discussion might be moot with respect to newer versions of Splunk, but after reviewing the features of Splunk 5.0 and Splunk 6.0 I realized the features discussed in this document are still very relevant and hopefully useful.
Note: For security purposes, no real FQDNs, machine names, or datacenter names are used. The names presented are “similar” to the naming conventions we use.
Enterprise Video On Demand
Video On Demand has grown from a novelty in the mid 1990s to an integral component of Comcast’s offerings. Over the years, the number of individual VOD systems in our footprint has grown to well over 100 separate systems. This was clearly unmanageable. Our Next Generation Video Control Plane (VCP) solution will serve all media customers from VCP components located in three datacenters. For this discussion I will refer to these datacenters as east, central and west. The VCP design will allow us to provide service that is backed up by VCP components in the other datacenters. At any given time we can lose a datacenter and continue to operate. We can also move traffic at will to allow for easier maintenance.
After a period of RFP, demos, and trial systems, our VOD partner was chosen. As part of their proposal they incorporated Splunk. In VCP this is primarily used for reporting, ad-hoc queries and troubleshooting. Our “Galactic Splunk” system includes over 40 bare metal indexers being fed by over 200 Splunk forwarders on virtual machines. The vendor gives us major version updates as ISOs and new VM OVF files. Having Splunk already installed and ready to be configured on all components is a must.
While not new to Comcast, Splunk was new to the VOD group so we were very concerned about:
- Ensuring that little customization of the Splunk components is required. It is very important to have the configurations as “cookie cutter” as possible.
- Ability to use the topology-based routing that we use every day at Comcast. We did not want Splunk forwarders in the east indexing data across the country as a normal operation.
- Indexer redundancy across datacenters. While not something that happens very often, losing a datacenter could obviously drive a bad experience to many customers from many areas. We had to be able to answer the question, “What was happening before the DC went dark?”
- Multiple redundant analytics components. Our chosen vendor has their UI and analytics search heads tightly coupled. We can execute Splunk queries from within their UI.
- Multiple Splunk deployment servers to add redundancy to our solution.
Splunk Common Configuration
We are using the Splunk deployment servers to configure all Splunk components. When a new VM becomes active, the Splunk components (universal forwarders, search heads, and indexers) are designed to look for a common network “C Name” or alias like deploymentserver.comcast.net. This uses the capability of the deployment server to configure all these components from a single location. All components, from an install perspective, are configured exactly the same.
- All Splunk forwarders get the same configuration, same list of indexers, using the same ports.
- All Splunk search heads get the same indexer list that can be adjusted by scripts in the deployment server.
- All Splunk indexers have their base configuration set by the deployment server. Special configs are done again on the deployment server using our vendor-supplied scripts. This makes management of the system very easy. These configurations include several features that I’ll discuss later, including…
- Entering a list of indexers that the forwarders can send Splunk data to.
- Entering a list of primary indexers and their replica indexers.
- Entering the list of indexers that can be queried. This list can be changed on the Deployment Server as conditions with the indexers change. It is all managed at the Deployment Server.
This allows us to deploy new VMs that wake up and know where to get their Splunk configurations, configure themselves per their role, and go online doing their job. Common configs allow for easy troubleshooting. These capabilities are perfect for a “cookie cutter” implementation, but they only go so far. We use other processes to affect which components are used and when.
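To make the “cookie cutter” idea concrete, here is a minimal sketch of the one client-side file every VM would need in order to find the deployment server through the shared alias. The stanza names and the `targetUri` setting come from Splunk's `deploymentclient.conf`; the management port 8089 (Splunk's default) and the helper name `render_deployment_client` are assumptions for illustration, and this is not our vendor's actual tooling.

```python
def render_deployment_client(alias: str, mgmt_port: int = 8089) -> str:
    """Return deploymentclient.conf text that points at the shared alias.

    mgmt_port is an assumption: 8089 is Splunk's default management port.
    """
    return "\n".join([
        "[deployment-client]",
        "",
        "[target-broker:deploymentServer]",
        f"targetUri = {alias}:{mgmt_port}",
        "",
    ])


if __name__ == "__main__":
    # Every component gets the identical file, regardless of its role.
    print(render_deployment_client("deploymentserver.comcast.net"))
```

Because the file names only the alias, moving or adding a deployment server becomes a DNS change rather than a change on every VM.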
Topology Routing Using the GSLB and DNS
Since we have many indexers and many forwarders in many datacenters, how do we ensure the indexer data is being processed in the most network-efficient way possible? We don’t normally want eastern Splunk data to be indexed by a western indexer because of the distance. The answer is to use Global Server Load Balancers or GSLBs. These devices allow us to have all the forwarders route their requests to the indexers in their datacenters. If all indexers in a given datacenter are down, the forwarders can be automatically directed outside the home datacenter to indexers in other datacenters. They work in the following way:
In the GSLB we create what we call a “G Name” like vodindx.g.comcast.net. Associated with this name are all the primary indexers (I will discuss primary and replica indexers later) in all the datacenters. The GSLB performs health checks (discussed later) against these to determine if they are running and able to take on load. If they fail a health check, the GSLB will not include them in the picking list.
The forwarders are configured for Splunk load balancing. In this configuration they are given a list of indexers they can send data to. They then pick an indexer from the list, make a connection, send data to that indexer for a configurable amount of time, then drop the connection and start over. Splunk load balancing lets you provide the same name more than once to allow the forwarders to pick that indexer more often. We use the vendor-supplied scripts on the deployment server to configure an indexer list for our forwarders. This list has only 2 entries: vodindx.g.comcast.net and, again, vodindx.g.comcast.net. Because there are 2 entries in the list, all the forwarders run using Splunk load balancing, but they always end up calling out to vodindx.g.comcast.net (the GSLB) to pick their indexers, so the GSLB does the actual load balancing. We list the name twice to ensure that the forwarders regularly drop the connection and reconnect. In our experience, a single persistent connection is never dropped unless there is a fault, and we want more predictability when failing over a datacenter for maintenance.
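As a rough illustration of the two-entry list, the sketch below renders a forwarder `outputs.conf` in which the same G name appears twice. The `[tcpout]` stanzas and the `server`, `autoLB` and `autoLBFrequency` settings are standard Splunk forwarder settings; the group name `vod_indexers`, the receiving port 9997 and the 30-second interval are assumptions, not the values our vendor's scripts actually write.

```python
def render_outputs_conf(servers, group="vod_indexers", lb_seconds=30):
    """Return outputs.conf text for a load-balanced forwarder.

    group and lb_seconds are hypothetical values chosen for illustration.
    """
    return "\n".join([
        "[tcpout]",
        f"defaultGroup = {group}",
        "",
        f"[tcpout:{group}]",
        # Listing the same G name twice gives Splunk a multi-entry list,
        # so the forwarder periodically drops its connection and reconnects.
        f"server = {', '.join(servers)}",
        "autoLB = true",
        f"autoLBFrequency = {lb_seconds}",
        "",
    ])


# Port 9997 is an assumption (a commonly used Splunk receiving port).
G_NAME = "vodindx.g.comcast.net:9997"

if __name__ == "__main__":
    print(render_outputs_conf([G_NAME, G_NAME]))
```

Each reconnect is a fresh chance for the name to be resolved again, which is what hands the real choice of indexer to the GSLB.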
Now you may be asking yourself: how does the forwarder get routed to the appropriate datacenter? At Comcast we have set up topology routing based upon the location of the DNS servers. We have mapped all the DNS servers to the physical geographic state they are in, in the United States. We then map the states to a preferred and redundant datacenter. The GSLB now does the picking.
The process looks like this:
- A forwarder in the central datacenter disconnects from its last primary indexer.
- The forwarder picks a different indexer from its list. In our case it always tries to connect to vodindx.g.comcast.net. Since this is a Fully Qualified Domain Name (FQDN), the server needs to look this up from a DNS server. As you can imagine, at Comcast we have many DNS servers.
- A local DNS server gets the request. It takes the FQDN and looks at the domain and sees g.comcast.net. The DNS server looks inside itself and sees that it can resolve comcast.net but not g.comcast.net. It calls to an authority server that tells it that the GSLB can resolve g.comcast.net. The DNS server then reaches out to the GSLB for resolution.
- The GSLB gets the request. It knows the DNS server (by its IP address) is mapped to a geographical US state that the central datacenter is in or near. The GSLB then uses the State-to-Datacenter mapping and the indexer health checks to pick an appropriate indexer. The IP of this indexer is returned to the DNS server.
- The DNS server gets the response (IP of the indexer) and returns it to the forwarder.
- The forwarder opens a connection and begins to send the indexer its data. It will use this for the time period specified in the Splunk load balancer config. The next time the forwarder tries to open the connection, it may or may not have to resolve the name based upon the TTL (time to live) of the entry returned from the DNS server.
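The GSLB's part in the steps above can be sketched as a small simulation. Everything in it is hypothetical: the IP addresses (taken from documentation ranges), the state codes, the mapping tables and the function name `pick_indexer` are stand-ins, and the final fallback to any healthy indexer is my assumption about how a GSLB might behave, not a description of the actual product.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Indexer:
    ip: str
    datacenter: str
    healthy: bool = True  # result of the GSLB health check


# Hypothetical mapping: DNS server IP -> US state it sits in.
DNS_SERVER_STATE = {
    "198.51.100.1": "PA",
    "198.51.100.2": "IL",
    "198.51.100.3": "CA",
}

# Hypothetical mapping: state -> (preferred, redundant) datacenter.
STATE_DATACENTERS = {
    "PA": ("east", "central"),
    "IL": ("central", "east"),
    "CA": ("west", "central"),
}


def pick_indexer(dns_server_ip, indexers, rng=random):
    """Pick a healthy indexer based on where the asking DNS server is."""
    state = DNS_SERVER_STATE[dns_server_ip]
    # Try the preferred datacenter first, then the redundant one.
    for datacenter in STATE_DATACENTERS[state]:
        candidates = [i for i in indexers
                      if i.datacenter == datacenter and i.healthy]
        if candidates:
            return rng.choice(candidates)
    # Assumed last resort: any healthy indexer anywhere.
    healthy = [i for i in indexers if i.healthy]
    if not healthy:
        raise LookupError("no healthy indexers in any datacenter")
    return rng.choice(healthy)


if __name__ == "__main__":
    fleet = [
        Indexer("192.0.2.10", "central"),
        Indexer("192.0.2.20", "east"),
        Indexer("192.0.2.30", "west"),
    ]
    # A forwarder in the central datacenter asks via a nearby DNS server.
    print(pick_indexer("198.51.100.2", fleet))
```

Note that the decision is keyed on the DNS server's address, not the forwarder's, which is why a forwarder that ends up using a distant DNS server can be handed a distant indexer.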
This lookup can take a few milliseconds to a few seconds, but so far it has not been an issue.
This system is not 100% foolproof. There can be times when the GSLB returns an indexer in a different datacenter, for example if all the DNS servers nearest the requesting forwarder are busy or not responsive. If this happens, a more distant DNS server can be used. It is also possible that a DNS server is associated with a different datacenter. The result is that the forwarder may be told to write to an indexer far away. These are rare exceptions, and they do not really matter because all the data is indexed and queried together. The upside is that there will always be many indexers available for indexing.
In the second half of this blog post, I’ll go over how we configured redundant systems and tested the load capabilities of our implementation.