Faster, Smarter Resolution: The Incident Response Guide for Kubernetes
Inevitably, organizations that use technology (regardless of the extent) will have something, somewhere, go wrong. The key to a successful organization is to have the tools and processes in place to handle these incidents and get systems restored in a repeatable and reliable way in as little time as possible.
The solutions that an organization is able to implement will depend on three things: “people, process, and technology.” This phrase was coined by Harold Leavitt and encapsulates the truism that we often get so focused on the technology that we overlook the people who are involved with it (Leavitt, “Applied Organizational Change in Industry,” Carnegie Institute of Technology, Graduate School of Industrial Administration, 1962).
For incident response to be successful, people from all segments of an organization must have a stake in its operational life. This extends from the business units that use the application or provide client support to the various technical teams that handle the day-to-day maintenance of the application and the infrastructure it runs on.
The software development teams write code, define container dependencies, and kick off builds and deploys through the DevOps processes and tools, and they should be aware that they are responsible for what they produce. In modern software development, continuous integration and deployment lifecycles mean that a code change can be committed at 2 a.m. and deployed globally before the team lead even gets to the office in the morning. The escalation of incidents to development organizations should be as routine and hands-off as possible through integration with defect tracking systems. An after-hours call for help should be reserved for emergencies such as the discovery of a critical security flaw, ongoing data corruption, or a state of total application failure.
The term “The Business” is intentionally vague. It covers all of the non-technical areas of the company, from the people who approve the funding for those massive monthly Google Kubernetes Engine (GKE) bills, to the marketing teams who gather requirements, to customer service teams who actually speak to the customers experiencing problems, and everyone in between. It is essential to include these teams in any incident response improvements, since they are ultimately the people for whom the technology teams are building applications. The more they know, the less they will bother the technology teams, thereby enabling the latter to focus properly on responding to and resolving the incident at hand.
If you feel like you’ve said it 1,000 times, then say it 1,001, because there’s always someone who didn’t know. Awareness of both the status of all core systems and their performance (from perfect to degraded or unavailable) is key and will make it much easier to justify new expenses like replacing your twenty-year-old logging system with one that can handle cluster-wide logging from Kubernetes.
# docker ps -a | grep kube-apiserver
At this point you can retrieve the contents of stderr and stdout directly from the running container, using the container ID returned by the command above:
# docker logs <container-id>
$ kubectl get nodes
NAME           STATUS    AGE
192.168.1.10   Ready     2d
192.168.1.11   Ready     3d
Service discovery in Kubernetes happens in one of two ways: the first is through environment variables injected into each pod at creation time, which cannot be dynamically updated, and the second is via DNS records. In practice, nearly everyone uses DNS because it is dynamic and far more flexible.
Multiple components are involved in using DNS to support service discovery successfully, and troubleshooting can differ greatly depending on the products used. DNS can be configured within each running pod, on each node, at the cluster level, or as a single large external DNS service. The best option for you depends on your scenario and sensitivity to risk.
One common solution is to have containers use the local host for DNS resolution. The local host is configured with dnsmasq or an equivalent to handle DNS routing and caching. By handling DNS on each host, the cluster is more tolerant of minor disruptions on the control plane. The goal of the DNS routing is for external DNS requests to be sent to the proper external networks and for requests for internal services (typically using the .svc domain) to be resolved using the records served by the cluster's DNS service (kube-dns or CoreDNS). Resolving issues at this level can be as easy as testing DNS lookups from the command line of each node to see which one is giving the wrong answer and then investigating the configuration on that specific host.
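As a minimal sketch of that approach, you can run the same lookup from each node and compare the answers. The service name, node list, and captured lookup output below are hypothetical; the parsing step works on any saved nslookup output:

```shell
# In practice you would capture output per node, e.g.:
#   for node in node1 node2 node3; do
#     ssh "$node" nslookup my-service.default.svc.cluster.local
#   done
# Extract the answered address from a captured nslookup response:
sample_output='Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   my-service.default.svc.cluster.local
Address: 10.107.3.20'
answer=$(echo "$sample_output" | awk '/^Address/ {addr=$2} END {print addr}')
echo "$answer"
```

Any node whose extracted answer differs from the rest is the one whose local dnsmasq configuration deserves a closer look.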
A web UI (regardless of whether it is the default Kubernetes Dashboard or something more complex like Cockpit or Rancher) typically runs as a service that manages a few pods within the Kubernetes cluster. Most incidents can be handled with standard application troubleshooting tactics. The biggest exception is a problem with authentication, since the UI relies on a token from the cluster's RBAC subsystem. The Dashboard uses either a straight kubeconfig or a bearer token generated for an individual service account.
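To illustrate what that bearer token contains (using a fabricated token, not a real credential), a bearer token is a JWT, and decoding its middle segment shows which service account it represents:

```shell
# Build a hypothetical token with the three-part JWT structure: header.payload.signature
claims='{"sub":"system:serviceaccount:kube-system:dashboard"}'
payload=$(printf '%s' "$claims" | base64 | tr -d '=\n')
token="header.${payload}.signature"

# Decode the middle segment to see which account the token authenticates as
# (real JWTs use the URL-safe base64 alphabet; translate with `tr '_-' '/+'` first):
p=$(printf '%s' "$token" | cut -d. -f2)
pad=$(( (4 - ${#p} % 4) % 4 ))                # restore stripped base64 padding
decoded=$(printf '%s%s' "$p" "$(printf '%*s' "$pad" '' | tr ' ' '=')" | base64 -d)
echo "$decoded"
```

If the decoded subject is not the service account you expect, the authentication failure is with the token itself rather than the Dashboard.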
If the web UI is more complex (like Cockpit), then it can use more advanced authentication such as certificate-based authentication or an external identity provider (which could be anything from an htpasswd file to a full enterprise-class SSO platform).
Kubernetes clusters on public clouds increasingly use external load balancers by leveraging what the cloud offers (like ALB on AWS), since these are fully managed offerings. Depending on your requirements and where your cluster is deployed, clusters may instead run load balancers within the Kubernetes cluster itself (e.g., NGINX, HAProxy), while on-premises clusters often use existing application delivery controllers outside of the cluster (e.g., F5, NetScaler).
Load balancers handle ingress traffic to the Kubernetes cluster, routing it through defined Services that map to pods on the individual nodes. In the future, there may be even more uses for the combination of intracluster load balancing with service meshes.
While operators aren’t technically add-ons to Kubernetes, they are also not a core function in most distributions. An operator is essentially a custom application-specific controller that knows how to create, manage, upgrade, and destroy instances of that application. Operators can be written using Helm, Ansible, or Go. If an incident occurs around an operator, it is usually best to contact the providing vendor and upgrade to a later release. Even when operators are built using the same SDK, they have enough individual nuances that it is crucial to engage internal development or the vendor.
Service mesh functionality is an add-on for Kubernetes, even though it is commonly deployed and rapidly becoming standard.
Incidents that involve the service mesh will often be the result of misconfiguration or a problem with the sidecar proxy. Since every pod automatically has a sidecar proxy injected and all network traffic passes through it, this is the best place to start diagnosing issues.
Other common issues relate to the way in which traffic routing is configured when new versions are being deployed. Examples include when new instances are not rolling out fast enough to handle the increasing traffic volume or when old instances are being shut down before the traffic has been fully quiesced.
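Both failure modes can be mitigated on the deployment side. A minimal sketch, assuming a standard Kubernetes Deployment (the container name and all values here are illustrative, not prescriptive):

```yaml
spec:
  strategy:
    rollingUpdate:
      maxSurge: 25%        # bring up extra replicas before old ones go away
      maxUnavailable: 0    # never drop below full capacity mid-rollout
  template:
    spec:
      terminationGracePeriodSeconds: 60   # time allowed for traffic to drain
      containers:
        - name: app                        # hypothetical container name
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "10"]   # pause so endpoints deregister first
```

The surge settings address rollouts that can't keep up with traffic, while the grace period and preStop hook address instances being shut down before traffic has quiesced.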
Since most current service mesh products integrate seamlessly with cluster and application diagnostic tooling, managing the incident will follow a fairly standard flow. Issues that don’t involve the redeployment of entire applications can typically be resolved by the SRE team alone.
While some may consider a container registry to be a core service of Kubernetes, it is only needed for pulling new images in order to maintain the desired state within a running Kubernetes cluster, and Kubernetes imposes no requirements on which registry is used.
Internal corporate guidelines will differ widely, and there are quite a few options on the market, including Quay, Docker Hub, GitLab, ECR, ACR, and GCR. If a container registry is causing problems, this will be easily identified during the deployment process. In that case, the two most common error messages are version not found and invalid credentials. Less common errors include network timeouts and typos in the name of the requested object. In organizations with stricter security and quality controls, the deployment can also fail if the proper labels are not applied to the requested container.
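Those error categories can be sketched as a simple triage helper. The message strings below are typical of Docker/containerd pull failures, and the function itself is illustrative rather than part of any product:

```shell
# Bucket a registry pull error message into the common causes named above.
classify_pull_error() {
  case "$1" in
    *"manifest unknown"*|*"not found"*)          echo "version not found" ;;
    *unauthorized*|*"authentication required"*)  echo "invalid credentials" ;;
    *timeout*)                                   echo "network timeout" ;;
    *)                                           echo "other" ;;
  esac
}
classify_pull_error 'rpc error: code = NotFound desc = manifest unknown'  # version not found
classify_pull_error 'pull access denied: unauthorized'                    # invalid credentials
```

A first responder can run event messages through a rule like this to decide whether to page the registry owner or the team that holds the pull credentials.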
The tools used as part of the CI/CD pipeline or by the SRE team to build and deploy infrastructure (such as Ansible, Terraform, or CloudFormation) or to deploy code into a Kubernetes cluster (such as Azure DevOps, AWS CodeDeploy, or Travis CI) will vary by environment. The DevOps and development teams are responsible for figuring out why things fail at these steps. It can be as simple as a missing dependency or as complicated as deploying the incorrect version of a runtime to the platform. Since containers can run on ARM, x86_64, Power, and even IBM Z systems, using the proper runtime can make a huge difference.
While every team should be able to see the monitoring and alerting data that other teams are receiving, different teams will be better equipped to handle different types of components during an incident response. In any organization with any kind of scale, there will be an operational center of excellence (CoE) that is first in line to apply known fixes to known problems. If it cannot resolve the problem, it will escalate to higher tiers based on the type of incident. These operational centers can be anything from a command center, network operations center (NOC), security operations center (SOC), or service desk to a dedicated application support team for high-value apps.
The actual time that it takes to triage and escalate will be determined by a combination of factors, including severity, client-specific service level agreements (SLAs), the time of day, or even the day of the year (for example, 11:00 p.m. on the night before Black Friday matters more to the average retailer than 2:00 a.m. on any given Sunday).
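As a toy illustration of how those factors combine (the thresholds and tier names are invented for this example, not a standard):

```shell
# Pick an escalation tier from severity and the business calendar.
escalation_tier() {
  sev=$1; business_window=$2   # sev: 1 (critical) .. 4 (low); business_window: 1 = peak period
  if [ "$sev" -eq 1 ] || { [ "$sev" -le 2 ] && [ "$business_window" -eq 1 ]; }; then
    echo "page on-call immediately"
  elif [ "$sev" -le 2 ]; then
    echo "escalate within SLA"
  else
    echo "queue for business hours"
  fi
}
escalation_tier 2 1   # sev-2 on the night before Black Friday
escalation_tier 3 0   # low-severity issue on a quiet Sunday
```

The point is not the specific rule but that the rule is written down, so the CoE escalates the same way at 2:00 a.m. as it does at noon.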
This section offers a view into how various popular Kubernetes distributions and managed offerings handle metric collection and log management. It will also cover integrating the native tooling with an external Incident Response solution so that the metrics and events can be used to build alerts (which are a cornerstone of any incident response solution).
This is not an exhaustive list of Kubernetes platforms, but rather, a sampling of some of the most popular offerings. There are currently 40 Certified Hosted Kubernetes offerings and 58 Certified Kubernetes Distributions.
AWS has an interesting relationship with Kubernetes. Kubernetes has become the industry-standard orchestration engine for containers, but while AWS has made it available to customers through its EKS offering, it still prioritizes development and marketing of its in-house container management offering, ECS.
The easiest way to monitor the control plane of EKS for errors is to leverage CloudWatch for logs and CloudTrail to capture all of the API calls within EKS. To expose metrics from the worker nodes for use in monitoring, a tool called metrics-server needs to be deployed into the Kubernetes cluster. As is standard with many Kubernetes offerings, AWS recommends Prometheus for centralizing the tracking and trend analysis of the metrics gathered across the worker and control plane nodes.
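A common way to deploy it is to apply the manifest published by the upstream metrics-server project (verify the release against your cluster version before applying):

```shell
$ kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
$ kubectl top nodes    # returns per-node CPU/memory once metrics are flowing
```

If `kubectl top nodes` returns data, the metrics pipeline is working and Prometheus has something to scrape.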
Connecting from AWS native tools to external solutions for notifications is almost always done by configuring SNS topics. This allows for pushing data to webhooks, SMS, and email. When connecting to VictorOps or a similar tool, the result will be multiple SNS topics pointed at the same webhook. That webhook is the AWS CloudWatch integration point, from which you can filter and route based on message content within the VictorOps platform.
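For example, creating a topic and pointing it at a webhook with the AWS CLI looks like the following (the topic name, account ID, and endpoint URL are placeholders):

```shell
$ aws sns create-topic --name k8s-incident-alerts
$ aws sns subscribe \
    --topic-arn arn:aws:sns:us-east-1:123456789012:k8s-incident-alerts \
    --protocol https \
    --notification-endpoint "https://example.com/integrations/cloudwatch"
```

Each additional CloudWatch alarm can then publish to its own topic while every subscription points at that same webhook endpoint.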
Red Hat OpenShift is a product suite with multiple offerings based on Kubernetes and several other open-source products. It is a complete solution for enterprises looking to get up and running with containers, since it has all of the tooling required to go from source to production and can also update itself. The difference between the various products in the portfolio (such as ARO and OCP) is whether the deployment model is hosted or on-premises.
With all of its components, as well as its reliance on operators to handle updating and self-healing, there are a lot of moving parts that need to be watched and managed. OpenShift includes two tools that will be used as part of an incident response solution to generate alerts: Prometheus for event monitoring and trending, and an EFK stack (Elasticsearch, Fluentd, and Kibana) for log consolidation and visualization.
The primary way to integrate OCP with Incident Response solutions is through Alertmanager, the alert-routing component of the Prometheus stack. This is done by creating a route and receiver in the Alertmanager configuration.
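A minimal sketch of such a configuration (the receiver name and webhook URL are placeholders for your Incident Response endpoint):

```yaml
route:
  receiver: incident-response          # default receiver for all alerts
  routes:
    - match:
        severity: critical             # route critical alerts explicitly
      receiver: incident-response
receivers:
  - name: incident-response
    webhook_configs:
      - url: "https://alerts.example.com/integrations/generic"
```

Once the receiver is defined, additional routes can fan different alert labels out to different teams without touching Prometheus itself.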
Cloud management platforms offer a way to abstract Kubernetes management across multiple clusters and clouds. Rancher offers multiple products for Kubernetes, from its namesake main offering to its popular k3s project. Rancher itself is a cloud-independent way to create and manage Kubernetes clusters using its own distribution on premises. If you want to host applications across multiple public clouds, these platforms allow even small organizations to leverage and integrate offerings from providers like GKE and EKS.
Platform9 also prides itself on flexibility as a multi-cloud manager, and its reach extends to virtual machines in addition to Kubernetes. By combining this functionality with the ability to aggregate logs and integrate with leading application and log management tools, it can support organizations of any size moving toward a cloud native future.
All of these products typically build their tooling around open-source projects like Prometheus and Grafana in order to expose real-time metrics and provide alerting. They will often enhance the base open-source projects and tailor them to their offerings (such as Cortex from Weaveworks). These custom solutions will still integrate with any modern Incident Response product offering.