With Kubernetes emerging as the container orchestration platform of choice for many organizations, monitoring Kubernetes environments is essential to application performance. In the era of cloud computing and as-a-service delivery models, the impact of poor application and infrastructure performance is more significant than ever. How many of us today have more than two rideshare apps or more than three food delivery apps? With so much competition, customers quickly gravitate toward the applications that work and perform best.
Identifying issues in a microservices environment can also become more challenging than with a monolithic one, as requests traverse between different layers of the stack and across multiple services. Modern monitoring tools must monitor these interrelated layers while efficiently correlating application and infrastructure behavior to streamline troubleshooting.
What About Open-Source Options for Monitoring Kubernetes?
There are several open-source options for monitoring Kubernetes, including:
- Kubernetes health checks with probes
- Metrics API and Metrics Server
- The Kubernetes Dashboard
- Prometheus metrics
In some cases, these tools may be sufficient to understand a deployment through high-level performance metrics, but they can lack the sophisticated performance analytics and the persistent historical trends needed to maximize the performance and cost of your deployments.
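As an example of the first option above, health checks are declared directly on the pod spec. The sketch below shows a liveness probe (restart the container when it hangs) and a readiness probe (only route traffic when it responds); the pod name, image, and endpoint paths are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: health-check-demo      # illustrative name
spec:
  containers:
  - name: web
    image: nginx:1.25          # any HTTP-serving image works here
    ports:
    - containerPort: 80
    # Restart the container if this endpoint stops responding
    livenessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
    # Only send the pod traffic once this endpoint responds
    readinessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 2
      periodSeconds: 5
```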
How Can I Maximize the Performance of a Kubernetes Deployment?
To maximize performance for Kubernetes, it is important to understand both your microservices' infrastructure needs and their metrics. Optimizing your pods, which run on nodes that are in turn grouped into clusters, is key to a successful implementation. You’ll want to keep an eye on metrics that include which nodes pods are scheduled to, their CPU and memory consumption, and the resource limits for each of your workloads, which I'll discuss below.
Splunk Infrastructure Monitoring integrates with every layer of your environment to provide end-to-end observability for Kubernetes environments using the Splunk OpenTelemetry Collector. The collector provides integrated collection and forwarding for all Kubernetes telemetry types and is deployed using a Helm chart for Kubernetes. Once deployed, navigate to Infrastructure and click Kubernetes to find the essential metrics for monitoring your Kubernetes deployment, starting with the Kubernetes cluster map.
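As a rough sketch, the Helm deployment is driven by a values file along these lines. The cluster name, realm, and token are placeholders, and the exact value names should be verified against the splunk-otel-collector chart documentation for your chart version:

```yaml
# values.yaml for the splunk-otel-collector Helm chart
# (value names assumed from the chart's documented options)
clusterName: my-cluster        # placeholder: how this cluster appears in the UI
splunkObservability:
  realm: us0                   # placeholder: your Splunk Observability realm
  accessToken: <redacted>      # placeholder: your org access token
```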
The Kubernetes Cluster Map displays your Kubernetes infrastructure in an interactive map. The level of detail shown is dynamic: it depends on the number of elements in your environment, the filters you specify, and whether you zoom in to drill down for more detail. This gives us an excellent overview of cluster resource usage, node and pod availability and health, and any missing or failed pods. I personally like to start with the Kubernetes Analyzer, which suggests filters based on my deployment. In this example, the analyzer provides filters related to high-memory nodes, nodes containing pods found not ready, and more.
The analyzer allows us to quickly jump to the namespaces that may have deployment issues. When selecting a namespace affected by high memory utilization, the Kubernetes cluster map filters to the affected cluster and only shows the nodes and pods in this state. This filter allows me to drill down and see what’s going on with each node. In this example, we can quickly see that one of the two pods on this node is using far more memory than it should be.
With the affected node identified in the cluster, I can easily filter through the Kubernetes navigator to learn more about the workload, node, and pod details. Navigating to Node Details, I can filter by the affected node and display detailed charts. The example below shows the affected node’s details and the containers running on this node.
Moving over to Workloads, I can filter by the affected Workload Name and quickly discover the other pods deployed with this workload. In the example, we can see that six other pods are running the affected workload in question, autobot.
With the Workload (App) name in mind, we can navigate to Pod Detail to view detailed information about each pod running the workload. One critical metric stands out: Memory % of Limit shows "-%", reflecting that this workload, autobot, was deployed with no resource limits.
Resource limits define a hard cap on the resources (CPU and memory) a workload can consume when deployed, ensuring that a single process doesn't exhaust all the resources on the node.
How Can We Fix It?
Resource limits are set within the Kubernetes deployment file. In most cases, we find that developers include both a request and a limit for most applications. Requests define the guaranteed resources an application must have when deployed with Kubernetes; Kubernetes will only schedule (deploy) a workload on a node that can provide the requested resources. Limits define the maximum resources a workload may consume; the workload can use resources up to that limit but cannot surpass it.
Here is an example of what a Kubernetes deployment file looks like when requesting and limiting resources. The snippet has been filled out around the original container name; the resource values are illustrative and the image and other container fields are omitted:

```yaml
    containers:
    - name: httpgooglechecker
      # image and other container fields omitted
      resources:
        requests:
          memory: "64Mi"    # guaranteed at scheduling time
          cpu: "250m"       # 250 millicores = a quarter of a core
        limits:
          memory: "128Mi"   # hard cap; exceeding it gets the container killed
          cpu: "500m"       # hard cap on CPU time
```
After the pod is scheduled (deployed) with the example workload, we can use Splunk Infrastructure Monitoring to quickly identify whether resource limits were set on the pod and to see the active resource metrics measured against those limits.
When it comes to CPU limits, unless your workload is specifically designed to leverage multiple cores, it is usually best practice to keep the CPU request at one core or below and run more replicas to scale out. This provides more flexibility and reliability. Memory resources are defined in bytes; in most cases you give a mebibyte value, but you can specify anything from bytes to petabytes.
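Putting those two recommendations together, a sketch of a deployment that keeps its CPU request under one core (expressed in millicores) and scales out with replicas instead might look like this; all names and values are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scaled-out-demo      # illustrative name
spec:
  replicas: 3                # scale out rather than requesting multiple cores
  selector:
    matchLabels:
      app: scaled-out-demo
  template:
    metadata:
      labels:
        app: scaled-out-demo
    spec:
      containers:
      - name: app
        image: nginx:1.25    # illustrative image
        resources:
          requests:
            cpu: "500m"      # 500 millicores = half a core
            memory: "128Mi"  # mebibytes (binary); "128M" would mean megabytes
          limits:
            cpu: "1"         # at most one full core
            memory: "256Mi"
```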
When configuring resources, it is essential to remember that if the CPU or memory request is larger than the capacity available on any of your nodes, the pod will never be scheduled (deployed) by the scheduler.
Want to try this yourself? You can sign up for a free trial of the Splunk Observability suite of products, from the Infrastructure Monitoring discussed here to APM, Real User Monitoring, and Log Observer. Get a real-time view of your infrastructure and start solving problems with your microservices faster today.