Docker shook the DevOps world a couple of years ago. Containers ready for cloud architecture brought production operations closer to development and helped make microservices the backbone of a more flexible, aggressive approach to building software. The Docker movement gives product teams more freedom in their technology choices since they’re empowered to deploy and manage their applications in production themselves. However, operationalizing Docker can also mean more complexity, an abundance of infrastructure and application data, and a greater need for monitoring and alerting on the production environment.
Splunk Infrastructure Monitoring has been running Docker containers in production since 2013. Every single application we manage executes within a Docker container. Along the way, we’ve learned how to monitor our Docker-based infrastructure and how to get maximum visibility into our applications, wherever and however they run.
This is the first in a series of blog posts on monitoring Docker containers. In this post, I’ll discuss what matters when monitoring Dockerized environments, how to collect the container metrics you care about, and your options for collecting application metrics.
Asking the Right Question
Even as IT, operations, and engineering orgs come together around the value of and objectives for containers, one question endures: “How do I monitor Docker in my production environment?” The confusion arises because we’re asking the wrong question. Monitoring the Docker daemon, the Kubernetes master, or even the Mesos scheduler isn’t complicated. It needs to be done, and there are solutions for each of these.
Running your applications in Docker containers only really changes how they are packaged, scheduled, and orchestrated—not how they run. The question we should be asking then becomes: “How does Docker change how I monitor my applications?”
The answer, as is so often the case, is “it depends.” It depends on your environment, your use case, and your objectives:
- What orchestration technology do you use?
- What Docker image philosophy do you follow?
- What level of observability can you get from your Dockerized application?
Better yet, to understand the changes a microservices regime and Dockerized environment might cause for your monitoring strategy, you should first answer these four simple questions. Your answers may differ for each application and your approach to monitoring should reflect those differences.
- Do you want to track application-specific metrics or just system-level metrics?
- Is your application placement static or dynamic? (That is, do you use a static mapping of what runs where, or do you use dynamic container placement, scheduling, and bin packing?)
- If you have application-specific metrics, do you poll those metrics from your application, or are they being pushed to some external endpoint? If you poll the metrics, are they available through a TCP port you’re comfortable exposing from your container?
- Do you run lightweight, bare-bones, single-process Docker containers or heavyweight images with supervisord (or something similar)?
Getting Your Containers’ Metrics
If you need system-level metrics from your containers, Docker has you covered. The Docker daemon exposes very detailed metrics about CPU, memory, network, and I/O usage that are available for each running container via the /stats endpoint of Docker’s remote API. Whether or not you plan on collecting application-level metrics, you should definitely get your containers’ metrics first.
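To make the shape of that data concrete, here is a minimal Python sketch that derives CPU and memory percentages from one stats reading. The field names follow the JSON that the stats endpoint returns as I understand it, and the CPU calculation mirrors the one commonly used for `docker stats`. The sample values are invented for illustration, so treat this as a sketch rather than a reference implementation.

```python
import json

# Trimmed sample of a single stats reading. The field names follow the Docker
# Engine API; the numbers are made up for this example.
SAMPLE_STATS = json.loads("""
{
  "cpu_stats": {
    "cpu_usage": {"total_usage": 400000000},
    "system_cpu_usage": 20000000000,
    "online_cpus": 4
  },
  "precpu_stats": {
    "cpu_usage": {"total_usage": 300000000},
    "system_cpu_usage": 10000000000
  },
  "memory_stats": {"usage": 268435456, "limit": 1073741824}
}
""")


def cpu_percent(stats):
    """CPU used by the container between the previous and current sample."""
    cpu, pre = stats["cpu_stats"], stats["precpu_stats"]
    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    system_delta = cpu["system_cpu_usage"] - pre.get("system_cpu_usage", 0)
    if cpu_delta <= 0 or system_delta <= 0:
        return 0.0  # first sample, or counters did not advance
    # Scale by the CPU count so one fully busy core reads as 100%.
    return cpu_delta * cpu.get("online_cpus", 1) * 100.0 / system_delta


def memory_percent(stats):
    """Memory in use as a share of the container's limit."""
    mem = stats["memory_stats"]
    return mem["usage"] * 100.0 / mem["limit"]


if __name__ == "__main__":
    print("cpu %:", cpu_percent(SAMPLE_STATS))
    print("mem %:", memory_percent(SAMPLE_STATS))
```

In practice you would not compute these yourself. The point is only to show what the daemon already hands you for every running container.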
The best way to collect those metrics and send them to your monitoring system is to use collectd and the docker-collectd-plugin. For more information, check out our introductory blog post on Monitoring Docker at Scale with Splunk Infrastructure Monitoring.
Collectd on Each Host
The simplest and most reliable way of getting metrics from all your containers is to run collectd on each host that has a Docker daemon, with the docker-collectd-plugin configured to talk to the local Docker daemon on that host.
With Docker Swarm
If you’re using Docker Swarm, the Swarm API endpoint exposes the full Docker remote API and reports data for all the containers running in the swarm. This means you need only one collectd instance, with the docker-collectd-plugin pointed at the Swarm manager’s API endpoint, to collect metrics from every container running on your Swarm nodes.
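The per-host and Swarm setups differ only in the URL the plugin talks to. The Python sketch below renders the plugin's collectd configuration for either case. The directive names (`TypesDB`, `ModulePath`, `Import`, `BaseURL`, `Timeout`) are taken from the plugin's README as I remember it, and the install path, hostname, and port are assumptions, so check them against the plugin version you deploy.

```python
def render_docker_plugin_config(
    base_url,
    timeout_seconds=3,
    plugin_dir="/usr/share/collectd/docker-collectd-plugin",  # assumed install path
):
    """Return a collectd configuration snippet for docker-collectd-plugin."""
    lines = [
        'TypesDB "{}/dockerplugin.db"'.format(plugin_dir),
        "LoadPlugin python",
        "",
        "<Plugin python>",
        '  ModulePath "{}"'.format(plugin_dir),
        '  Import "dockerplugin"',
        "",
        "  <Module dockerplugin>",
        '    BaseURL "{}"'.format(base_url),
        "    Timeout {}".format(timeout_seconds),
        "  </Module>",
        "</Plugin>",
        "",
    ]
    return "\n".join(lines)


# One collectd per host, talking to the local daemon over its Unix socket.
per_host = render_docker_plugin_config("unix://var/run/docker.sock")

# A single collectd pointed at a Swarm manager (hypothetical host and port).
swarm = render_docker_plugin_config("tcp://swarm-manager.example.com:3376")

if __name__ == "__main__":
    print(per_host)
```

Generating the file from a function like this is optional. A hand-written configuration file with the same directives does the same job.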
Once you have your container metrics flowing to your monitoring system, you can build charts and dashboards to visualize the performance of your containers and your infrastructure. Learn about the metrics collected by the docker-collectd-plugin here.
If your monitoring system is Splunk Infrastructure Monitoring, we automatically discover these metrics and provide curated, built-in dashboards to show your Docker infrastructure from cluster to host to container.
What About Application Metrics?
A key challenge with collecting application metrics from Dockerized applications is locating the source of the data. If your applications don’t automatically push metrics to a remote endpoint, you need to know what runs where, what metrics to poll, and how to poll those metrics from your applications.
For first-party software, I strongly recommend that you make your application report its metrics on its own. Most code instrumentation libraries already work this way, and if yours doesn’t, you should be able to add this functionality to your codebase easily. Just make sure the remote endpoint is easily and (if possible) dynamically configurable.
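As an illustration of the push model, here is a minimal Python sketch of an application that counts events and reports them to an endpoint read from the environment at report time. The `METRICS_ENDPOINT` variable and the payload layout are hypothetical and do not correspond to any particular monitoring system's API. A real instrumentation library would replace most of this.

```python
import json
import os
import threading
import time
import urllib.request


class Counter:
    """A thread-safe, monotonically increasing counter."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def inc(self, amount=1):
        with self._lock:
            self._value += amount

    @property
    def value(self):
        with self._lock:
            return self._value


def build_payload(counters, timestamp=None):
    """Snapshot all counters into a JSON-serializable dict (hypothetical layout)."""
    ts = time.time() if timestamp is None else timestamp
    return {
        "timestamp": int(ts),
        "counters": {name: c.value for name, c in counters.items()},
    }


def post_json(endpoint, payload):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


def report_once(counters, send=post_json, environ=os.environ):
    """Push current values to the configured endpoint, if there is one."""
    # Reading the endpoint on every report keeps it dynamically configurable.
    endpoint = environ.get("METRICS_ENDPOINT")  # hypothetical variable name
    if not endpoint:
        return False
    send(endpoint, build_payload(counters))
    return True
```

Because the application finds its endpoint from its own environment, it reports correctly no matter which host the container lands on.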
In Java, for example, Codahale/Dropwizard Metrics is a popular library for instrumenting programs. To set it up to report metrics to Splunk Infrastructure Monitoring, include our signalfx-java client library and add a few lines of setup code to your application.
Third-party software is where collecting metrics becomes much trickier. Most of the time, the application you want to monitor is not capable of pushing metrics data to an external endpoint. You have to poll those metrics directly from the application, from JMX, or even from logs. In Dockerized environments, this makes configuring your monitoring system quite challenging, depending on whether you have a static container placement or use some form of dynamic container scheduling.
Static Container Placement
Knowing the placement of your application containers, either by configuration or by convention, makes it easier to collect metrics from those applications. Simply configure collectd, either on each host or at another location that can reach the containers, to poll them.
Depending on the application, you may have to expose additional TCP ports to reach whichever endpoint the application exposes metrics through. In some cases, such as for Kafka, you’ll need to enable and expose JMX. For others, like Elasticsearch and ZooKeeper, a specific endpoint of the API is made directly available.
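As one example of polling such an endpoint, the Python sketch below sends ZooKeeper's `mntr` command to its client port and keeps the numeric values. It assumes the client port (2181 by default) is exposed from the container and that the `mntr` command is enabled on the server, which newer ZooKeeper releases require you to allow explicitly. The hostname in the usage note is hypothetical.

```python
import socket


def fetch_mntr(host, port=2181, timeout=5.0):
    """Send the 'mntr' command and return the raw text response."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"mntr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")


def _to_number(value):
    """Convert a string to int or float, or return None if it is not numeric."""
    for convert in (int, float):
        try:
            return convert(value)
        except ValueError:
            continue
    return None


def parse_mntr(text):
    """Parse tab-separated 'key<TAB>value' lines, keeping numeric values only."""
    metrics = {}
    for line in text.splitlines():
        key, separator, value = line.partition("\t")
        if not separator:
            continue
        number = _to_number(value.strip())
        if number is not None:
            metrics[key] = number
    return metrics
```

With a static placement you would call `parse_mntr(fetch_mntr("zookeeper-1.example.com"))` on a schedule. With collectd, a plugin does the equivalent polling for you.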
Dynamic Container Scheduling
If you use a dynamic container scheduler such as Kubernetes or Mesos + Marathon, it’s very likely that you don’t entirely control where your applications execute. Even if your applications leverage service discovery, it can be very difficult to bridge the gap between your metrics collection and monitoring systems. The same problem arises when using serverless infrastructures or pure container hosting providers.
We see three solutions to this problem. None is perfect if you want to stay close to the doctrine of lightweight Docker images that execute a single application binary inside the running container. However, all provide a starting point to bridge the gap between metrics collection and monitoring systems.
- Find a way to make your metrics collection system dynamically re-configurable when your container scheduler takes action. This requires a fair amount of engineering effort: you need to build a service that listens to the events your container scheduler generates as containers come and go, and that reconfigures your metrics collection system in response. If you use collectd, this could mean automatically regenerating collectd’s configuration sections and restarting it as appropriate. However, if containers are spun up or go away too often, this may lead to less reliable metrics reporting.
- Execute collectd in a sidekick container. Similar to the first solution, you can listen to events generated by your container scheduler and use those events to automatically start and stop sidekick collectd containers. For each application container running in your environment, an additional collectd container is started with minimal configuration to collect metrics exclusively from the application in the corresponding container. This approach multiplies the number of containers you are running. Thankfully, collectd is lightweight with its minimal configuration of only one plugin to get the application metrics you need. Whenever possible, execute this sidekick container with a placement constraint that will force it to execute on the same physical host as the container your application runs in. This minimizes network involvement in the metrics collection process.
- Execute collectd inside your application container. By bundling collectd inside your application container, you no longer have to deal with the dynamic nature of your application placement. When the application starts, collectd starts with it to report that application’s metrics. A minimal configuration can be tailored to poll the application on localhost, giving you the view from inside the container. The flip side is that you now have to manage the lifecycle of two processes inside your container, either manually or with supervisord. If your application stops or dies, you’ll most likely want to stop collectd and stop the container. If collectd dies, you’ll want to restart it, or stop the container if collectd doesn’t come back, given that no metrics would otherwise be collected from this container’s application.
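The first two options both hinge on reacting to scheduler events. Here is a minimal Python sketch of that reaction loop for the collectd case. It tracks which containers are running, regenerates the configuration when something changed, and limits how often collectd is restarted so that container churn does not starve metrics reporting. The event format, the `render_section` and `write_and_restart` callables, and the 30-second interval are all hypothetical. Your scheduler's event stream and your collectd plugin determine the real shapes.

```python
import time


class CollectdReconfigurer:
    """Regenerates collectd configuration as application containers come and go."""

    def __init__(self, render_section, write_and_restart,
                 min_interval=30.0, clock=time.monotonic):
        self._render_section = render_section        # (id, host, port) -> str
        self._write_and_restart = write_and_restart  # (config text) -> None
        self._min_interval = min_interval            # seconds between restarts
        self._clock = clock
        self._targets = {}                           # container id -> (host, port)
        self._dirty = False
        self._last_restart = None

    def handle_event(self, event):
        """Apply one scheduler event of the assumed form
        {"type": "start" | "stop", "id": ..., "host": ..., "port": ...}."""
        container_id = event["id"]
        if event["type"] == "start":
            target = (event["host"], event["port"])
            if self._targets.get(container_id) != target:
                self._targets[container_id] = target
                self._dirty = True
        elif event["type"] == "stop":
            if self._targets.pop(container_id, None) is not None:
                self._dirty = True

    def flush(self):
        """Write the config and restart collectd if needed and allowed.

        Returns True when a restart happened. Changes that arrive inside the
        rate-limit window stay pending until a later flush.
        """
        if not self._dirty:
            return False
        now = self._clock()
        if (self._last_restart is not None
                and now - self._last_restart < self._min_interval):
            return False
        config = "".join(
            self._render_section(container_id, host, port)
            for container_id, (host, port) in sorted(self._targets.items())
        )
        self._write_and_restart(config)
        self._dirty = False
        self._last_restart = now
        return True
```

A service built around this would call `handle_event` for each event it receives and `flush` on a timer. For the sidekick variant, the same event handling would start and stop collectd containers instead of rewriting one configuration file.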
Monitoring Docker itself and getting system-level metrics from your containers is easy with the docker-collectd-plugin. Monitoring the applications that you run inside your Docker containers is where it gets more complex and where the confusion around monitoring Docker comes from.
In the second part of the Monitoring Docker Containers series, we’ll discuss how Splunk Infrastructure Monitoring monitors its containerized infrastructure, the tools used to orchestrate across our various environments, and how we get visibility across all layers of the infrastructure.
To learn more, check out our webinar with Zenefits on operationalizing Docker and orchestrating microservices. I shared lessons from running Docker at scale during the past three years, including what metrics matter for monitoring, how to assign data dimensions for troubleshooting, and strategies for alerting on microservices running in Docker containers.
Learn more about Splunk Infrastructure Monitoring and get a 14-day free trial!