In a microservices world, modern architectures are constantly evolving. Services scale up and down to meet application demands while development teams build, test, and deploy their code to meet nimbler product cycles. Whenever a service endpoint changes, the registry needs to know about the change.
Service registration defines who publishes or updates the information on how to reach each service. Third-party registration is commonly used to poll or check which microservice instances are running and to automatically update the service registry. Apache ZooKeeper is a key tool for managing a distributed microservices architecture.
When you use collectd and the collectd-zookeeper plugin, SignalFx provides built-in ZooKeeper monitoring dashboards displaying useful production metrics at the node, host, and cluster levels. Key ZooKeeper metrics include node count, packet count, latency, watch count, data size, and open file descriptors.
From SignalFx’s experience monitoring ZooKeeper in production, there are four primary indicators of a healthy ZooKeeper service: disk usage, request metrics, active connections, and total znode count. Because most ZooKeeper clusters are small, changes in these indicators usually show up at the level of individual nodes.
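The metrics listed above can also be inspected by hand through ZooKeeper's `mntr` four-letter command, which returns tab-separated name/value pairs such as `zk_avg_latency`, `zk_znode_count`, `zk_watch_count`, and `zk_num_alive_connections`. The following is a minimal sketch, not the collectd plugin's implementation; the function names `parse_mntr` and `fetch_mntr`, the default host and port, and the timeout are assumptions for illustration. On ZooKeeper 3.5 and later, `mntr` must be allowed through `4lw.commands.whitelist` before the server will answer it.

```python
import socket
from typing import Dict


def parse_mntr(raw: str) -> Dict[str, str]:
    """Parse tab-separated `mntr` output into {metric_name: value}."""
    metrics = {}
    for line in raw.splitlines():
        key, sep, value = line.partition("\t")
        if sep:  # skip lines that are not name<TAB>value pairs
            metrics[key.strip()] = value.strip()
    return metrics


def fetch_mntr(host: str = "localhost", port: int = 2181,
               timeout: float = 5.0) -> Dict[str, str]:
    """Send `mntr` to one ZooKeeper node and return its parsed metrics.

    Host, port, and timeout defaults are assumptions; adjust for your cluster.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"mntr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return parse_mntr(b"".join(chunks).decode("utf-8", errors="replace"))
```

Calling `fetch_mntr("zk1.example.com")` (a hypothetical host name) against each node gives a quick per-node view of the same indicators the dashboards chart.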
Disk Usage on ZooKeeper Instances
When disk usage is properly managed, ZooKeeper can have months or years of uptime. ZooKeeper keeps persistent copies of its znodes on disk as snapshots and transaction log files. As changes are made to the znodes, they are appended to the transaction log, and, eventually, a snapshot of the current state of all znodes is written to the file system.
However, a ZooKeeper node becomes non-operational when it runs out of disk capacity because of the volume of snapshot and transaction log data, and losing ZooKeeper can disrupt the overall operation of your environment.
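To see how much of a node's disk is taken up by each kind of file, the sizes can be totaled directly. The sketch below assumes the usual on-disk naming, where snapshots are named `snapshot.<zxid>` and transaction logs `log.<zxid>` inside a `version-2` directory under the configured data directory; the function name `data_dir_usage` is hypothetical.

```python
import os
from typing import Dict


def data_dir_usage(version_dir: str) -> Dict[str, int]:
    """Return total bytes used by snapshot.* and log.* files in version_dir."""
    totals = {"snapshot": 0, "log": 0}
    for entry in os.scandir(version_dir):
        if not entry.is_file():
            continue
        prefix, sep, _ = entry.name.partition(".")
        # Only count files of the form snapshot.<suffix> or log.<suffix>.
        if sep and prefix in totals:
            totals[prefix] += entry.stat().st_size
    return totals
```

For example, `data_dir_usage("/var/lib/zookeeper/version-2")` (the path is an assumption; use your own `dataDir`) returns the byte totals for both file types, which makes it easy to tell whether snapshots or transaction logs are driving growth.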
Alert on Leading Indicators
Disk usage should be consistent and grow in parallel across all nodes of a cluster. An unexpected increase in disk usage over a short period on a single ZooKeeper host indicates increased writes to disk. Because snapshots are deleted only after a certain time period, a sudden increase in the volume of snapshots written to disk reduces the disk space remaining on the host.
Creating alerts on this leading indicator will result in meaningful notifications as patterns emerge at the service level. An alert for one host often indicates that other hosts are nearing a similar issue, and running out of disk capacity for a ZooKeeper cluster is an early indication of service failure.
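One way to express "one host is growing faster than its peers" is to compare each host's growth over a window against the cluster median. This is an illustrative sketch only, not how SignalFx detectors are implemented; the function name `growth_outliers`, the `factor` of 2, and the `min_growth` of one percentage point are assumed values chosen for the example.

```python
from statistics import median
from typing import Dict, List, Tuple


def growth_outliers(samples: Dict[str, Tuple[float, float]],
                    factor: float = 2.0,
                    min_growth: float = 1.0) -> List[str]:
    """Flag hosts whose disk usage grew much faster than the cluster median.

    samples maps host -> (usage_pct_at_window_start, usage_pct_at_window_end).
    factor and min_growth are assumed tuning knobs, not recommended values.
    """
    growth = {host: end - start for host, (start, end) in samples.items()}
    baseline = max(median(growth.values()), 0.0)
    return sorted(
        host for host, g in growth.items()
        if g >= min_growth and g > factor * baseline
    )
```

With three hosts that start at 40% and end the window at 41%, 41%, and 48%, only the third is flagged, because its growth of 8 points is well above twice the median growth of 1 point.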
Within SignalFx, a warning alert at 50% usage and a critical alert at 80% usage will help the operations team or service owner in the development organization address the trend before it leads to a performance issue in production.
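The two thresholds above translate into a simple classification. The sketch below is a local check written for illustration and does not use any SignalFx API; the names `disk_usage_pct` and `severity` are hypothetical, while the 50% and 80% values come from the recommendation above.

```python
import shutil

WARNING_PCT = 50.0   # warning threshold from the text
CRITICAL_PCT = 80.0  # critical threshold from the text


def disk_usage_pct(path: str) -> float:
    """Return the percentage of the filesystem holding `path` that is used."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def severity(pct: float) -> str:
    """Map a disk usage percentage to 'ok', 'warning', or 'critical'."""
    if pct >= CRITICAL_PCT:
        return "critical"
    if pct >= WARNING_PCT:
        return "warning"
    return "ok"
```

Running `severity(disk_usage_pct("/var/lib/zookeeper"))` on each host (the path is an assumption) gives the same three-level signal the alerts would produce.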
Restarting ZooKeeper nodes can upset other service components in the environment. If you must restart one or more ZooKeeper nodes, restart them one at a time, allow enough time for leader election to complete after each restart, and verify that the rest of the stack is healthy before moving on to the next node.
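A simple gate between restarts is to confirm that the ensemble has settled on exactly one leader before touching the next node. The sketch below assumes each node's role has already been read from the `zk_server_state` field of `mntr` output (or the `Mode:` line of `srvr`); the function name `ensemble_settled` and the `expected_size` parameter are hypothetical, and a single-node deployment that reports `standalone` is deliberately not treated as an ensemble.

```python
from typing import Iterable


def ensemble_settled(states: Iterable[str], expected_size: int) -> bool:
    """Return True when every expected node reports a role and one is leader.

    states holds one role string per reachable node, for example
    'leader', 'follower', or 'observer'.
    """
    roles = list(states)
    if len(roles) != expected_size:
        return False  # at least one node is not answering yet
    if any(role not in ("leader", "follower", "observer") for role in roles):
        return False  # a node is still starting or in an unexpected state
    return roles.count("leader") == 1
```

A restart script could poll this check after each node comes back and proceed only once it returns `True`.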