By Mike Cohen
Anyone running a mission-critical application built on microservices knows the feeling of dread when the pager goes off to announce that the application is down. After mumbling a few expletives about the redundancy you thought you had, you cross your fingers and hope you can solve the murder mystery: which piece is broken, and why?
You’re not alone! Hundreds of engineering hours are lost daily in attempts to unravel webs of microservices. The harsh reality is that monitoring distributed applications is fundamentally more complex than monitoring their monolithic predecessors.
In a monolithic world, host metrics like CPU, memory, and I/O performance told you a lot about overall system health. After all, that host was where all the application code was running. For deeper analysis, high-overhead APM tools let you instrument a single Rails or JVM instance to pick apart specific performance problems.
But what happens when you have tens or maybe hundreds of microservices, written and operated by different teams? And when these microservices depend on a growing number of external managed services that are critical to your application’s performance and reliability? And of course, when you’re running on cloud infrastructure into which you have minimal visibility, and where subtle failures (host reboots, network slowness, DNS issues) are the norm? Suddenly, those host metrics and APM tools are marginally useful at best.
A new approach is required to give us true visibility into microservices. Operations teams need a way to get a complete picture of their distributed application in real time, without changing their code and without slowing down their production systems.
The question is how? Services (and, even more so, individual containers) tend to contain relatively small logical “pieces” of the deployment, so traditional metrics don’t tell us much about the overall behavior of the application. Instead, let’s focus our attention on how these services communicate with each other, i.e., the “flow” of data between them. By monitoring traffic rates, latency, and error rates on every connection, and correlating these measurements with metadata like service names, processes, container IDs, and availability zones, you can construct an accurate picture of a distributed application. This is the idea behind flow monitoring, and it’s complementary to existing observability techniques based on metrics, logs, and tracing.
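To make the idea concrete, here is a minimal Python sketch of the aggregation step: it groups per-request flow observations by the pair of services involved and derives a traffic rate, an error rate, and a mean latency for each pair. The `FlowRecord` schema, the `summarize_flows` function, and the sample service names are all hypothetical, chosen only for illustration; a real collector would produce records in its own format.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class FlowRecord:
    """One observed request between two services (hypothetical schema)."""
    src_service: str
    dst_service: str
    zone: str
    latency_ms: float
    is_error: bool


def summarize_flows(records, window_seconds):
    """Group records by (src, dst) edge and compute rate, error rate, and mean latency."""
    by_edge = defaultdict(list)
    for record in records:
        by_edge[(record.src_service, record.dst_service)].append(record)

    summary = {}
    for edge, flows in by_edge.items():
        errors = sum(1 for flow in flows if flow.is_error)
        summary[edge] = {
            "requests_per_sec": len(flows) / window_seconds,
            "error_rate": errors / len(flows),
            "mean_latency_ms": mean(flow.latency_ms for flow in flows),
        }
    return summary


# Made-up sample data covering a two-second observation window.
records = [
    FlowRecord("web", "checkout", "us-east-1a", 12.0, False),
    FlowRecord("web", "checkout", "us-east-1a", 18.0, False),
    FlowRecord("web", "checkout", "us-east-1b", 30.0, True),
    FlowRecord("checkout", "payments-api", "us-east-1a", 100.0, False),
]
summary = summarize_flows(records, window_seconds=2)
```

Each key in `summary` is an edge in the service graph, so the result can be read as a map of who talks to whom and how well. Adding the zone or container ID to the grouping key would break the same numbers down by that metadata.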
Critically, flow data is readily available without changing your application code, making it easy to deploy with 100% coverage. Modern operating systems already track this information, and a number of tools, including eBPF, allow you to gather it with negligible overhead. Service proxies like Envoy can augment this data further.
Furthermore, flow data provides insights that extend beyond the services you operate to external, third-party managed services. The flows arriving from remote systems, which can be identified by DNS or service discovery, carry with them information about their host systems’ availability, load, and performance. This gives you a means of observing external services without adding any instrumentation or relying on provider reporting.
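As a rough illustration of the identification step, the Python sketch below labels the remote end of a flow: it first checks a table of addresses belonging to services you operate, then falls back to a reverse DNS lookup for everything else. The `label_remote` function and the `registry` dictionary are hypothetical stand-ins for a real service-discovery query; only `socket.gethostbyaddr` is a standard library call.

```python
import socket


def label_remote(ip, registry, reverse_lookup=None):
    """Return (name, kind) for the remote IP address of a flow.

    registry: dict mapping IPs of services we operate to their names
              (a hypothetical stand-in for a service-discovery query).
    reverse_lookup: callable taking an IP and returning a hostname;
                    defaults to a reverse DNS lookup.
    """
    if ip in registry:
        return registry[ip], "internal"

    lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    try:
        return lookup(ip), "external"
    except OSError:
        # No reverse DNS entry or the resolver failed: keep the raw address.
        return ip, "unknown"
```

Passing the lookup in as a parameter keeps the function easy to test and lets you swap in a cached resolver, since a live DNS query per flow would be slow at production traffic volumes.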
At Splunk, we are strong believers that flow data can change the way SREs approach observability, and we’ve been busy building tools to help you get there. If you want to learn more, visit our webpage at www.Splunk.com/NPM