By Mike Cohen
In my last post, I introduced the idea of monitoring flows. I mentioned that flows capture a set of metrics (traffic, errors, latencies) transparently along with information about both the sending and receiving processes, containers, services, etc. In a world of numerous, function-specific microservices, they represent a powerful new dataset which deserves our close attention.
One of the big reasons flow data is powerful is that it's naturally comprehensive, covering 100% of the interactions between services. No code changes required! That's obviously a huge plus from an observability perspective.
But what can you do with flow data? A few major areas stand out.
Architecture, high-availability strategy, and environment isolation. Flow data provides a real-time, complete picture of the application architecture: it maps every service dependency in the system. This is certainly useful for ramping up team members or making informed operational decisions. But due to its real-time nature, it can also help you discover problems before they turn into incidents.
Do you have a service that has a dependency you’re unaware of, or one that is undesirable? Is there a dependency between availability zones or regions that negates your high-availability strategy? Are two supposedly isolated environments interacting (e.g., production and staging)?
Flow data can be instrumental in answering these questions… and in giving your team a single source of truth about your architecture.
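To make this concrete, questions like these reduce to simple filters over flow records. The sketch below is only an illustration: the `Flow` record, its field names, and the sample services and zones are all assumptions made up for this example, not the schema of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    # Hypothetical flow record; every field name here is an assumption.
    src_service: str
    dst_service: str
    src_env: str
    dst_env: str
    src_zone: str
    dst_zone: str


def cross_environment(flows):
    """Return flows whose two endpoints live in different environments."""
    return [f for f in flows if f.src_env != f.dst_env]


def cross_zone_dependencies(flows):
    """Return the (source, destination) service pairs that talk across zones."""
    return {(f.src_service, f.dst_service) for f in flows if f.src_zone != f.dst_zone}


# Made-up sample data for illustration only.
flows = [
    Flow("web", "checkout", "production", "production", "us-east-1a", "us-east-1a"),
    Flow("checkout", "payments", "production", "production", "us-east-1a", "us-east-1b"),
    Flow("web", "catalog", "staging", "production", "us-east-1a", "us-east-1a"),
]

for f in cross_environment(flows):
    print(f"{f.src_env} {f.src_service} -> {f.dst_env} {f.dst_service}")
print(cross_zone_dependencies(flows))
```

Here the staging `web` service calling the production `catalog` service is exactly the kind of isolation leak you would want surfaced, and the `checkout` to `payments` pair is a cross-zone dependency worth checking against your high-availability plan.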
Pinpoint problems faster. Bugs, configuration errors, and incompatibilities inevitably reach staging and production environments. With tens or hundreds of services, it can be complex to figure out where a problem is really occurring, especially if it's a subtle interaction between two services. Flow data allows for quick discovery of unhealthy pairs of services by showing you abnormal traffic patterns. Moreover, it helps you understand the blast radius of the problem: which services are affected, which versions, which zones/regions. Is a shared service being abused by one of its hundred clients? Which one?
Flow data can also help you assess the health of the infrastructure and of third-party managed services. A wide variety of problems can crop up in shared infrastructure, including intermittent network problems, host reboots, DNS issues, or zone-wide failures. Flow data can help you quickly discern whether the problem is occurring within your application or outside it. And it can help you zero in on the instances or containers affected so you can mitigate the problem.
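One simple way to picture "unhealthy pairs" is to aggregate requests and errors per source and destination and flag the pairs whose error rate crosses a threshold. This is a minimal sketch under stated assumptions: the tuple layout, the 5% threshold, and the sample numbers are all hypothetical, and a real system would more likely compare against a baseline than a fixed cutoff.

```python
from collections import defaultdict


def unhealthy_pairs(flows, max_error_rate=0.05):
    """Sum requests and errors per (source, destination) pair and return the
    pairs whose error rate exceeds max_error_rate, worst first."""
    totals = defaultdict(lambda: [0, 0])  # pair -> [requests, errors]
    for src, dst, requests, errors in flows:
        totals[(src, dst)][0] += requests
        totals[(src, dst)][1] += errors

    flagged = []
    for pair, (requests, errors) in totals.items():
        if requests == 0:
            continue  # no traffic, so no meaningful error rate
        rate = errors / requests
        if rate > max_error_rate:
            flagged.append((pair, rate))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical samples: (source, destination, requests, errors)
samples = [
    ("web", "checkout", 1000, 4),
    ("checkout", "payments", 400, 60),
    ("checkout", "payments", 100, 40),
    ("reports", "catalog", 200, 12),
]

for (src, dst), rate in unhealthy_pairs(samples):
    print(f"{src} -> {dst}: {rate:.1%} errors")
```

Grouping the same records by client instead of by pair would answer the "which one of a hundred clients" question for a shared service.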
Cost, usage, and consumption. Public cloud providers have done a good job training us to run our services across multiple availability zones for high availability. But at the same time, they charge different rates for traffic between regions, zones, and instances in the same zone. This isn’t too scary in a monolithic application with a single IP address... but how can you optimize or even manage this cost if a service is spread across tens or hundreds of instances or containers? Flow data can paint a detailed picture of the behavior between services and show you which ones are generating the most expensive traffic.
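As a rough sketch of how that picture could be built, the example below classifies each flow by where its endpoints sit and multiplies the bytes transferred by a per-class rate. Everything specific here is an assumption: the rates are placeholders rather than any provider's actual pricing, and the tuple layout and sample services are invented for illustration.

```python
from collections import defaultdict

# Placeholder rates in dollars per GB. These are NOT real provider prices;
# substitute the figures from your own bill.
RATES_PER_GB = {"same_zone": 0.001, "cross_zone": 0.01, "cross_region": 0.02}


def traffic_class(src, dst):
    """Classify a flow by location. src and dst are (region, zone) tuples."""
    if src[0] != dst[0]:
        return "cross_region"
    if src[1] != dst[1]:
        return "cross_zone"
    return "same_zone"


def cost_by_pair(flows):
    """Return (service pair, estimated cost) tuples, most expensive first."""
    costs = defaultdict(float)
    for src_service, dst_service, src_loc, dst_loc, gigabytes in flows:
        rate = RATES_PER_GB[traffic_class(src_loc, dst_loc)]
        costs[(src_service, dst_service)] += gigabytes * rate
    return sorted(costs.items(), key=lambda item: item[1], reverse=True)


# Hypothetical samples: (source, destination, source location, destination location, GB)
flows = [
    ("web", "checkout", ("us-east", "a"), ("us-east", "a"), 500),
    ("checkout", "payments", ("us-east", "a"), ("us-east", "b"), 300),
    ("reports", "catalog", ("us-east", "a"), ("eu-west", "a"), 200),
]

for (src, dst), cost in cost_by_pair(flows):
    print(f"{src} -> {dst}: ${cost:.2f}")
```

With these made-up numbers, the pair moving the least data ends up costing the most because it crosses regions, which is the sort of thing that is hard to see without per-pair traffic data.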
Hopefully this gives you at least an initial picture of the kinds of questions flow data can answer for you. We’ve found it can be a great way to monitor distributed applications.