In the Digital World Where Agility Matters, Is Your APM Solution Slowing You Down?

This is the third chapter of our multi-part blog series on the shortcomings of traditional APM solutions for monitoring microservices-based applications.

This post explains how the alerting and troubleshooting capabilities of traditional APM fail to address the evolving requirements of monitoring microservices-based applications.

A highly functioning alerting system is the starting point for problem resolution. Traditional APM vendors have built mature alerting functionality, which works fine for monolithic apps. Microservices architectures, however, pose significant alerting challenges for traditional APM solutions because container environments are dynamic and ephemeral and microservices are distributed by nature.

Fatigue and Overhead from Alert Storms

Traditional APM tools are designed to collect and report on performance data at the individual component level. This works well when your applications are monolithic and run entirely on a single runtime. Cloud-native architecture, however, significantly increases the number of components to monitor: breaking up a monolith produces multiple microservices, which in turn run on many more containers. Alerting on each component individually is a recipe for alert noise, because a performance issue in one component often has a domino effect on upstream components, each of which fires its own alerts, and the result is an alert storm. Alert storms inhibit the triage process to the point where getting no alert is perceived as the lesser evil.

According to Gartner, Inc.:

“Most APM solutions were designed for a prior generation of applications that were monolithic and long-lived. These approaches are ill-suited to the dynamism, modularity and scale of today’s emerging microservice-based applications.”
— Gartner

In microservices architectures, alerts should be highly contextual, with topology awareness and correlation. For example, alerts from upstream services should automatically be muted if a downstream service they depend on is found to have a performance issue.
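The muting rule described above can be sketched in a few lines. This is a minimal illustration rather than any vendor's implementation: the service names, the `DEPENDENCIES` map, and both function names are hypothetical, and the dependency graph is assumed to be acyclic.

```python
from typing import Dict, List, Set

# Hypothetical service dependency graph: each service maps to the
# downstream services it calls directly.
DEPENDENCIES: Dict[str, List[str]] = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payments"],
    "catalog": [],
    "payments": [],
}


def downstream_of(service: str, deps: Dict[str, List[str]]) -> Set[str]:
    """Return every service reachable by following calls out of `service`."""
    seen: Set[str] = set()
    stack = list(deps.get(service, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(deps.get(current, []))
    return seen


def root_cause_alerts(alerting: Set[str], deps: Dict[str, List[str]]) -> Set[str]:
    """Mute a service's alert when any of its downstream services is also alerting."""
    return {s for s in alerting if not (downstream_of(s, deps) & alerting)}


if __name__ == "__main__":
    # Three services alert at once, but only the deepest one is surfaced.
    print(root_cause_alerts({"frontend", "checkout", "payments"}, DEPENDENCIES))
```

With this rule, a slowdown in the hypothetical `payments` service produces one alert instead of three, because `checkout` and `frontend` both sit upstream of it.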

Missed Alerts for Outlier Anomalies

You can’t alert on an issue you don’t see. Traditional APM tools use head-based sampling, which randomly selects the transactions whose trace data gets analyzed. Because the selection is random, these tools cannot be relied on to alert on every outlier or intermittent issue, and alerts are missed even when the end-user experience is being impacted. We covered the shortcomings of this random, head-based sampling approach in a previous blog.
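To make the limitation concrete, here is a small sketch that contrasts a head-based decision, made before a request's latency is known, with a decision made after the trace completes (commonly called tail-based sampling, added here only as a point of comparison). The 10% rate, the 1,000 ms threshold, the synthetic traces, and all function names are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    trace_id: str
    duration_ms: float


def head_sample(trace_id: str, percent: int = 10) -> bool:
    """Decide at the start of a request, using only the trace ID.

    The latency is not known yet, so slow and fast traces are
    equally likely to be dropped.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def tail_sample(trace: Trace, slow_ms: float = 1000.0) -> bool:
    """Decide after the request completes, so every slow trace can be kept."""
    return trace.duration_ms >= slow_ms


def make_traces(count: int = 1000, outlier_every: int = 250) -> List[Trace]:
    """Build synthetic traffic: mostly 50 ms requests with rare 5 s outliers."""
    traces = []
    for i in range(count):
        duration = 5000.0 if i % outlier_every == 0 else 50.0
        traces.append(Trace(f"trace-{i}", duration))
    return traces


if __name__ == "__main__":
    traces = make_traces()
    outliers = [t for t in traces if t.duration_ms >= 1000.0]
    head_kept = [t for t in outliers if head_sample(t.trace_id)]
    tail_kept = [t for t in outliers if tail_sample(t)]
    print(f"outliers: {len(outliers)}, head kept: {len(head_kept)}, tail kept: {len(tail_kept)}")
```

Because `head_sample` never looks at `duration_ms`, each outlier survives only with the same probability as any other request, while the after-the-fact decision keeps all of them.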

“Your alerts should tell you about a performance problem before your customers will. The tools that used to work ten years ago are no longer sufficient to monitor p99 cases in distributed systems because these tools do not see everything across the system.”
— Sr. DevOps Engineer, Digital Marketing Platform Company


Slow Alerts on Detected Anomalies

Traditional APM solutions require several minutes to notice a performance deviation and even more time to fire an alert. This is because they are built on a batch-and-query architecture, which has high latency and becomes even slower as the number of dimensions you want to consider for alerting grows.
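The difference in detection delay can be illustrated with a toy model. It compares evaluating each datapoint as it arrives against running a query once per fixed interval. The 60-second interval, the threshold, and the function names are hypothetical, and the model ignores ingestion and query execution time, which would add further delay to the batch case in a real system.

```python
from typing import List, Optional, Tuple

Point = Tuple[int, float]  # (timestamp in seconds, latency in ms)


def streaming_detect(points: List[Point], threshold: float) -> Optional[int]:
    """Evaluate every datapoint on arrival; return the time of the first breach."""
    for ts, value in points:
        if value > threshold:
            return ts
    return None


def batch_detect(points: List[Point], threshold: float, interval_s: int) -> Optional[int]:
    """Evaluate once per interval; return the time of the run that sees a breach."""
    if not points:
        return None
    last_ts = points[-1][0]
    run_at = interval_s
    while run_at <= last_ts + interval_s:
        # Each run only looks at the datapoints from the interval that just ended.
        window = [v for ts, v in points if run_at - interval_s <= ts < run_at]
        if any(v > threshold for v in window):
            return run_at
        run_at += interval_s
    return None


if __name__ == "__main__":
    # One datapoint per second; latency jumps from 100 ms to 900 ms at t=65.
    points = [(t, 900.0 if t >= 65 else 100.0) for t in range(180)]
    print("streaming:", streaming_detect(points, threshold=500.0))
    print("batch:", batch_detect(points, threshold=500.0, interval_s=60))
```

In this model the streaming check flags the breach at the second it occurs, while the batch check cannot see it until the next scheduled run.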

Siloed Perspectives for Infrastructure and Application

Today’s microservices environments are increasingly dynamic, modular, ephemeral, and loosely coupled, making it difficult for domain-specific traditional APM solutions to provide a unified, single-pane-of-glass view across infrastructure, platform, and application monitoring. Even when infrastructure and application monitoring capabilities come from the same vendor, users are expected to connect the dots and correlate events manually, because APM and infrastructure performance metrics are usually displayed in separate tabs without automatic correlation or context.

A fragmented view of applications and infrastructure leads to lengthy, time-consuming war-room sessions during root-cause analysis. There is nothing inherently wrong with convening a war room, but the APM tool should streamline the collaboration.

A next-gen APM solution for monitoring microservices can solve this problem by providing an integrated application and infrastructure view in a single pane of glass, with everything correlated and in context.

Lack of Prescriptive Troubleshooting

Traditional APM tools lack the cross-domain analysis needed to recognize patterns and understand causal relationships across distributed systems. As a result, users are expected to examine individual traces manually and arrive at ‘aha’ moments themselves. Next-gen APM tools need to apply data science to recognize the underlying performance patterns and surface them to DevOps teams to expedite troubleshooting.
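One simple way to surface such a pattern, offered here as an illustrative sketch and not as a description of how any particular product works, is to rank the tag values that appear far more often in slow traces than in traffic overall. The tag names, the `min_lift` cutoff, and the function name are assumptions, and the slow traces are assumed to be a subset of all traces.

```python
from collections import Counter
from typing import Dict, List, Tuple

Tags = Dict[str, str]  # tags attached to one trace, e.g. {"host": "b"}


def overrepresented_tags(
    slow: List[Tags], all_traces: List[Tags], min_lift: float = 2.0
) -> List[Tuple[str, str, float]]:
    """Rank (tag, value) pairs that are more common in slow traces than overall.

    Lift is the share of slow traces carrying the pair divided by the
    share of all traces carrying it. `slow` must be a subset of `all_traces`.
    """
    if not slow or not all_traces:
        return []
    slow_counts = Counter((k, v) for tags in slow for k, v in tags.items())
    all_counts = Counter((k, v) for tags in all_traces for k, v in tags.items())
    results = []
    for pair, n_slow in slow_counts.items():
        slow_share = n_slow / len(slow)
        overall_share = all_counts[pair] / len(all_traces)
        lift = slow_share / overall_share
        if lift >= min_lift:
            results.append((pair[0], pair[1], round(lift, 2)))
    return sorted(results, key=lambda r: r[2], reverse=True)


if __name__ == "__main__":
    fast = [{"host": "a", "version": "v1"} for _ in range(6)]
    slow = [{"host": "b", "version": "v1"} for _ in range(2)]
    print(overrepresented_tags(slow, fast + slow))
```

In the sample data every slow trace ran on host `b`, which handles only a quarter of all traffic, so that pair is surfaced, while the `version` tag, shared by every trace, is not.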

Next Up

We are excited to join AWS at the largest gathering of the global cloud community at AWS re:Invent. We would love to share how our customers are leveraging SignalFx to quicken their path to problem resolution and reduce MTTR.

Posted by Amit Sharma

Amit Sharma is the Director of Product Marketing at SignalFx. He has over ten years of experience in software development, product management, and product marketing. Prior to joining SignalFx, Amit led product marketing at AppDynamics and Cisco. He received his MSCE from Arizona State University and an MBA from UC Berkeley Haas School of Business.

Maxime has been a software engineer for over 15 years. At SignalFx, Max is the architect behind our Microservices APM offering, and spent several years working on the core of SignalFx: its real-time, streaming SignalFlow™ Analytics. He is also the creator of MaestroNG, a container orchestrator for Docker environments.