Synthetic Monitoring Phases & Strategies

Synthetic monitoring tools have long formed a core part of application performance management and monitoring toolsets.

Yet no matter how familiar you are with synthetic monitoring, there is likely room to get more out of it than you currently are. Indeed, the default approach to synthetic monitoring tends to involve using it reactively: problems occur in production, and your team uses synthetic monitoring to help understand and remediate them.

That’s a start, but reactive use falls far short of what synthetic monitoring can contribute to a performance management strategy. Teams get the most from synthetic monitoring tools when they use them proactively to optimize their systems, rather than only to fix problems as they arise.

In this article, I’ll look at three basic phases – crawling, walking, and running – that site reliability engineering (SRE) and IT teams typically pass through as they work toward full maturity of their synthetic monitoring strategies. No matter where you are currently, I’ll offer tips on moving forward so that synthetic monitoring helps you optimize, not merely manage, the performance of your software environments.

Synthetic monitoring overview

Synthetic monitoring tools, which measure how applications respond to simulated requests, have enjoyed widespread adoption because they provide visibility that’s difficult to achieve via other means.

Unlike real user monitoring (RUM), which collects data about transactions from production environments, synthetic monitoring allows teams to define the precise conditions they want to simulate when evaluating application performance. Synthetic monitoring makes it easy to test for use cases that may not be as well represented in metrics collected from real-user transactions.

Synthetic monitoring also helps test for problems before they impact end users. If you perform only RUM, you run the risk that you won’t detect critical performance problems until they are already disrupting your actual customers. With synthetic monitoring, however, you can test for and identify potential problems before real customers experience them.

Of course, synthetic monitoring is only one ingredient in a complete application management strategy. RUM, log analytics, distributed tracing and other observability methodologies are equally important. Nonetheless, synthetic monitoring is a must-have technique for any team that wishes to achieve end-to-end visibility into both the applications it deploys and the customer experience it delivers.

Three phases of synthetic monitoring

Teams use synthetic monitoring at different levels of maturity within a broader application performance management strategy. To understand these levels, it’s helpful to think of them as three stages of development.

The crawling stage

The first and simplest stage is akin to crawling. Here, synthetic monitoring is used to collect only the most basic metrics, like uptime statistics. Because these metrics are used mainly for troubleshooting, synthetic monitoring at this stage matters primarily to the SRE or ITOps team.

Although these metrics are basic, they provide the foundation for deeper insight into the application, such as which services experience critical problems most frequently. As uptime statistics improve, they can also provide proof of the ROI for SRE and IT operations, which in turn helps SRE and IT engineers get buy-in to move on to the next stage of monitoring.

At this stage: Monitoring is limited in scope and value. Monitoring for uptime alone doesn’t help to find and fix performance bottlenecks or understand the wider impact of performance problems.
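
As a minimal sketch of what a crawling-stage check looks like, the Python below probes a URL and turns a series of probe results into an uptime percentage. The function names `probe` and `uptime_percent` are hypothetical, the `opener` parameter exists only so the check can be exercised without a network, and a real tool would also schedule probes and store results.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx or 3xx status in time."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError (4xx/5xx) and socket timeouts all derive from OSError.
        return False


def uptime_percent(results):
    """Percentage of probes that succeeded; 0.0 when there are no results."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)
```

A result such as `uptime_percent([True, True, False, True])` reports 75 percent uptime, but says nothing about how slowly the successful requests were answered, which is the limitation described above.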

The walking stage

The next stage in the synthetic monitoring journey can be compared to walking. At this stage, organizations learn that “slow is the new down,” meaning that applications that perform slowly are just as problematic as those that don’t respond at all.

As a result, teams begin using synthetic monitoring to track response times and errors in addition to uptime. With this insight, they can understand which services are consistently slow and which ones have experienced a regression in performance. This enables:

  • The team to proactively detect a service that may fail because it’s getting slower and slower.
  • The organization to determine which types of issues to prioritize, based on the services experiencing the greatest problems and the impact of those problems.
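
One simple way to spot a service that is getting slower is to compare its recent response times with an earlier baseline. The sketch below assumes latencies are collected in chronological order in milliseconds. The name `is_regressing`, the window of five samples, and the 25 percent threshold are illustrative choices, not values from any particular tool.

```python
from statistics import median


def is_regressing(latencies_ms, window=5, threshold=1.25):
    """Flag a service whose recent median latency exceeds its baseline.

    The last `window` samples are compared with all earlier samples.
    Returns False when there is not enough history to compare.
    """
    if len(latencies_ms) < 2 * window:
        return False
    baseline = median(latencies_ms[:-window])
    recent = median(latencies_ms[-window:])
    return recent > baseline * threshold
```

Medians are used instead of means so that a single slow outlier does not trigger the flag on its own.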

Although the maturity of monitoring operations has increased at this stage, the performance metrics they track remain simplistic and incapable of delivering complete and actionable visibility. Teams might:

  • Collect metrics only from key application services, for example, rather than performing end-to-end monitoring.
  • Measure the total time it takes for an application to complete a request, instead of measuring response times across individual services as the request moves from one service to another.

At this stage: Teams will know what is slow but they’ll struggle to get to the root cause of performance problems in order to optimize performance. Because monitoring at this stage still focuses on finding problems but doesn’t reveal their root cause, it remains the realm primarily of IT and SRE teams. If the teams lack the monitoring data necessary to pinpoint the code that causes an issue, they can’t collaborate with developers to resolve it.

The running stage

The most advanced stage — and the one that requires the highest level of organizational alignment — is the equivalent of running.

This is the stage where synthetic monitoring reaches full maturity. It’s characterized by a synthetic monitoring strategy that doesn’t collect just generic uptime and performance metrics, but goes deeper by focusing on metrics such as those that Google labels “Core Web Vitals.” These metrics include:

  • Largest Contentful Paint (LCP): How long it takes for the largest image or text block visible in the viewport to render, which indicates when the main content of a page has loaded from the perspective of the user. This data point may differ from what the application reports as page load time, because browser rendering delays and other issues can make loads slower from the user’s viewpoint than backend systems report.
  • First Input Delay (FID): How long the browser takes to begin responding to a user’s first interaction with a page, such as a click or tap. Here again, the page may appear to be loaded from the application’s perspective, but that doesn’t necessarily mean it’s ready to handle user input. (Google replaced FID with Interaction to Next Paint, or INP, as a Core Web Vital in March 2024.)
  • Cumulative Layout Shift (CLS): How much visible content shifts unexpectedly as the page loads. Content that moves around, or that loads and disappears, leads to a poor CLS score and a confusing experience from the user’s perspective.

These metrics focus on what the user experiences, which is the most meaningful measure of performance. They can also help engineers pinpoint the most problematic components of a page, such as images that take longer to load than the rest of the content on a page, or a menu that loads quickly but does not accept input immediately.

In turn, they provide deeper visibility into exactly how to optimize performance.
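
To make these metrics actionable, a synthetic test can rate each measurement against a threshold. The sketch below uses the "good" and "poor" boundaries that Google has published for LCP, FID, and CLS. The `THRESHOLDS` table and the `rate` function are hypothetical names for illustration, and your own performance budgets may be stricter.

```python
# (good_up_to, poor_above) per metric. LCP and FID are in milliseconds,
# CLS is a unitless score.
THRESHOLDS = {
    "LCP": (2500, 4000),
    "FID": (100, 300),
    "CLS": (0.1, 0.25),
}


def rate(metric, value):
    """Classify a measurement as good, needs improvement, or poor."""
    good, poor = THRESHOLDS[metric]
    if value <= good:
        return "good"
    if value <= poor:
        return "needs improvement"
    return "poor"
```

For example, `rate("LCP", 2400)` returns `"good"`, while `rate("CLS", 0.3)` returns `"poor"`.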

At this stage: Synthetic monitoring moves beyond front-end application metrics to include data from the application backend. By correlating granular performance data between different types of application components, engineers gain the visibility necessary to identify the root cause of performance problems.
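
As a rough illustration of that correlation, suppose one synthetic request yields a time measurement for each component it touched. The sketch below finds the component that contributed the most time and its share of the total. The component names and the `slowest_component` function are assumptions for the example. In practice this data would come from tracing or APM tooling.

```python
def slowest_component(timings_ms):
    """Return (name, share_of_total) for the component that took longest.

    `timings_ms` maps a component name to the milliseconds it spent
    handling one synthetic request.
    """
    total = sum(timings_ms.values())
    if total == 0:
        raise ValueError("no time recorded for any component")
    name = max(timings_ms, key=timings_ms.get)
    return name, timings_ms[name] / total


# Hypothetical breakdown of a single 2,000 ms checkout request.
timings = {
    "browser_render": 400,
    "api_gateway": 50,
    "checkout_service": 1200,
    "database": 350,
}
name, share = slowest_component(timings)
print(f"{name} accounts for {share:.0%} of the request")
```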

Sophisticated synthetic monitoring also allows developers to participate fully in the process. When teams can quickly link performance issues to code, developers can find and resolve problems within the codebase. In this way, synthetic monitoring at this stage becomes an integrated part of the CI/CD process, allowing developers, IT engineers and SREs to work together to deliver the highest-quality code possible.

Evolving your synthetic monitoring strategy

Running with your synthetic monitoring tools and strategy requires deliberate effort. It’s easy to stop at the walking stage, which enables basic performance management on a reactive basis, without ever reaching the proactive, optimization-oriented running stage.

To move beyond reactive synthetic monitoring and reach the running stage, you should strive to implement tools and workflows founded on the following principles.

Testing complex transactions

Getting the most out of synthetic monitoring requires tracking complex, multi-step user journeys. It’s rare for a user to initiate just a single request and then close your application. Users typically initiate an array of transactions during each visit to your site. They might…

  • Search for a product
  • Click on different items for product details
  • Add items to their cart
  • Check out
  • …and so on

Testing each of these transactions in isolation isn’t enough to guarantee an optimal user experience. To get ahead of problems before they affect your users, test the complete flow by scripting the user journey across your app. Simulate all the transactions that users could initiate, and use data produced by one transaction to drive testing for the next transaction.
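
A scripted journey can be sketched as a list of steps that share a context, so that data produced by one step drives the next. In the Python below, `run_journey` and the four step functions are hypothetical stand-ins. Real steps would drive a browser or call your application's APIs instead of returning canned values.

```python
import time


def run_journey(steps, context=None):
    """Run steps in order, timing each one and stopping at the first failure.

    Each step receives the shared context and returns a dict of new values.
    """
    context = dict(context or {})
    results = []
    for name, step in steps:
        started = time.perf_counter()
        try:
            context.update(step(context))
            ok = True
        except Exception as exc:
            ok = False
            context["error"] = repr(exc)
        elapsed_ms = (time.perf_counter() - started) * 1000
        results.append({"step": name, "ok": ok, "elapsed_ms": elapsed_ms})
        if not ok:
            break
    return results, context


# Stub steps: each one reads what an earlier step stored in the context.
def search(ctx):
    return {"product_id": "sku-123"}

def view_product(ctx):
    return {"viewed": ctx["product_id"], "price": 19.99}

def add_to_cart(ctx):
    return {"cart": [ctx["product_id"]]}

def check_out(ctx):
    return {"order_total": ctx["price"] * len(ctx["cart"])}

JOURNEY = [
    ("search", search),
    ("view product", view_product),
    ("add to cart", add_to_cart),
    ("check out", check_out),
]
```

Calling `run_journey(JOURNEY)` returns a timing record per step plus the final context. If a step fails, the journey stops there, which shows exactly where in the flow a user would have been blocked.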

Answering the “what ifs”

Synthetic monitoring lends itself to experimentation with different variables more so than other observability techniques. Be sure to take full advantage of this capability by using synthetic monitoring to test not just for standard transaction types and user engagement patterns, but also the outlying, “what if” scenarios.

  • What happens if you run your app without a content delivery network (CDN)?
  • How does one release perform relative to another?
  • How does performance change when requests originate from different geographic regions?

Being able to answer questions like these through synthetic monitoring tools will significantly enhance your ability to optimize performance.
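
One way to frame a "what if" experiment is to run the same synthetic test under each scenario and compare the results with a baseline. The sketch below reports the percent change in median response time. The function name `compare_to_baseline` is hypothetical, and the scenario names and sample values are made up for illustration.

```python
from statistics import median


def compare_to_baseline(samples_ms, baseline):
    """Percent change in median latency for each scenario vs. the baseline.

    `samples_ms` maps a scenario name to a list of latencies in
    milliseconds. Positive numbers mean the scenario is slower.
    """
    base = median(samples_ms[baseline])
    return {
        name: round(100.0 * (median(values) - base) / base, 1)
        for name, values in samples_ms.items()
        if name != baseline
    }


# Illustrative measurements only, not real data.
samples = {
    "with_cdn": [200, 210, 190],
    "without_cdn": [300, 320, 310],
    "next_release": [180, 185, 175],
}
changes = compare_to_baseline(samples, baseline="with_cdn")
```

The same comparison works for geographic regions: treat one region as the baseline and the others as scenarios.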

Robust, contextual alerts

Most synthetic monitoring tools can trigger alerts when an anomaly occurs. But simply receiving an alert that something is wrong is not enough to optimize performance proactively.

Instead, you need robust and contextual data about each alert. Screenshots that show exactly where the error occurred, or tools that trace it to specific source code, help you do this. So does the ability to run the same test from multiple locations to distinguish between localized and global failures. Automatically repeating a failed request to determine whether it fails consistently or is only an intermittent problem is crucial for enabling proactive response, too.
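
The multi-location and retry checks described above can be combined into a single classification step. In this sketch, each location reports a list of pass/fail attempts for the same test. The `classify_failure` name and the labels it returns are assumptions chosen for the example.

```python
def classify_failure(results_by_location):
    """Summarize repeated test attempts from several locations.

    `results_by_location` maps a location name to a list of booleans,
    one per attempt (True means the attempt passed).
    """
    failing = []  # every attempt failed at this location
    flaky = []    # some attempts failed, some passed
    for location, attempts in results_by_location.items():
        if not any(attempts):
            failing.append(location)
        elif not all(attempts):
            flaky.append(location)

    if failing and len(failing) == len(results_by_location):
        scope = "global"
    elif failing:
        scope = "localized"
    elif flaky:
        scope = "intermittent"
    else:
        scope = "healthy"
    return {"scope": scope, "failing": sorted(failing), "flaky": sorted(flaky)}
```

An alert that carries this summary tells the responder right away whether the problem affects everyone, one region, or only some requests.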

Environment parity

Simulated tests that run before you deploy into production are reliable only if the dev/test environment in which you run the tests reliably mirrors your production environment. If it doesn’t, you end up performing synthetic monitoring under conditions that may not accurately represent production, which greatly undercuts your ability to preempt issues that could impact real users once your release is deployed.

Address this issue by ensuring that test/dev resembles production as closely as possible. Containers can help achieve this parity by providing identical deployment environments for both testing and deployment. But your synthetic monitoring tools should also allow you to emulate production environments closely by, for example, initiating requests from the same geographic regions where your actual users are located, and testing across a variety of user device and operating system configurations.
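
A basic parity check can be as simple as comparing the settings of the two environments and listing what differs. The sketch below assumes each environment is described by a flat dictionary. The `config_drift` name and the example keys are hypothetical, and real environments would need a richer description than this.

```python
def config_drift(production, staging):
    """Return settings that differ, as {key: (production, staging)}.

    A key missing from one environment is reported with None on that side.
    """
    keys = production.keys() | staging.keys()
    return {
        key: (production.get(key), staging.get(key))
        for key in sorted(keys)
        if production.get(key) != staging.get(key)
    }


# Illustrative settings only.
prod = {"region": "eu-west", "cdn": True, "image": "shop:1.4.2"}
stage = {"region": "us-east", "cdn": True, "image": "shop:1.4.2", "debug": True}
drift = config_drift(prod, stage)
```

Here the drift report would show that staging runs in a different region and has a debug flag that production lacks, both of which could skew synthetic test results.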

Synthetic monitoring supports application management

Synthetic monitoring is a powerful part of any application management workflow. Exactly how much value you get from synthetic monitoring tools, however, depends on how many advanced features those tools offer for gaining actionable insight into the complex journeys your users take as they interact with your applications.

Testing only individual requests, or focusing only on overall uptime or response times, deprives you of the ability to take a proactive approach to performance management and the user experience.

Take synthetic monitoring to the next level with Splunk Synthetic Monitoring.

The original version of this blog was published by Billy Hoffman. This posting does not necessarily represent Splunk's position, strategies, or opinion.

Posted by Billy Hoffman

For over 15 years, Billy has spoken internationally at conferences and has written two books on how to build fast and secure websites. While CTO at Rigor, Billy helped customers create strong performance cultures and understand the importance of performance to the business. Following Rigor's acquisition by Splunk, Billy focuses on improving and integrating the capabilities of Splunk's APM, RUM, and Synthetics products.