On Tuesday, June 8th, the content delivery network Fastly experienced an outage that made large swaths of the web unavailable for nearly an hour. To focus on the positive, this outage can serve as a wake-up call for Observability teams: it shows how much modern sites depend on resources beyond their immediate control, and how hard it is to "observe" these kinds of issues with an incomplete Observability mindset. In this blog post, I will walk through the Fastly outage, examine how traditional monitoring technologies would have responded to it, and show why adopting Digital Experience Monitoring inside your Observability practice is crucial to detecting and responding to these types of issues.
Inside a CDN Outage
Fastly is a Content Delivery Network (CDN). While today’s CDNs serve many different functions, their primary job is to cache copies of a website's content or resources in data centers around the world, so that this data is geographically closer to a website's visitors when they request it, thus reducing the latency and improving performance. CDNs do this by sitting "in front" of a website or API. For example, when someone accesses splunk.com, the requests for this site’s images, CSS, fonts, and perhaps even its HTML markup are all served by the CDN instead of by the origin server.
In the diagram below, we can see how the CDN sits between the site visitor and the origin server. When the browser requests main.js, the CDN already has a copy, and returns it more quickly than the origin server would have. When the browser makes an API call, the CDN passes that request through to the origin server.
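As a rough illustration of that request flow, here is a minimal Python sketch. It is a toy model, not how Fastly or any real CDN is implemented: the class names (Origin, CdnEdge), the paths, and the "static assets are already cached" starting state are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Origin:
    """Hypothetical stand-in for the site's own servers."""
    requests_served: int = 0

    def handle(self, path: str) -> str:
        self.requests_served += 1
        return f"origin response for {path}"


@dataclass
class CdnEdge:
    """Toy edge node: answers from its cache when it can, otherwise forwards."""
    origin: Origin
    cache: dict = field(default_factory=dict)

    def handle(self, path: str) -> tuple:
        if path in self.cache:
            # Cache hit: the origin never sees this request.
            return ("cache", self.cache[path])
        # Cache miss (for example an API call): pass through to the origin.
        return ("origin", self.origin.handle(path))


origin = Origin()
edge = CdnEdge(origin, cache={"/main.js": "cached copy of main.js"})

static_result = edge.handle("/main.js")   # served from the edge cache
api_result = edge.handle("/api/cart")     # forwarded to the origin

print(static_result)
print(api_result)
print("requests that reached the origin:", origin.requests_served)
```

Only one of the two requests reaches the origin. Every request, cached or not, still has to go through the edge first.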
On Tuesday, due to an internal issue, Fastly could not respond to any requests. Because those requests never passed through to the origin server, the origin couldn't serve them either. The diagram below illustrates this breakdown:
What a User Experiences During a CDN Outage
The Splunk Observability team has a sample online store, BroomsToGo, that was actually impacted by the CDN outage this week. It runs on Shopify, which uses Fastly, so I was able to use our Observability tools to see first-hand what users experienced. Below is a waterfall chart from Splunk Synthetic Monitoring that shows the requests the browser made when it tried to load BroomsToGo:
What an Incomplete Observability Practice Sees During a CDN Outage
Consider the impact if BroomsToGo were a legitimate store, instead of a fun sample application. Would the team have been able to detect this issue using only a few of the traditional monitoring tools?
Imagine Melanie, an SRE for BroomsToGo, who is sipping her coffee when Ethan from Bizdev calls in a panic: "The website is down! No one can check out! We are losing thousands of sales!"
Impossible! Melanie thinks as she pulls up her traditional monitoring tools. Everything is green. There are no alerts posted in the Ops channel in Slack. Surely her infrastructure or application monitors would have detected something?
Unfortunately not. The CDN sits in front of some or all of the application, so a CDN failure happens where traditional tools are not looking, which undermines the ability to "observe" the outage via traditional means. Here is how traditional monitoring tools would have handled this outage:
- Traditional Infrastructure Monitoring (IM): IM provides insights on the health of your infrastructure, such as containers, servers, or cloud resources. Since a CDN outage prevents most, if not all, visitor traffic from accessing your infrastructure, SRE teams using IM would have seen green dashboards with no issues. No company-controlled infrastructure was stressed.
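To make the "green dashboards" point concrete, here is a small Python sketch of a threshold-style alert rule. The metric names, thresholds, and sample values are invented for illustration and are not taken from any real monitoring product or from the BroomsToGo environment.

```python
def evaluate_alerts(metrics: dict, thresholds: dict) -> list:
    """Return the names of metrics that exceed their alert threshold."""
    return [
        name
        for name, limit in thresholds.items()
        if metrics.get(name, 0.0) > limit
    ]


# Hypothetical "alert when too high" thresholds.
THRESHOLDS = {
    "cpu_percent": 85.0,
    "memory_percent": 90.0,
    "http_5xx_per_min": 20.0,
}

# Made-up sample readings from the origin servers.
normal_day = {"cpu_percent": 45.0, "memory_percent": 60.0, "http_5xx_per_min": 1.0}
during_cdn_outage = {"cpu_percent": 3.0, "memory_percent": 40.0, "http_5xx_per_min": 0.0}

alerts_normal = evaluate_alerts(normal_day, THRESHOLDS)
alerts_outage = evaluate_alerts(during_cdn_outage, THRESHOLDS)

print("alerts on a normal day:", alerts_normal)
print("alerts during the CDN outage:", alerts_outage)
```

Both lists come back empty. With almost no traffic reaching the origin, every "too high" rule stays quiet, and the outage looks like an unusually calm day.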
Where Melanie would have seen a problem is in her site analytics. Checkout rates would have been falling, by an amount that depends on how widespread the outage was, and traffic and engagement numbers would have been falling with them. If she had alerts set up for social media mentions of her brand, she would also have seen people starting to complain.
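One way to turn that analytics signal into an alert is to compare the current checkout rate against a recent baseline. The sketch below is only an assumption about how such a check might look: the function names, the 50% drop threshold, and the sample rates are all hypothetical.

```python
def relative_drop(baseline: float, current: float) -> float:
    """Fraction by which `current` has fallen below `baseline` (0.0 if it has not)."""
    if baseline <= 0:
        return 0.0
    return max(0.0, (baseline - current) / baseline)


def checkout_alert(baseline_per_min: float, current_per_min: float,
                   max_drop: float = 0.5) -> bool:
    """Alert when checkouts fall by at least `max_drop` relative to the baseline."""
    return relative_drop(baseline_per_min, current_per_min) >= max_drop


# Made-up numbers: 40 checkouts per minute is the usual rate for this hour.
print(checkout_alert(40.0, 38.0))  # small dip, within normal variation
print(checkout_alert(40.0, 10.0))  # sharp fall, worth waking someone up
```

A check like this reacts to the business impact rather than to server health, which is why it can fire even when every infrastructure dashboard is green.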
Expanding Observability with Digital Experience Monitoring
Melanie mistakenly assumed that the infrastructure and applications she controls represent all important aspects of the app's health and availability. Unfortunately, this is not the case. Modern applications have spread beyond your control. They nearly always include third-party components and run in opaque environments. In some cases, such as when marketing adds a new chat widget to the website, SREs might not even know of all the infrastructure, apps, and dependencies that make up their site. While legacy IM and APM are fantastic tools to help detect and diagnose issues, these tools cannot measure what they cannot see, and modern applications have a lot of surface area beyond SREs’ control.
In fact, because I used Splunk Synthetic Monitoring on the BroomsToGo site, I got an alert about this outage. Synthetic monitoring detected an increase in client-side errors exactly when this week’s CDN outage occurred.
In the screenshot above, the teal-colored line represents the First Contentful Paint and the black line shows the count of browser errors encountered while downloading resources. We can see that at the moment when the outage started, assets such as CSS and JS failed to load, thus increasing the error count. Meanwhile, the paint time actually improved, because all those request failures meant there was less content to draw!
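The error-count signal described here can be sketched in a few lines of Python. This is not the Splunk Synthetic Monitoring API: ResourceResult, the example.com URLs, the status codes, and the baseline and tolerance values are all made-up assumptions that only illustrate the idea of counting failed resource requests in a synthetic page load.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceResult:
    """One request made by the synthetic browser while loading the page."""
    url: str
    status: Optional[int]  # None means the request failed before any response


def count_resource_errors(results: list) -> int:
    """Count requests that failed outright or returned a 4xx/5xx status."""
    return sum(1 for r in results if r.status is None or r.status >= 400)


def should_alert(error_count: int, baseline: int, tolerance: int = 2) -> bool:
    """Alert when errors exceed the usual baseline by more than `tolerance`."""
    return error_count > baseline + tolerance


# Hypothetical page loads before and during the outage.
before_outage = [
    ResourceResult("https://cdn.example.com/main.js", 200),
    ResourceResult("https://cdn.example.com/site.css", 200),
    ResourceResult("https://cdn.example.com/logo.png", 200),
]
during_outage = [
    ResourceResult("https://cdn.example.com/main.js", 503),
    ResourceResult("https://cdn.example.com/site.css", 503),
    ResourceResult("https://cdn.example.com/logo.png", None),
]

errors_before = count_resource_errors(before_outage)
errors_during = count_resource_errors(during_outage)

print(errors_before, should_alert(errors_before, baseline=0))
print(errors_during, should_alert(errors_during, baseline=0))
```

The check keys on the error count rather than on paint time. As the chart shows, paint time can look better during an outage because there is less content to draw.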
Of course, DEM solutions cannot see inside your infrastructure and application (in fact, if a client, be it a real user or a synthetic browser, can see inside your app, you have bigger issues!). This is why DEM must be combined with IM, APM, and other solutions in an Observability suite such as Splunk Observability Cloud. This combination provides comprehensive visibility into your application, infrastructure, and user experience, keeping you on top of issues no matter where in your stack they arise.
The Fastly outage this week is a clear example of how content and infrastructure beyond your control can make your site fail and impact your business. By expanding your point of view and measuring experience from the client's perspective using Digital Experience Monitoring, you can detect these types of issues and respond quickly to minimize their impact. As you develop a comprehensive Observability strategy, remember to include Digital Experience Monitoring.
To get started with Splunk Synthetic Monitoring, sign up for a free trial today.