As a principal engineer on the Splunk Real User Monitoring (RUM) team responsible for measuring and monitoring our service-level agreements (SLAs) and service-level objectives (SLOs), I depend on observability to measure, visualize and troubleshoot our services. Our key SLA is to guarantee that our services are available and accessible 99.9% of the time. Our application is as complex as any modern application: multiple micro-frontends backed by a shared GraphQL server that orchestrates requests across a broad range of microservices. The success of our user experience largely depends on the ~850 million spans per minute sent by our customers’ apps being ingested by our pipelines and processed downstream in a timely manner, so that insights are available to our customers in our UI application. We are committed to our SLAs and SLOs, and we need to be alerted promptly when we don’t meet them so that we can take swift remedial action.
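To put the 99.9% figure in perspective, here is a minimal sketch of the downtime such a target leaves room for. The 30-day window and the function name `downtime_budget` are assumptions chosen for illustration; the actual SLA measurement window is not stated here.

```python
from datetime import timedelta


def downtime_budget(availability_target: float, window: timedelta) -> timedelta:
    """Return the maximum downtime allowed in `window` while still meeting the target."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    return window * (1.0 - availability_target)


# Assumption: a 30-day measurement window, used only as an example.
budget_30d = downtime_budget(0.999, timedelta(days=30))
print(f"Allowed downtime over 30 days: {budget_30d.total_seconds() / 60:.1f} minutes")
```

Under that assumption, a 99.9% target leaves roughly 43 minutes of downtime per 30 days, which is why prompt alerting matters so much.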
Here is how we used Splunk Observability Cloud to detect a critical incident in production and analyze the root cause in a matter of minutes.
Users Aren’t Seeing Their Application Dashboard, Resolve ASAP!
The incident was discovered when our on-call engineer received alerts from two sources:
1. The RUM detector for GraphQL errors fired an alert indicating that the number of failed GraphQL requests was above our acceptable threshold:
Alert fired by the RUM detector when RUM GraphQL errors went above the defined threshold
2. The real browser checks configured in Splunk Synthetics, which test the RUM UI periodically, triggered an alert after 3 consecutive failures:
Alert received from Splunk Synthetics after 3 consecutive real browser check failures
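The two alerting conditions described above can be sketched in a few lines. This is an illustration of the logic only, not how Splunk detectors or Synthetics are implemented or configured; the function names and the threshold value in the usage lines are hypothetical.

```python
from typing import Iterable


def threshold_breached(failed_requests: int, threshold: int) -> bool:
    """Static-threshold rule: fire when the failure count exceeds the threshold."""
    return failed_requests > threshold


def consecutive_failures_alert(results: Iterable[bool], required: int = 3) -> bool:
    """Fire once `required` check runs in a row have failed.

    `results` holds one entry per run in chronological order (True = passed).
    """
    streak = 0
    for passed in results:
        streak = 0 if passed else streak + 1
        if streak >= required:
            return True
    return False


# Hypothetical values for illustration only.
print(threshold_breached(failed_requests=12, threshold=10))
print(consecutive_failures_alert([True, False, False, False]))
```

Requiring several consecutive failures trades a slower alert for fewer false alarms from a single flaky run.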
The alerts provided all the information required to understand what the actual problem was:
- Splunk RUM couldn’t load the “Application summary dashboard”.
- Our customers were viewing an error page as captured by the failed synthetic test.
- The issue was specific to only one of our production environments - us1.
- Our GraphQL requests were failing with a 403 HTTP response code.
- The problem was intermittent because some synthetic test runs actually succeeded in us1 even after the alert was triggered.
Also, drilling into the failed test run in Splunk Synthetics helped visualize the browser requests that were triggered for that specific run. It became quite evident that all the GraphQL requests in that run failed with a 403 response code.
Browser requests for the failed run viewed in Splunk Synthetics
Assess the Impact
The next step was to determine the scale of the issue so that we could update our status page for our customers. We immediately pivoted to Splunk RUM’s Tag Spotlight experience, where we could view the aggregate error counts on our GraphQL endpoint without scouring raw data or crafting complex queries. Tag Spotlight broke down the errors across several dimensions such as environment, HTTP status code and application. We were able to confirm that our other production environments were stable and that the problem was specific to the us1 environment.
Splunk RUM Tag Spotlight experience showing 403 errors for GraphQL API calls in the “us1” environment
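Conceptually, this kind of breakdown is a count of failed requests grouped by each tag. The sketch below shows that idea on a handful of made-up records; the field names, the `other-env` environment and the `rum-ui` application name are hypothetical, and this is not how Tag Spotlight itself is implemented.

```python
from collections import Counter, defaultdict


def count_errors_by_tag(requests, dimensions):
    """Count failed requests (HTTP status >= 400) per value of each dimension."""
    counts = defaultdict(Counter)
    for request in requests:
        if request["http_status"] >= 400:
            for dimension in dimensions:
                counts[dimension][request[dimension]] += 1
    return counts


# Made-up sample records for illustration only.
sample_requests = [
    {"environment": "us1", "http_status": 403, "application": "rum-ui"},
    {"environment": "us1", "http_status": 403, "application": "rum-ui"},
    {"environment": "us1", "http_status": 200, "application": "rum-ui"},
    {"environment": "other-env", "http_status": 200, "application": "rum-ui"},
]

errors = count_errors_by_tag(sample_requests, ["environment", "http_status"])
for dimension, counter in errors.items():
    print(dimension, dict(counter))
```

Seeing every error fall under a single environment value and a single status code is what makes the scope of an incident obvious at a glance.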
Find the Root Cause
The immediate task was to figure out what was going on with our GraphQL server. We were anxious to answer these questions:
- Was there an issue with our GraphQL server?
- If yes, was it affecting other APIs as well?
- If not, where was the real problem?
To obtain a big-picture overview of our back-end services, we opened Splunk APM’s dependency map to explore all services upstream and downstream of our shared GraphQL server. It quickly became clear that the problem was with our shared internal gateway service and not with the GraphQL server itself.
Splunk APM’s dependency map showing errors in the shared gateway service (upstream of GraphQL server)
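The reasoning a dependency map supports can be sketched as a simple rule: a service that reports errors while none of the services it calls report errors is a likely origin. The topology and service names below are assumptions made for illustration (a gateway sitting in front of the GraphQL server), and this is not Splunk APM's algorithm.

```python
def likely_error_origins(calls, services_with_errors):
    """Return erroring services whose own callees show no errors.

    `calls` maps each service to the list of services it calls.
    """
    origins = []
    for service in sorted(services_with_errors):
        callees = calls.get(service, [])
        if not any(callee in services_with_errors for callee in callees):
            origins.append(service)
    return origins


# Hypothetical topology and service names for illustration only.
call_graph = {
    "rum-ui": ["gateway"],
    "gateway": ["graphql"],
    "graphql": ["service-a", "service-b"],
}
erroring = {"rum-ui", "gateway"}

print(likely_error_origins(call_graph, erroring))
```

In this made-up example the UI only fails because the gateway fails, while the GraphQL server behind the gateway is healthy, so the gateway is the place to look first.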
Additionally, Splunk APM presented example traces that gave us insight into the specific components within the gateway that were throwing a 403 Forbidden error. The example traces made it much easier to escalate the issue to the team that owned the internal gateway service, eliminating the need to look for a needle in a haystack.
Splunk APM’s trace view displaying a trace corresponding to the 403 error
Fix and Finish. All is Well Again.
We partnered with the owning team to prioritize a stopgap solution as soon as possible. Once the solution was rolled out, the Splunk Observability products also let us validate that the incident was fully resolved.
Splunk Synthetics showed successful test runs in us1 after the fix was rolled out.
Splunk Synthetics displaying successful test runs after a fix was rolled out
Splunk RUM’s Tag Spotlight page began to report only 200 HTTP status codes for all /rum GraphQL requests, which was a huge relief!
Splunk RUM Tag Spotlight displaying 200 HTTP status code for all GraphQL requests after the issue was resolved
Hitting SLAs & SLOs With Observability
As an engineering team that owns products used by real customers and deploys features and enhancements to production frequently, we’ve taken a few measures to monitor our application using Splunk Observability Cloud. These measures have had a large return on investment and continue to help us meet our SLAs and SLOs.
- Set up real browser checks in Splunk Synthetics – these checks run every 5 minutes across all of our production environments, access critical parts of our application and notify us of any issues or unexpected slowness.
- Instrument our web application with Splunk Real User Monitoring – we send telemetry data to Splunk RUM which in turn provides us insights about the performance and the health of our front-end user experience.
- Send traces from our back-end microservices into Splunk Application Performance Monitoring – this enables monitoring of our distributed systems and provides full-fidelity access to all of our application data.
- Build dashboards on important metrics to provide useful and actionable insights into our application – these help us monitor error rates, latency, web vitals, etc.
- Set up detectors that trigger alerts when certain conditions are met – these notify on-call engineers when specified thresholds are exceeded.
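The check interval and the number of consecutive failures required before alerting together bound how quickly a synthetic check can surface an outage. The sketch below works through that arithmetic, assuming the 3-consecutive-failure rule applies to checks running every 5 minutes from a single location, and ignoring the duration of each run and notification delivery time. The function names are hypothetical.

```python
from datetime import timedelta


def best_case_detection_delay(check_interval: timedelta, consecutive_failures: int) -> timedelta:
    """Outage starts just before a run: only (n - 1) further intervals are needed."""
    return check_interval * (consecutive_failures - 1)


def worst_case_detection_delay(check_interval: timedelta, consecutive_failures: int) -> timedelta:
    """Outage starts just after a run: wait one full interval, then (n - 1) more."""
    return check_interval * consecutive_failures


interval = timedelta(minutes=5)
print(best_case_detection_delay(interval, 3))
print(worst_case_detection_delay(interval, 3))
```

Under these assumptions, a hard outage would be flagged by the synthetic checks somewhere between 10 and 15 minutes after it begins, which is one reason to pair them with detectors on real user traffic.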
Observability to Learn, Refine and Guide Future Development
The nice thing about getting started with our observability journey was that we could start small by focusing on the critical aspects of our application, then learn and refine our methods as we explored and observed things in production. With incremental efforts, we were able to add value and obtain a shared, comprehensive understanding of our system’s architecture, health and performance over time.
Also, observability with Splunk Observability Cloud helped create more accurate post-incident reviews as all involved parties were able to examine documented records of real-time system behavior instead of piecing events together from siloed, individual sources. This data-driven guidance helped our teams understand why incidents occurred so we could better prevent and handle future incidents.
If you’re interested in empowering your teams to be data-driven and optimizing performance and productivity, try Splunk Observability Cloud today.