This post was co-authored by Sharmin Yousuf, a summer 2021 product marketing management intern focused specifically on search engine optimization for Splunk’s Application Performance Monitoring and Digital Experience Monitoring pages.
There is nothing worse than waking up to an angry customer complaining that your website is failing to accept their payment at checkout. That is worrying, since unprocessed payments can mean lost revenue; with Tag Spotlight, however, this should be a relatively quick problem to dissect. The key question here is whether this is an issue that all our customers are facing or an isolated event.
What is Tag Spotlight?
Before we dive any further, let’s address the elephant in the room. What is Tag Spotlight, and how will it help solve problems? Tag Spotlight is a critical component of Splunk APM, enabled by its infinite cardinality and full-fidelity architecture. Essentially, it is a one-stop shop for understanding service behavior, allowing users to visualize errors and latency for any given service through the lens of all the tags associated with it.
Want to skip the how-to and see for yourself? Start a free trial of Splunk Observability Cloud instantly. No credit card required.
Now that we have a fundamental understanding of Tag Spotlight, let’s look into our service map to see precisely where the problem lies. Splunk’s service maps provide a high-level view of all the services of an application and visually display how the services interact with one another. The service map updates automatically, allowing DevOps teams to understand their apps in real-time.
In the example below, we have an online store with several inter-connected microservices. The service map uses color-coding to depict the services currently experiencing errors, enabling SREs to identify the root cause of an issue at a glance.
1: Interactive service map displaying errors that have occurred in the selected time frame (200 min).
2: Service Map Legend
Based on our service map (Image 1), we can immediately identify the culprit as the path from our frontend service to the checkout service and finally to the payment service. By selecting the payment service, we can see a top-level view of Tag Spotlight (Image 3), where key details such as errors, latencies, and the version number are displayed. In our case, several errors exist in the latest version (v350.10), which could be the source of our problem (Image 4).
Nonetheless, let’s leverage Tag Spotlight to conduct a deep-dive and determine what exactly is causing the error and whether our DevOps teams will need to allocate resources to resolve it.
3: Paymentservice’s top-level view of Tag Spotlight in the right pane.
4: Paymentservice’s top-level view of Tag Spotlight in the right pane filtered by version number.
5: Tag Spotlight view of Frontend service.
We can start by clicking into the Tag Spotlight view of the frontend service. Here we see several errors (pink peaks on the graph in Image 5) that our customers experienced over the past 200 min. To dissect these errors further, we can filter out the successful requests and specify the exact span tags we are interested in, i.e., cart/checkout failures (Images 6, 7).
6: Tag Spotlight view of Frontend service filtered by errors.
7: Tag Spotlight view of Frontend service filtered by cart/checkout endpoint.
Upon selecting an error (pink peak), we can see the exact trace ID, the error-initiating operation, the start time, and all the impacted services (Image 6). In an alert storm, where errors abound, Tag Spotlight also links to the full set of traces: select “View More Traces in Full Trace Search.” There, you can slice and dice your trace data with any combination of tags to better understand and assess the errors. With Splunk APM’s unique NoSample™ full-fidelity tracing, you can rest assured that all your traces are ingested and stored with no sampling whatsoever, allowing you to conduct a thorough analysis.
For our purposes, let’s track down trace #130f2bb7928cbfd0 across the impacted services. From the frontend service itself, we can see that a customer hit the cart/checkout endpoint at 10:10 a.m., and the entire process lasted for 54.7 minutes (Image 6). Following through, this same trace was found exactly 54.7 minutes later, at 11:04 a.m., in the checkout service (Image 8).
8: Tag Spotlight view of Checkout service.
9: Failure to charge card errors in checkout service.
Taking a closer look at the errors in the checkout service, we can observe that they are all affiliated with the /placeOrder endpoint. Hovering over the error message tag specifically, we can confirm that all the errors are identical and due to a failure in charging customer credit cards (Image 9).
Similarly, the latency view of Tag Spotlight provides detailed insight into how long a process has been running (Image 10). The percentiles (p50, p90, p99) help quantify how many users are affected by a slow endpoint and help identify service level indicators (SLIs). In our case, the checkout service has been awaiting a response for 54.7 minutes.
10: Latency corresponds to errors in checkout service.
11: Tag Spotlight view of payment service.
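To make the latency percentiles mentioned above more concrete, here is a minimal sketch of how p50, p90, and p99 could be computed from a list of request latencies. It uses the simple nearest-rank definition; the sample values and the `percentile` helper are hypothetical illustrations, not Splunk APM’s implementation or data from this investigation.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds (illustrative only).
latencies_ms = [120, 95, 110, 105, 3000, 130, 98, 101, 115, 125]

for p in (50, 90, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

In this made-up sample, a single very slow request leaves p50 and p90 almost untouched but dominates p99, which is why the higher percentiles are useful for spotting the minority of users stuck behind a slow endpoint.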
Continuing our investigation into the payment service, we are immediately greeted with a storm of errors. To better visualize the behavior of trace #130f2bb7928cbfd0, we can zoom in to the minute of origin (10:10 a.m.) to get a detailed understanding of the trace (Images 11, 12, 13).
12: Tag Spotlight view of errors in the selected time frame (highlighting trace #130f2bb7928cbfd0).
13: Tag Spotlight view of latency in the selected time frame (highlighting trace #130f2bb7928cbfd0).
Now that we have a granular view of our trace, we can go ahead and filter for the exact endpoint causing the error, /Charge. Here, we see that the error is not restricted to a single tenant level: it persists across all three tiers (gold, silver, and bronze). This is definitely a cause for concern. With all tenant levels facing errors, it is clear that this is not an isolated event but an issue with our payment service.
Zooming back out to the original time frame (last 200 min), we see that the errors only exist in v350.10 (Image 14). We can go one step further and filter by the version tag to verify the behavior of each version. Version 350.9 appears spotless with 0% errors (Image 15), whereas v350.10 shows 100% errors (Image 16).
14: Tag Spotlight view of payment service in the past 200 minutes, including all active versions.
15: Tag Spotlight view of v350.9 in the past 200 minutes (0 errors in 396 requests).
16: Tag Spotlight view of v350.10 in the past 200 minutes (818 errors in 818 requests).
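Conceptually, breaking errors down by tenant level or by version amounts to grouping spans by a tag and computing an error rate per tag value. The sketch below illustrates that idea only; the span record shape, the `error_rate_by_tag` helper, and the way tenants are assigned to requests are all hypothetical assumptions, not Splunk APM’s API or data model. The request and error counts per version are the ones shown in Images 15 and 16.

```python
from collections import defaultdict

def error_rate_by_tag(spans, tag):
    """Group spans by the value of one tag and return
    {tag_value: (errors, requests, error_rate)}."""
    counts = defaultdict(lambda: [0, 0])  # tag value -> [errors, requests]
    for span in spans:
        value = span["tags"].get(tag, "unknown")
        counts[value][1] += 1
        if span["error"]:
            counts[value][0] += 1
    return {v: (e, n, e / n) for v, (e, n) in counts.items()}

# Hypothetical span records. Counts per version follow the captions above;
# the round-robin tenant assignment is an assumption for illustration.
tenants = ("gold", "silver", "bronze")
spans = [
    {"tags": {"version": "v350.9", "tenant": tenants[i % 3]}, "error": False}
    for i in range(396)
] + [
    {"tags": {"version": "v350.10", "tenant": tenants[i % 3]}, "error": True}
    for i in range(818)
]

for tag in ("version", "tenant"):
    print(f"Error rate by {tag}:")
    for value, (errors, requests, rate) in error_rate_by_tag(spans, tag).items():
        print(f"  {value}: {errors}/{requests} ({rate:.0%})")
```

Grouping by `tenant` shows errors in every tier, while grouping by `version` isolates them entirely to v350.10, which mirrors the reasoning used in the investigation above.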
In conclusion, it appears that the root cause of the issue is in v350.10 of the payment service across all tenant levels. We can temporarily address the situation by diverting customer traffic to v350.9 while our payment service owners work on remediating v350.10.
How Can I Get Started with Tag Spotlight?
Tag Spotlight is a key feature of Splunk APM, which is a part of the Splunk Observability Cloud. You can sign up to start a free trial of the suite of products – from Infrastructure Monitoring and APM to Real User Monitoring and Log Observer. Get a real-time view of your environment and start solving problems with your microservices faster today!