One of the biggest challenges that DevOps teams face is connecting their efforts with the priorities of business leaders. In conversations we’ve had, developers and SREs described the need to show business and engineering leaders that they are investing their time in solving the right problems, and that solving these problems leads to better business outcomes. Business and engineering leaders see the same problem from a different angle: they consistently express a desire to understand their business KPIs in real time, and to understand how the different microservices affect these KPIs.
The legacy application performance monitoring (APM) vendors approached this by tying important business KPIs to key transactions in the applications, and then monitoring the latency and error rate of those transactions. Their approach was designed for the old world in which they still live, and they defined business transactions based on their entry point to the system. That approach worked well for legacy monolithic applications. However, the world has evolved, and new applications are built on microservices and the cloud, with each app composed of tens or even hundreds of microservices. In this new and exciting world, the entry point-based approach of the legacy vendors is irrelevant: transactions that correspond to totally different business KPIs can share the same entry point yet call totally different downstream services.
This is fundamentally an observability problem, and it requires an observability approach to solve it. When dealing with microservices, a much deeper level of understanding is required. Splunk APM is the ONLY APM solution that can track important business KPIs in a microservices-based environment. We call this feature Business Workflows, and it provides the capability to group together the different microservices that make up important business transactions or KPIs. Before diving into more details about this feature, let me give an example.
Consider the application below, which performs two main functions: catalog and checkout. There is only one entry point into the system, however: the API service. Measuring the latency and error rate of the API can provide some information, but it cannot give business and engineering leaders insight into how checkout-related or catalog-related transactions are doing. Similarly, from the developers’ perspective, the teams that own the Authorization and the Idp services, which are used by both types of transactions, need to understand how their services are impacting these two KPIs.
Image 1: An application composed of multiple microservices
Measuring Important KPIs with Business Workflows
Using an easy, built-in configuration wizard, the Business Workflows capability groups together relevant end-to-end traces/transactions based on any microservice, tag, or initiating operation. Defining workflows that correlate directly with important business KPIs provides a common metric for business and engineering leaders and for DevOps teams. Once business workflows are defined, you get out-of-the-box dashboards that show the top workflows in terms of RED metrics (Rate, Error, Duration), as well as dashboards that track the RED metrics for individual workflows. From these charts you can also jump, with context preserved, to the service map or to creating workflow-based alerts.
Based on the example discussed earlier, we can naturally create two business workflows, Checkout and Catalog. We can also get an even deeper level of granularity by creating a third business workflow, called Shipping, which tracks all the traces that go through the Checkout service AND the Shipping service:
Image 2: The Checkout, Shipping, and Catalog workflows
All three workflows track end-to-end traces that start with the API service, but now those traces are grouped based on whether they go through the Catalog service or the Checkout service, and then whether or not they continue to the Shipping service. Indeed, some of our customers have the rate of successful checkouts as one of the KPIs they track.
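The grouping described above can be sketched in a few lines of Python. This is only an illustration and not how Splunk APM is implemented: the Trace record, the lowercase service names, the `classify` and `red_metrics` helpers, and the choice to treat the three workflows as mutually exclusive are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """Hypothetical, simplified view of one end-to-end trace."""
    trace_id: str
    services: frozenset  # names of every service the trace passed through
    duration_ms: float
    has_error: bool


def classify(trace: Trace) -> str:
    """Assign a trace to a workflow using the rules from the example."""
    if "checkout" in trace.services:
        # Checkout traces that continue to Shipping get their own workflow.
        return "Shipping" if "shipping" in trace.services else "Checkout"
    if "catalog" in trace.services:
        return "Catalog"
    return "Unassigned"


def red_metrics(traces, window_seconds: float) -> dict:
    """Compute Rate, Error, and Duration figures per workflow."""
    grouped = defaultdict(list)
    for trace in traces:
        grouped[classify(trace)].append(trace)
    return {
        name: {
            "rate_per_s": len(group) / window_seconds,
            "error_rate": sum(t.has_error for t in group) / len(group),
            "avg_duration_ms": sum(t.duration_ms for t in group) / len(group),
        }
        for name, group in grouped.items()
    }


# Every sample trace enters through the API service, as in the example.
SAMPLE_TRACES = [
    Trace("t1", frozenset({"api", "catalog"}), 40.0, False),
    Trace("t2", frozenset({"api", "authorization", "idp", "checkout"}), 120.0, True),
    Trace("t3", frozenset({"api", "authorization", "idp", "checkout"}), 80.0, False),
    Trace("t4", frozenset({"api", "checkout", "shipping"}), 200.0, False),
]

if __name__ == "__main__":
    for workflow, metrics in sorted(red_metrics(SAMPLE_TRACES, 60.0).items()):
        print(workflow, metrics)
```

Note that all four sample traces share the same entry point, so an entry point-based grouping would lump them together, while the service-based rules separate them into three workflows with their own RED metrics.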
As mentioned earlier, you can easily set workflow-based alerts if there’s a problem. Similar to alerting on individual services, you can create an alert for a Business Workflow that fires when its error rate or duration reaches a static threshold or changes suddenly, or you can leverage our AI to create alerts based on historical anomalies. Splunk APM is unique in ingesting, analyzing and storing ALL traces with no sampling whatsoever, so you can be sure that no issue goes undetected.
Image 3: Out-of-the-box dashboard for the Checkout business workflow
Image 4: Defining an alert based on business workflows
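To make the first two alert conditions concrete, here is a minimal Python sketch of a static-threshold check and a sudden-change check over a series of per-minute error rates. It is not Splunk's detector logic: the function names, the `recent` and `factor` parameters, and the sample values are all hypothetical, and the AI-driven historical anomaly option is not modeled here.

```python
from statistics import mean


def static_threshold_alert(values, threshold: float) -> bool:
    """Fire when the most recent value exceeds a fixed threshold."""
    return bool(values) and values[-1] > threshold


def sudden_change_alert(values, recent: int = 3, factor: float = 2.0) -> bool:
    """Fire when the recent average exceeds `factor` times the earlier baseline."""
    if len(values) <= recent:
        return False  # not enough history to form a baseline
    baseline = mean(values[:-recent])
    current = mean(values[-recent:])
    return current > factor * baseline


# Hypothetical per-minute error rates for the Checkout workflow.
CHECKOUT_ERROR_RATE = [0.01, 0.02, 0.01, 0.02, 0.01, 0.06, 0.07, 0.08]

if __name__ == "__main__":
    print("static threshold:", static_threshold_alert(CHECKOUT_ERROR_RATE, 0.05))
    print("sudden change:", sudden_change_alert(CHECKOUT_ERROR_RATE))
```

The same two checks apply equally to a duration series; only the input values and the threshold change.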
Developers, for their part, also want to track how the services they’re responsible for impact the overall business KPIs. Once Business Workflows have been defined, developers can easily see whether their service contributes errors or latency to the different workflows it is a part of, simply by clicking on it in the service map:
Image 5: After clicking on the Idp service we can see here that all the errors are related to the Checkout workflow, and no errors are related to the Catalog or the Shipping workflows.
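Conceptually, this per-service breakdown is a count of the errors that originate in one service, grouped by the workflow of the trace they belong to. The Python sketch below illustrates the idea under assumed inputs; the `(service, workflow)` record shape, the sample data, and the `errors_by_workflow` helper are hypothetical and do not reflect Splunk APM's data model.

```python
from collections import Counter

# Hypothetical error records: (service where the error originated, workflow of the trace).
ERRORS = [
    ("idp", "Checkout"),
    ("idp", "Checkout"),
    ("payment", "Checkout"),
    ("catalog", "Catalog"),
]


def errors_by_workflow(errors, service: str) -> Counter:
    """Count the errors originating in `service`, grouped by workflow."""
    return Counter(workflow for origin, workflow in errors if origin == service)


if __name__ == "__main__":
    print(errors_by_workflow(ERRORS, "idp"))
```

With this sample data, every error that originates in the idp service belongs to the Checkout workflow, which mirrors the situation shown in Image 5.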
Using Workflows to Accelerate Troubleshooting
Business Workflows help business leaders and DevOps teams use the same language, but we didn’t stop there. Once an issue is detected, Splunk APM leverages Business Workflows to help SREs and developers troubleshoot the problem faster. In the example in Image 1, the red dot in the API service indicates a problem. Image 2 shows that there are no problematic traces in the Shipping or Catalog workflows (no red dots), and that all the problems originate from the Checkout workflow. Even in this simple example, you could probably have guessed that the majority of errors come from the Checkout business workflow, but it is not at all obvious that there would be no errors in the Shipping workflow. The problem becomes even more acute with a much more complicated service map containing hundreds of services. That is why the ability to filter the service map by workflow gives DevOps teams a powerful way to see all the services involved in each kind of transaction, to understand issues in their applications, and to troubleshoot them more easily. To simplify troubleshooting even further, once you choose a workflow, Splunk APM automatically surfaces the top tags associated with its errors and latency:
Image 6: Top error tags associated with the Checkout business workflow. At the bottom right, it seems that Kubernetes node 6 is the culprit…
Alternatively, if you’re the owner of a service, such as the Idp service in our example (whose solid red dot means it is the root cause of errors), you can use Tag Spotlight to understand which business KPIs you’re impacting, and how:
Image 7: The bottom right card in Tag Spotlight shows that errors in Idp impact only the Checkout workflow, and the second box from the right on the top row shows that all these errors are due to invalid credentials.
But that’s not all. Since Splunk APM supports infinite cardinality, you can slice and dice your trace data based on any combination of tags, including business workflows (Image 8). And because Splunk APM stores all the traces without sampling, you can always get exemplar traces for those tag combinations (in Image 9, I combined the Checkout business workflow with two other tags).
Image 8: Viewing the tags for the Payment service and filtering based on business workflows
Image 9: Checkout workflow exemplar traces that go through the Payment service, belong to the Platinum tier, and return a 401 HTTP status code.
With these capabilities, once there’s a problem with a business workflow, SREs can use Tag Spotlight to quickly see which service is causing it and who is impacted, and then involve only the relevant team instead of having everyone waste their time by joining a war room. The developers in charge of the problematic service can then use Tag Spotlight themselves to quickly understand what caused the issue and start troubleshooting it.
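The filter behind Image 9 can be pictured as a conjunction of tag conditions applied to unsampled traces. The following Python sketch assumes a simple dictionary representation; the tag names (`workflow`, `tier`, `http_status`), the sample traces, and the `exemplar_traces` helper are hypothetical and are not Splunk APM's schema or API.

```python
# Hypothetical stored traces, each with the services it touched and its tags.
TRACES = [
    {"id": "a1", "services": {"api", "checkout", "payment"},
     "tags": {"workflow": "Checkout", "tier": "Platinum", "http_status": 401}},
    {"id": "a2", "services": {"api", "checkout", "payment"},
     "tags": {"workflow": "Checkout", "tier": "Gold", "http_status": 401}},
    {"id": "a3", "services": {"api", "catalog"},
     "tags": {"workflow": "Catalog", "tier": "Platinum", "http_status": 200}},
    {"id": "a4", "services": {"api", "checkout", "payment"},
     "tags": {"workflow": "Checkout", "tier": "Platinum", "http_status": 401}},
]


def exemplar_traces(traces, required_tags: dict, via_service: str, limit: int = 3):
    """Return up to `limit` traces that pass through `via_service`
    and match every tag in `required_tags`."""
    matches = [
        t for t in traces
        if via_service in t["services"]
        and all(t["tags"].get(key) == value for key, value in required_tags.items())
    ]
    return matches[:limit]


if __name__ == "__main__":
    wanted = {"workflow": "Checkout", "tier": "Platinum", "http_status": 401}
    for trace in exemplar_traces(TRACES, wanted, via_service="payment"):
        print(trace["id"])
```

Because nothing is sampled away in this model, any tag combination that occurred at least once has at least one matching trace to return.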
More to Come
Business Workflows is a powerful, one-of-a-kind tool that gives business and engineering leaders visibility into how their applications are impacting business performance. It is purpose-built for microservices-based environments, and uses end-to-end traces to create accurate metrics for charting and alerting on business KPIs. Developers and SREs can use Business Workflows to see the radius of impact of different services, and to quickly troubleshoot issues, no matter how complicated the app architecture is.
Business Workflows is available as part of the Splunk APM Enterprise edition and is already enabled for all existing APM Enterprise customers.
We are continuing to improve Business Workflows, so stay tuned for more powerful capabilities in the months ahead!
Visit our website to learn more about how you can bring data to DevOps.