October 21, 2021

Announcing the Preview of Splunk APM’s AlwaysOn Profiling

By Mat Ball

For application developers and service owners who build and troubleshoot modern enterprise software, resolving production issues requires identifying poor performance across multiple networks, operating systems, servers, configs, and third-party dependencies. When the problem is the code itself, code profiling helps identify service bottlenecks by periodically taking CPU snapshots, or call stacks, from a runtime environment. Information from call stacks provides additional context for slow spans from transaction traces, and flame graphs built from those call stacks help visualize bottlenecks and show service performance over time. These benefits speak for themselves, but most other code profiling products incur notable performance overhead, so engineers have to switch them on and off manually, creating a tradeoff between application performance and available data.

We’re proud to announce the Beta of AlwaysOn Profiling, part of Splunk APM. Available initially for Java-based applications, AlwaysOn provides continuous visibility of code-level performance, linked with unsampled trace data, with minimal overhead. Along with Splunk Synthetic Monitoring, Splunk RUM, Splunk Infrastructure Monitoring, Splunk Log Observer, and Splunk On-Call, AlwaysOn Profiling gives engineers more context to identify performance issues and troubleshoot faster across production environments.

Troubleshooting Code Bottlenecks with AlwaysOn Profiling

Splunk APM’s AlwaysOn Profiler continuously monitors code performance to give you immediate context on where performance bottlenecks exist. Here are two examples of how AlwaysOn can help identify production issues:

Workflow One: Viewing Common Code in Your Slowest Traces
Engineers troubleshooting production issues often sort through example traces looking for common attributes in their slowest spans. AlwaysOn’s call stacks are linked to trace data, providing context into which code is executed during each trace.

Within APM you can easily view latency within your production environment.

By clicking into any service you’re taken to the service map, which provides additional context on bottlenecks within that service and its dependencies.

From here, we can explore example traces.

Note: We set the “min” duration filter to 10,000 ms, or ten seconds, to focus specifically on the slowest traces. We see that requests to /stats/races/fastest repeatedly take 40 seconds or more to respond.

By clicking into one of these long traces, the following screen opens:

We see that while the StatsController.fastestRace operation was being executed, we collected 36 call stacks. Because the Java agent collects call stacks continuously, longer spans have more call stacks. When we open this span, we see the metadata on the left and the call stacks that the agent collected on the right. We can use the “Previous” and “Next” buttons to flip through all call stacks:

If you see several consecutive call stacks pointing to the same line of code, it indicates that this line takes a long time to execute, or executes many times in a row. This is often a solid hint at a performance bottleneck.
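To make the "consecutive call stacks" heuristic concrete, here is a minimal sketch in Java. It is not part of Splunk APM or its agent; the class name, the frame names other than StatsController.fastestRace, and all line numbers are hypothetical. It assumes each call stack is a list of frames with the innermost frame first, and reports the frame that sat on top of the longest run of consecutive call stacks.

```java
import java.util.List;

public class ConsecutiveFrames {

    /** The frame that topped the longest run of consecutive call stacks, and the run length. */
    static final class Run {
        final String frame;
        final int length;

        Run(String frame, int length) {
            this.frame = frame;
            this.length = length;
        }
    }

    /**
     * Each call stack lists frames innermost-first, so element 0 is the
     * code that was executing when the stack was captured.
     */
    static Run longestRun(List<List<String>> callStacks) {
        String bestFrame = null;
        int bestLength = 0;
        String currentFrame = null;
        int currentLength = 0;

        for (List<String> stack : callStacks) {
            if (stack.isEmpty()) {
                // An empty stack breaks any run in progress.
                currentFrame = null;
                currentLength = 0;
                continue;
            }
            String top = stack.get(0);
            if (top.equals(currentFrame)) {
                currentLength++;
            } else {
                currentFrame = top;
                currentLength = 1;
            }
            if (currentLength > bestLength) {
                bestFrame = currentFrame;
                bestLength = currentLength;
            }
        }
        return new Run(bestFrame, bestLength);
    }

    public static void main(String[] args) {
        // Hypothetical samples for one span; frame names and line numbers are made up.
        List<List<String>> callStacks = List.of(
            List.of("RaceRepository.load:88", "StatsController.fastestRace:31"),
            List.of("RaceSorter.compare:17", "StatsController.fastestRace:34"),
            List.of("RaceSorter.compare:17", "StatsController.fastestRace:34"),
            List.of("RaceSorter.compare:17", "StatsController.fastestRace:34"),
            List.of("JsonWriter.write:52", "StatsController.fastestRace:40"));

        Run run = longestRun(callStacks);
        System.out.println(run.frame + " topped " + run.length + " consecutive call stacks");
    }
}
```

With the sample data above, the longest run belongs to the hypothetical RaceSorter.compare:17 frame, which tops three call stacks in a row. That is the kind of line worth inspecting first.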

Workflow Two: Viewing Aggregate Performance of Services Over Time
Before you begin optimizing code, it’s always helpful to understand which part of your source code impacts performance the most. How do you know which part is the biggest bottleneck? This is where aggregation of collected call stacks, in the form of flamegraphs, helps.

When viewing your service map, notice the code profiling section in the right-side panel. It automatically shows the top five frames from the call stacks we’ve collected for your selected time range, which already point to likely bottlenecks in your code.

By clicking into the feature, you’re taken to a flame graph, a visual aggregation of the call stacks collected during the time range you’ve specified. The wider the horizontal bar, the more frequently that line of code appears in the collected call stacks.
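The aggregation behind a flame graph can be sketched in a few lines of Java. This is an illustration of the general technique under stated assumptions, not Splunk's implementation: the class name, the frame names other than StatsController.fastestRace, and the text rendering are all hypothetical. Call stacks are assumed to list frames innermost-first. The sketch merges them into a tree and counts how many call stacks pass through each frame. That count is what determines the width of a bar.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FlameGraphSketch {

    /** One bar in the flame graph: a frame plus the number of call stacks that include it here. */
    static final class Node {
        final String frame;
        int samples;
        final Map<String, Node> children = new LinkedHashMap<>();

        Node(String frame) {
            this.frame = frame;
        }
    }

    /** Merges call stacks (innermost frame first) into a tree rooted at a synthetic "all" node. */
    static Node aggregate(List<List<String>> callStacks) {
        Node root = new Node("all");
        for (List<String> stack : callStacks) {
            root.samples++;
            Node current = root;
            // Walk from the outermost caller down to the innermost frame.
            for (int i = stack.size() - 1; i >= 0; i--) {
                current = current.children.computeIfAbsent(stack.get(i), Node::new);
                current.samples++;
            }
        }
        return root;
    }

    /** Renders the tree as indented text; the percentage stands in for bar width. */
    static void render(Node node, int depth, int total, StringBuilder out) {
        int percent = total == 0 ? 0 : 100 * node.samples / total;
        out.append("  ".repeat(depth))
           .append(node.frame).append(" ")
           .append(node.samples).append(" (").append(percent).append("%)\n");
        for (Node child : node.children.values()) {
            render(child, depth + 1, total, out);
        }
    }

    public static void main(String[] args) {
        // Hypothetical call stacks; frame names are made up for illustration.
        List<List<String>> callStacks = List.of(
            List.of("RaceSorter.compare", "StatsController.fastestRace", "Thread.run"),
            List.of("RaceSorter.compare", "StatsController.fastestRace", "Thread.run"),
            List.of("RaceRepository.load", "StatsController.fastestRace", "Thread.run"),
            List.of("HealthCheck.ping", "Thread.run"));

        Node root = aggregate(callStacks);
        StringBuilder out = new StringBuilder();
        render(root, 0, root.samples, out);
        System.out.print(out);
    }
}
```

In this sample data, StatsController.fastestRace appears in three of the four call stacks, so its bar would span 75% of the graph, and RaceSorter.compare beneath it would span 50%.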

Upon viewing the flame graph, focus on larger top-down “pillars”, which indicate lines of code that use the CPU the most. If you want to highlight your own code classes in the flame graph, use the filter in the top left.

Each horizontal bar of the flame graph shows the class names and line numbers of your code. Flame graphs point you to the bottleneck causing the slowness, and the final step in troubleshooting is returning to your source code itself to fix the problem.

Code Profiling within Splunk’s Observability Solutions

Unlike dedicated code profiling solutions, Splunk’s AlwaysOn Profiler links collected call stacks to the spans that are being executed at the time of call stack collection. This helps separate data about background threads from data about the active threads that service incoming requests, greatly reducing the amount of time engineers need to analyze profiling data.
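As a rough illustration of what linking call stacks to spans means, the Java sketch below matches each call stack to a span by thread and timestamp. This post does not describe how the agent does the linking internally, so treat the matching rule as an assumption made for illustration; the Span and CallStack classes, their fields, and all sample values are hypothetical.

```java
import java.util.List;

public class SpanLinkingSketch {

    /** A span as seen by this sketch: which thread ran it, and when. */
    static final class Span {
        final String spanId;
        final long threadId;
        final long startMillis;
        final long endMillis;

        Span(String spanId, long threadId, long startMillis, long endMillis) {
            this.spanId = spanId;
            this.threadId = threadId;
            this.startMillis = startMillis;
            this.endMillis = endMillis;
        }
    }

    /** A captured call stack, reduced to the thread, capture time, and top frame. */
    static final class CallStack {
        final long threadId;
        final long capturedAtMillis;
        final String topFrame;

        CallStack(long threadId, long capturedAtMillis, String topFrame) {
            this.threadId = threadId;
            this.capturedAtMillis = capturedAtMillis;
            this.topFrame = topFrame;
        }
    }

    /**
     * Returns the id of the span that was running on the call stack's thread
     * when the stack was captured, or null for background work with no span.
     */
    static String activeSpanId(CallStack stack, List<Span> spans) {
        for (Span span : spans) {
            boolean sameThread = span.threadId == stack.threadId;
            boolean duringSpan = stack.capturedAtMillis >= span.startMillis
                    && stack.capturedAtMillis <= span.endMillis;
            if (sameThread && duringSpan) {
                return span.spanId;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical data: one request span on thread 12, plus a background thread 3.
        List<Span> spans = List.of(new Span("span-a1", 12, 1_000, 41_000));
        List<CallStack> stacks = List.of(
            new CallStack(12, 5_000, "RaceSorter.compare:17"),
            new CallStack(3, 5_000, "GcLogger.flush:9"),
            new CallStack(12, 45_000, "ThreadPool.awaitWork:120"));

        for (CallStack stack : stacks) {
            String spanId = activeSpanId(stack, spans);
            String label = spanId == null ? "background (no span)" : spanId;
            System.out.println(stack.topFrame + " -> " + label);
        }
    }
}
```

Only the first call stack falls inside the request span, so it is the only one an engineer would need to read when investigating that slow request. The other two are background activity and can be set aside.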

Additionally, with Splunk’s AlwaysOn Profiler, all data collection is automatic and low-overhead. Instead of having to switch the profiler on during production incidents, users only need to deploy the Splunk-flavored OpenTelemetry agent, which then continuously collects data in the background.

Try It Today!

With AlwaysOn Profiling, teams using Splunk APM can now analyze and improve both the intra-service performance of code-heavy monoliths and the inter-service performance of microservice-based architectures, to troubleshoot bottlenecks and optimize service performance at any stage of cloud migration.

Sign up for the preview to get started today.
