
Splunk APM Expands Code Profiling Capabilities with Several New GAs

We’re excited to share new Splunk capabilities that help you measure how code performance impacts your services. Splunk APM’s AlwaysOn Profiling now supports CPU profiling for .NET and Node.js applications, and memory profiling for Java applications. AlwaysOn Profiling gives app developers and service owners code-level visibility into resource bottlenecks by continuously profiling service performance with minimal overhead. Whether they’re refactoring monolithic applications or adopting microservices and cloud-native technologies, engineers can easily identify CPU and memory bottlenecks in .NET, Node.js, and Java applications, linked in context with their trace data. Along with Splunk Synthetic Monitoring, Splunk RUM, Splunk Infrastructure Monitoring, Splunk Log Observer, and Splunk On-Call, AlwaysOn Profiling helps teams detect, troubleshoot, and isolate problems faster.

For background, our initial GA blog post provided a thorough walkthrough of CPU troubleshooting. Here’s an overview of how to identify problems and troubleshoot memory bottlenecks. For more detail, please see our docs on memory profiling, or view this detailed video walkthrough.

Workflow: Slow Response Time From High Memory Allocation

Service owners notified about slowness log in to Splunk APM and see that service latency metrics indicate a spike in response time after a new deployment. We know that potential bottlenecks can take many forms, including:

  • Code issues (e.g. slow algorithms or slow I/O calls)
  • Runtime issues (e.g. Garbage Collection overhead)
  • Configuration issues (e.g. DB connection pool being too small)
  • Operating system level issues (e.g. “noisy neighbors” consuming CPU, or thread starvation)

As code issues are common causes of slowness, we analyze code performance first. We look at a few example slow threads and their CPU profiling data within our traces, and examine bottlenecks across our services (as detailed in our initial GA blog post).

If we can’t find any code bottlenecks, our next step is to examine the runtime metrics. Splunk APM’s runtime dashboard displays a wide range of runtime-specific metrics, all gathered automatically by our runtime instrumentation agent.

Looking at these charts, we see our JVM metrics and notice abnormalities in garbage collection.

When we examine the “GC activity” chart, we see that in every one-minute period, the garbage collection process takes upwards of 20 seconds. This could indicate that our JVM is spending too much time doing garbage collection instead of servicing incoming requests.
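
As a quick cross-check outside the dashboard, the JVM exposes the same raw numbers through JMX. The minimal, hypothetical sketch below (the class name and the one-minute window are our own choices, not anything Splunk ships) prints the share of each interval spent in garbage collection; 20 seconds of GC in a 60-second window works out to roughly 33% overhead.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Minimal sketch: periodically sample cumulative GC time via JMX and report
    // the share of each interval spent in garbage collection. The class name and
    // the one-minute window are our own choices, not part of Splunk APM.
    public class GcOverheadSampler {
        public static void main(String[] args) throws InterruptedException {
            long lastGc = totalGcMillis();
            long lastWall = System.currentTimeMillis();
            while (true) {
                Thread.sleep(60_000);  // one-minute window, like the "GC activity" chart
                long gc = totalGcMillis();
                long wall = System.currentTimeMillis();
                double overheadPct = 100.0 * (gc - lastGc) / (wall - lastWall);
                System.out.printf("GC overhead over last minute: %.1f%%%n", overheadPct);
                lastGc = gc;
                lastWall = wall;
            }
        }

        // Sum of collection time (ms) across all collectors since JVM start.
        private static long totalGcMillis() {
            long total = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                long t = gc.getCollectionTime();  // -1 if the collector doesn't report time
                if (t > 0) total += t;
            }
            return total;
        }
    }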


Looking at CPU usage confirms our suspicions. The JVM incurs significant overhead from garbage collection (20 to 40% of CPU resources), leaving only 60 to 80% of CPU for actual work (i.e., serving incoming requests).

The most common cause of excessive garbage collection activity is code allocating too much memory, or creating too many objects that garbage collection must account for. To visualize code bottlenecks we use flame graphs (for more details, see using flame graphs). The flame graph below visually aggregates the 457k+ call stacks captured from our JVM while our code was allocating memory. The widths of the stack frames, represented as bars on the chart, tell us proportionally how much memory each stack frame allocated.
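
To make the idea concrete, here’s an entirely hypothetical example of the kind of frame that ends up wide in an allocation flame graph: a handler that builds its response by concatenating strings in a loop, allocating a new String (and backing array) on every iteration.

    // Hypothetical allocation hotspot: each += allocates a new String and a new
    // backing array, so this frame would show up wide in the allocation flame graph.
    public class ReportBuilder {
        public String buildReport(java.util.List<String> rows) {
            String report = "";
            for (String row : rows) {
                report += row + "\n";  // new String (plus a temporary StringBuilder) per iteration
            }
            return report;
        }
    }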

The lower part of the flame graph points to our first-party code, meaning we likely have the option to re-engineer our code to allocate less memory.

From here we can either switch to our IDE and manually navigate to the class and method indicated by the stack frame, or click a specific frame to display its details under the flame graph.

Clicking “Copy Stack Trace” places the entire stack trace containing the frame onto our clipboard. We can then switch to our IDE, paste it into the “Analyze Stack Trace” / “Java Stack Trace Console” or similar dialog window, and the IDE will point us to the exact line in the right file.
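
Sticking with the hypothetical ReportBuilder example above, the copied trace is an ordinary Java stack trace (the exact formatting Splunk produces may differ), and a frame list in this shape is all those IDE dialogs need to jump to the allocating line:

    // Hypothetical frames tying back to the ReportBuilder sketch above; real
    // output will name your own classes and line numbers.
    at com.example.reports.ReportBuilder.buildReport(ReportBuilder.java:7)
    at com.example.reports.ReportService.renderDailyReport(ReportService.java:42)
    at com.example.api.ReportController.getReport(ReportController.java:31)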

After fixing and redeploying the source code, we check the flame graph to confirm that those stack traces no longer dominate memory allocation, and verify that garbage collection overhead (and service response time) has decreased.
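
For illustration, a fix to the hypothetical ReportBuilder hotspot above might look like the following: a single pre-sized StringBuilder replaces the per-iteration String allocations, which is the kind of change that shrinks the offending frame in the allocation flame graph.

    // Hypothetical fix for the sketch above: one pre-sized StringBuilder instead
    // of a new String per iteration, cutting allocations (and GC work) sharply.
    public class ReportBuilder {
        public String buildReport(java.util.List<String> rows) {
            StringBuilder report = new StringBuilder(rows.size() * 64);  // rough capacity guess
            for (String row : rows) {
                report.append(row).append('\n');
            }
            return report.toString();
        }
    }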

Why Splunk for Code Profiling?

Unlike dedicated code profiling solutions, Splunk’s AlwaysOn Profiler links collected call stacks to the spans being executed at the time of collection. This helps separate background-thread data from the active threads that service incoming requests, greatly reducing the time engineers spend analyzing profiling data.

Additionally, with Splunk’s AlwaysOn Profiler, all data collection is automatic and low-overhead. Instead of having to switch a profiler on during production incidents, users only need to deploy the Splunk-flavored OpenTelemetry agent, and it continuously collects data in the background.
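
For a JVM service, enabling this amounts to attaching the agent and setting two flags, roughly as sketched below. The property names are as we recall them from the splunk-otel-java configuration docs; treat them as assumptions and confirm the exact names in the AlwaysOn Profiling documentation linked below.

    # Minimal sketch of enabling AlwaysOn Profiling (CPU + memory) on a Java service.
    # Property names are assumptions based on our reading of the splunk-otel-java docs.
    java -javaagent:./splunk-otel-javaagent.jar \
         -Dsplunk.profiler.enabled=true \
         -Dsplunk.profiler.memory.enabled=true \
         -jar my-service.jar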

For more detailed instructions, see our AlwaysOn Profiling documentation.

Not an APM user? Sign up for a trial today.

Posted by

Mat Ball

Mat Ball leads marketing for Splunk's Digital Experience Monitoring (DEM) products, with the goal of educating digital teams on web performance optimization, specifically the art and science of measuring and improving end-user experience across web and mobile. He's worked in web performance since 2013, and previously led product marketing for New Relic's DEM suite.
