When monitoring application performance or troubleshooting an issue in production, context is key. The more information you have available, the faster you can prevent or detect a user-impacting issue. Observability tools offer many features, such as code profiling, to help contextualize your data. In this post, I’ll discuss what code profiling is and show an example of how it works.
Code profiling provides engineers with code-level visibility into resource bottlenecks to help troubleshoot service performance issues. Engineers can continuously measure how their code affects CPU and memory usage and where it slows service performance. Before delving further into code profiling, let's define a few terms:
- Call Stack - A call stack is the list of functions/methods that are currently active in your application, ordered from the most recent call back to the entry point.
- Stack Trace - A stack trace is a snapshot of a specific call within the call stack at a certain point in time that contains information like service name, class name, method/function name, and code line number. Performance data like timing and memory usage can also be collected as part of these traces.
- Trace - A trace is a group of calls that represents a unique transaction. It is generally used to follow a customer transaction across a collection of services.
- Span - A span is one of the calls within the trace.
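To make the first two terms concrete, here is a minimal Java sketch that captures a stack trace of the current thread's call stack using the standard `Thread.getStackTrace()` method. The class and method names (`StackTraceDemo`, `outer`, `inner`) are hypothetical and exist only for illustration.

```java
class StackTraceDemo {
    // Captures a snapshot (stack trace) of the current thread's call stack.
    // Frames are ordered from the most recent call back to the entry point.
    static StackTraceElement[] inner() {
        return Thread.currentThread().getStackTrace();
    }

    // Exists only so that the call stack has more than one application frame.
    static StackTraceElement[] outer() {
        return inner();
    }

    public static void main(String[] args) {
        for (StackTraceElement frame : outer()) {
            // Each frame carries the class name, method name, file name, and line number.
            System.out.println(frame.getClassName() + "." + frame.getMethodName()
                    + " (" + frame.getFileName() + ":" + frame.getLineNumber() + ")");
        }
    }
}
```

Each printed line corresponds to one frame in the call stack at the moment the snapshot was taken, which is the same kind of information a profiler records.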
Code profiling collects call stacks from production environments: an agent sends periodic snapshots of call stacks via a collector to the APM backend. The APM solution then visualizes code performance through charts or flame graphs, helping engineers understand whether, and where, their code is performing poorly.
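The idea of periodic snapshots can be sketched in a few lines of Java. This is not how any particular APM agent is implemented; it is a simplified illustration that uses the standard `Thread.getAllStackTraces()` method and keeps the snapshots in memory, where a real agent would forward them to a collector. The `StackSampler` class, its `sample` method, and the chosen interval are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class StackSampler {
    // Takes up to `samples` snapshots of every live thread's call stack,
    // waiting `intervalMillis` between snapshots.
    static List<Map<Thread, StackTraceElement[]>> sample(int samples, long intervalMillis) {
        List<Map<Thread, StackTraceElement[]>> snapshots = new ArrayList<>();
        for (int i = 0; i < samples; i++) {
            snapshots.add(Thread.getAllStackTraces());
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop sampling early.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return snapshots;
    }

    public static void main(String[] args) {
        // Hypothetical settings: 5 snapshots, 100 ms apart.
        List<Map<Thread, StackTraceElement[]>> snapshots = sample(5, 100);
        System.out.println("Collected " + snapshots.size() + " snapshots");
        System.out.println("Threads in first snapshot: " + snapshots.get(0).size());
    }
}
```

Because the profiler only looks at the application at intervals, a method that appears in many consecutive snapshots is a strong hint that the code spends a lot of time there.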
Code Profiling in Practice
Splunk APM offers code profiling capabilities via the AlwaysOn Profiling® tool. This year at AWS re:Invent, attendees were able to participate in an AWS GameDay session that walked them through identifying and resolving a performance-impacting code issue using Splunk APM and AlwaysOn Profiling.
GameDay teams were presented with a Java web app that had a hidden long-running call. By implementing Splunk APM and configuring the Splunk AlwaysOn Profiling tool, teams were able to trace the code issue down to the specific file and line number within that file.
During the session, teams quickly set up Splunk APM by following the guided walkthrough available within the Splunk Observability Cloud UI. Once they had instrumented their application to send data to Splunk APM using OpenTelemetry, teams were able to view application metrics, service maps, and business workflows within minutes.
For additional troubleshooting information, teams implemented AlwaysOn Profiling with a simple update to the command line used to launch their application. After AlwaysOn Profiling was enabled, call stacks for the application services were available within Splunk APM.
Teams were then able to troubleshoot and identify the long-running call impacting application performance. AlwaysOn Profiling displayed the call stack information to the teams in both table and flame graph format. The call stack provided information about the individual methods and calls made by the code within the selected service flow.
In the table on the left, we can see the name of the method executed, the amount of time spent executing that call, and how many times that call shows up in the call stack. Right away, teams were able to see that there was a long-running sleep call. By selecting and expanding the sleep call, we can see that one of the traces is taking significantly longer than the others.
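A table like this can be approximated by counting how often each method appears across the sampled call stacks. The sketch below is an illustration of that general technique, not a description of how Splunk APM computes its figures. The `ProfileTable` class, the sample data, the method names other than `DoorChecker.precheck`, and the sampling interval are all hypothetical, and the time estimate (count multiplied by interval) is a common sampling-profiler approximation assumed here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ProfileTable {
    // Counts how many sampled stacks contain each method at least once.
    static Map<String, Integer> countByMethod(List<List<String>> sampledStacks) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> stack : sampledStacks) {
            // distinct() avoids double counting a recursive method within one sample.
            stack.stream().distinct().forEach(method -> counts.merge(method, 1, Integer::sum));
        }
        return counts;
    }

    public static void main(String[] args) {
        long intervalMillis = 1_000; // assumed time between snapshots

        // Hypothetical samples: each inner list is one call stack, most recent call first.
        List<List<String>> samples = List.of(
                List.of("Thread.sleep", "DoorChecker.precheck", "Main.handle"),
                List.of("Thread.sleep", "DoorChecker.precheck", "Main.handle"),
                List.of("Parser.parse", "Main.handle"));

        // A method seen in N samples is estimated to have been active for about N * interval.
        countByMethod(samples).forEach((method, count) ->
                System.out.println(method + "  count=" + count
                        + "  approx " + (count * intervalMillis) + " ms"));
    }
}
```

Methods with a high count relative to the total number of samples, such as the sleep call in the GameDay exercise, are the first candidates to inspect.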
Once the longest-running sleep call was selected, teams could see the stack trace for this call displayed on the right. The stack trace showed that the long-running call was occurring in the precheck method, on line 36 of DoorChecker.java. After identifying the issue, teams were able to change the code to update the application and fix the long-running call.
Using Splunk APM with AlwaysOn Profiling enabled helped teams quickly identify, locate, and resolve the issue. Application issues are not always as simple as system downtime; slow performance can also lead to unhappy customers. The performance and code-level insight gained from using APM and code profiling tools can lead to faster issue resolution and optimized application code.
AlwaysOn Profiling capabilities are currently available for Java, .NET and Node.js applications, with more languages coming soon. Automatic instrumentation without profiling support is also available for Go and several other languages. To experience the speed and power of code profiling for yourself, start a Splunk Observability Cloud free trial.