Splunk’s AI Assistant: Top 7 Use Cases for AI-Driven Observability
The recent release of the AI Assistant in Observability Cloud in the US and Europe realms generated an incredible amount of excitement. But with great power comes great responsibility! As with any shiny new object, you might not be sure exactly how to use the AI Assistant properly and maximize its potential.
Look no further: this article is for you! I’m a Splunker on the Growth Engineering Marketing team, and I couldn’t resist trying the AI Assistant myself. I’ve been asking it to help with my day-to-day tasks and have found seven use cases that have tremendously improved my observability practice!
Use Case 1: Instance Health Analysis
A simple yet crucial question that every engineer wants answered: how is my instance doing? Now, in Splunk, you can simply provide the instance name and the AI Assistant will deliver a thorough health analysis. You might even feel like it’s magic (because it really is!). Here’s a simple AI-generated response from one of my instances as an example.
After maintaining this instance for several years, I’ve never had such deep insight into it! Now, I can confidently review all the AI-recommended optimizations and enhance my system's reliability.
Use Case 2: Trace Analysis
Each request to an application generates a trace, identified by a trace ID. In OpenTelemetry, a trace ID is a unique identifier assigned to a specific request or operation monitored by the system. If you’ve received an alert from APM about one of your backend calls, you can grab the trace ID and ask the AI Assistant to perform an independent analysis of the issue. Here’s an example of an AI-generated response based on one of my trace IDs. After reviewing it, you won’t need to say a word: just create a ticket for your development team to address the issue immediately.
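For context, here is a minimal, illustrative Python sketch of what a trace ID looks like. OpenTelemetry follows the W3C Trace Context format: 16 random bytes rendered as 32 lowercase hex characters. This sketch only mimics the format for illustration; it does not create a real trace or talk to any backend.

```python
import secrets

# Per the W3C Trace Context convention used by OpenTelemetry, a trace ID
# is 16 bytes, conventionally rendered as 32 lowercase hex characters.
trace_id = secrets.token_hex(16)
print(trace_id)
```

Whenever you see a 32-character hex string in an APM alert, that is the value you can paste into the AI Assistant.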
Key Observations
- The trace involves multiple HTTP GET and POST requests, both internally within services and externally to services like Mko.
- Operations span fetching tokens, submitting form data, extracting metadata, and more.
- The trace includes interactions with external services, notably Mko, for lead management and form submission.
Error Analysis
- An error was identified in the operation MkoService.createMkoPostRequestFormRequestBody within the www-commons-svc service. The error is related to a java.net.MalformedURLException exception, indicating an issue with forming a URL. Specifically, the error message "no protocol:" suggests that the URL being constructed or used lacks a protocol specification (e.g., "http" or "https").
- This error is marked as the root cause of an issue within the trace, pointing to a potential problem in how URLs are constructed or processed in the MkoService.
Recommendations
- Review the implementation of MkoService.createMkoPostRequestFormRequestBody to ensure URLs are correctly formed, including the protocol part.
- Validate all URLs and URI templates used in the service to prevent similar issues.
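As an illustration of the kind of validation that second recommendation calls for, here is a minimal sketch in Python (the affected service is Java, but the idea carries over): a URL such as `www.example.com/form` has no scheme, which is exactly what produces errors like Java’s `MalformedURLException: no protocol`. The helper name and default scheme below are my own illustrative choices, not code from the service.

```python
from urllib.parse import urlparse

def ensure_protocol(url: str, default_scheme: str = "https") -> str:
    """Return the URL with a scheme, prepending a default when it is missing.

    A URL like "www.example.com/form" parses with an empty scheme, which is
    the condition behind "no protocol"-style errors when building requests.
    """
    if not urlparse(url).scheme:
        return f"{default_scheme}://{url}"
    return url

print(ensure_protocol("www.example.com/form"))    # scheme added
print(ensure_protocol("http://www.example.com"))  # left unchanged
```

Running a check like this on every configured URL and URI template at startup turns a runtime exception into an early, obvious configuration error.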
Use Case 3: Quick Feedback
I’ve always valued feedback on my code, service, or feature: how it performs after going live, whether there are significant bottlenecks, and if it’s meeting expectations. This feedback not only validates the work I’ve done but also highlights areas for improvement, keeping me prepared for necessary changes. And the feedback doesn’t need to come back all at once.
AI assistance shines in this area, offering valuable insights—and sometimes, a boost to your ego! Here’s the feedback I received from AI after a recent implementation to optimize one of the most heavily used service endpoints. It brought a smile to my face.
I asked the APM AI to analyze the production common-service for the past 8 days. It highlighted that the GET /api/bin/careers/joblist endpoint was heavily utilized, with 87,001 requests and no errors, showcasing its stability.
Use Case 4: Make the Case
The other day, I asked my friend “AI”: "Can you analyze the memory utilization of my monolithic application over the last six weeks?" Unsurprisingly, it came back with the observation that there might be a memory leak in the application. The AI Assistant said, "I’m sorry to inform you."
My response? "No need to apologize—I should be grateful!" I conducted a similar analysis six months ago after a production issue and reached the same conclusion. However, my team didn’t believe me, and my findings were overlooked. But now, with AI’s analysis in hand, they can't ignore it. Hopefully, this time the team will take action!
The analysis
Based on the analysis of the memory utilization and total memory over the last 24 hours for the AWS instance with the unique ID i-039e37a2bdd344f04_us-west-1_576561709469, here are the findings:
- Memory Utilization: The memory utilization has been fluctuating over the last 24 hours, starting from approximately 42.45% and gradually increasing to around 63.65%. This indicates a steady increase in memory usage over the period.
- Total Memory: The total memory available on the instance has remained constant at 66,056,302,592 bytes (approximately 61.5 GB).
Considering the steady increase in memory utilization without any significant drops, it suggests that the workload on this instance is gradually consuming more memory.
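As a quick sanity check on the numbers above, the reported utilization swing can be turned into absolute growth. This is a back-of-the-envelope sketch; the constants are copied straight from the AI’s analysis.

```python
TOTAL_MEMORY_BYTES = 66_056_302_592      # ~61.5 GB, constant over the window
START_UTIL, END_UTIL = 0.4245, 0.6365    # 42.45% -> 63.65% over 24 hours

# Convert the utilization change into absolute memory growth.
growth_bytes = TOTAL_MEMORY_BYTES * (END_UTIL - START_UTIL)
growth_gib = growth_bytes / (1024 ** 3)
print(f"Memory growth over the window: {growth_gib:.1f} GiB")
```

Roughly 13 GiB consumed in a day with no drops is the kind of monotonic climb that makes a leak hypothesis hard to dismiss.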
Use Case 5: Predictive Analysis
We often find ourselves needing to assess whether our current system can handle anticipated traffic—whether it’s a new product launch, a Black Friday surge, or providing assurance to leadership about system capacity. This assessment often involves determining if the system is sufficient or if additional nodes are necessary.
With AI Assistant, you can quickly gain insights into your system’s status and make informed predictions about future requirements. A few weeks ago, I faced a similar challenge. Here’s the analysis provided by the AI—I'll let you be the judge.
Looking at last month's service performance of wmd-srv-prod, can it handle an extra 1000 requests per hour?
Analysis of the service performance for "commons-svc" in the "wmd-svc-prod" environment over the past month indicates the following:
- Total Requests: 52,015
- Total Errors: 308
- Error Rate: Approximately 0.59%
The service has handled a significant number of requests with a relatively low error rate. The current request count per hour is approximately 72 requests (52,015 requests over 720 hours in a month).
Can it handle an additional 1000 requests per hour?
Given the current performance metrics, the service appears to be stable with a low error rate. However, handling an additional 1000 requests per hour would significantly increase the load. Here are some considerations:
- Current Load: The service is currently handling about 72 requests per hour.
- Proposed Load: An additional 1000 requests per hour would increase the load to 1072 requests per hour.
Recommendations:
- Load Testing: Conduct load testing to simulate the additional 1000 requests per hour to observe how the service performs under increased load.
- Resource Allocation: Ensure that the infrastructure supporting the service has sufficient resources (CPU, memory, etc.) to handle the increased load.
- Monitoring: Continuously monitor the service's performance metrics, especially error rates and latency, to ensure that the service remains stable under the new load.
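The arithmetic behind the AI’s assessment is easy to reproduce yourself. Here is a small sketch using the figures quoted above:

```python
TOTAL_REQUESTS = 52_015
TOTAL_ERRORS = 308
HOURS_IN_MONTH = 720  # 30 days

error_rate = TOTAL_ERRORS / TOTAL_REQUESTS     # ~0.59%
current_rph = TOTAL_REQUESTS / HOURS_IN_MONTH  # ~72 requests/hour
proposed_rph = current_rph + 1000              # ~1072 requests/hour

print(f"error rate: {error_rate:.2%}")
print(f"load: {current_rph:.0f} -> {proposed_rph:.0f} requests/hour")
```

A roughly 15x jump in hourly load is exactly why the AI hedges and recommends load testing rather than extrapolating from a low baseline error rate.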
Use Case 6: Unknown Unknowns
If you’re familiar with observability, you know exactly what I mean by "Unknown Unknowns." With AI Assistant, you’ll encounter many surprising discoveries.
For example:
- "Oh, I didn’t realize this service was using MongoDB to store data—I thought it was something else. We should include this instance in our migration plan."
- "Wait, this service is calling the token API four times! We could make it faster and more reliable by reducing that to just one call, right?"
You'll uncover numerous such "Unknown Unknowns" once you start engaging with the AI. The importance of this process can’t be overstated when it comes to making your application more reliable and performant. Ultimately, this means a better experience for your end users; it can also translate into real cost savings (e.g., decommissioning ghost servers).
Use Case 7: Onboarding New Team Members
Our system is quite complex (and honestly, which isn't these days?). Bringing a new team member up to speed is a challenge on its own. You provide the code, share documentation links, Zoom recordings, and even give one-on-one demos. Yet, don’t be surprised when follow-up questions come your way, leaving you thinking, "How many times do I need to repeat this?" You’re swamped with sprint work and wish they could just Google it—but they can’t, because it’s our internal application.
This is where the AI Assistant in Splunk Observability Cloud steps in to save the day! It explains the system, breaks down the application, outlines request/response workflows, and maps out the entire transaction flow. As a result, not only does the new person get to know the system inside out, but they might even bring up some interesting insights in meetings—stuff no one else knew, except for AI!
Thanks to Splunk’s AI Assistant, I can stay focused on my tasks while my teammate gains enough knowledge to become productive quickly.
Here is an example of how AI can kick-start your learning:
The dashboard link provided by the AI offers an in-depth view of the service, as seen below.
We can get details about each specific endpoint.
Then, we can ask the AI to explain a trace ID to understand the application further.
Conclusion
I recently started using Splunk’s AI Assistant in Observability Cloud, and I’m blown away by what it brings to observability. I’m excited to share my experience with all of you! I hope the common use cases above help you and your team leverage GenAI to deliver an outstanding customer experience.
Why wait? Let’s dive in! If you're already on Splunk Observability Cloud, you can explore the use cases above for yourself. If not, you can learn more and sign up for a free trial here. You can also check out these AI Assistant resources.
Download the AI eBook here and I’ll see you on the other side!