One of the most challenging and rewarding things I do as a Principal Software Engineer in our Splunk Mobile division is ensuring our customers’ experience meets the quality and standards we promise to keep.
My team and I are part of an on-call rotation committed to measuring and optimizing key Service Level Indicators (SLIs) using Splunk Real User Monitoring (RUM) and the Splunk On-Call (iOS & Android) mobile apps. Two such SLIs that keep us up at night are App Startup Time and interactive time (what we call “Time to Ready”). These metrics measure how long it takes for our apps to be fully functional and ready for user interaction. For more information on how to measure and monitor these metrics, see our blog post, “Deep Dive into the App Start Experience”.
To keep track of our results and progress, we created charts and alerts that page an On-Call Engineer whenever our SLIs don’t meet our standards.
Our On-Call engineer is paged when the p75 of App Startup Time or Time to Ready exceeds 5 seconds. When paged, we break down the RUM metrics by platform, app version, and OS version to identify whether new code is impacting performance. In addition, we have detailed information on the Session Details page for each instance of a longer-than-expected App Startup Time or Time to Ready. With every page we receive, we either incrementally improve our SLIs or add more custom events to gain a deeper understanding of the “Ready” sequence. We conduct post-incident review meetings to discuss each page and the actions taken to improve App Startup Time and Time to Ready.
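The paging rule above can be sketched in code. This is an illustrative Python sketch, not Splunk detector syntax: compute the 75th percentile of an SLI’s recent durations and page when it breaches the 5-second threshold.

```python
# Illustrative sketch of our paging rule (not Splunk's detector language):
# page the On-Call engineer when p75 of an SLI exceeds the 5-second SLO.

SLO_THRESHOLD_MS = 5_000

def p75(durations_ms):
    """Return the 75th percentile of a list of durations (nearest-rank method)."""
    ordered = sorted(durations_ms)
    idx = max(0, int(0.75 * len(ordered) + 0.5) - 1)
    return ordered[idx]

def should_page(durations_ms):
    """True when the p75 of the observed durations breaches the SLO."""
    return p75(durations_ms) > SLO_THRESHOLD_MS
```

In production this evaluation runs continuously over a rolling window of RUM data; the sketch only captures the threshold logic.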
The 3:16 am Page
Soon after we signed one of our biggest customers to date, our On-Call engineers were paged, and woken in the middle of the night, up to twice a week. Looking at the data in Splunk RUM, we learned that the “o11y_fetch_and_store_dashboards” time was extremely long for this new customer.
Waiting on the Backend
During the incident, we identified that the API response time to retrieve user preferences was extremely high (5.8 seconds). This API call was expected to appear multiple times for large data sets (large result sets were expected to be paginated), yet we saw only one call.
Connecting the Splunk RUM trace back to Splunk APM, we found that the primary query was suboptimal and did not paginate the results as we had expected. After shipping a hotfix for the impacted API, we reduced Time to Ready by 10% immediately.
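The shape of the fix can be sketched as follows. This is a hedged Python sketch; `fetch_page` and `page_size` are hypothetical names, and the real API differs. The point is simply that the caller iterates over bounded pages instead of receiving one unbounded result set.

```python
# Hedged sketch of a paginated fetch loop; fetch_page/page_size are
# hypothetical stand-ins for the real user-preferences API.

def fetch_all_preferences(fetch_page, page_size=100):
    """Accumulate results page by page; a short (or empty) page signals the end."""
    results, offset = [], 0
    while True:
        page = fetch_page(offset=offset, limit=page_size)
        results.extend(page)
        if len(page) < page_size:  # last page reached
            return results
        offset += page_size
```

Bounding each response keeps individual API calls fast even for our largest customers, at the cost of a few extra round trips.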
The metrics we collected also pointed us to code that looped through a contact list to extract fields. The loop accounted for over 50% of the overall Time to Ready duration. As it turned out, the code looped through the contact list three times to extract key fields, leading to a long “o11y_dashboard_list_favorite_load_time”.
The suboptimal code went unnoticed in our QA environments because iterations over short contact lists were fast; for our largest customers, however, the problem was severe. By optimizing the code to loop through the contact list only once, we cut up to 5 seconds for our large customers in the subsequent release, improving Time to Ready by 33% and dropping our after-hours pages to zero.
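The optimization is a classic single-pass refactor. Here is a minimal Python illustration, with hypothetical field names, of collapsing three passes over the contact list into one:

```python
# Before: three separate passes over the contact list (field names hypothetical).
def extract_fields_three_passes(contacts):
    names = [c["name"] for c in contacts]    # pass 1
    emails = [c["email"] for c in contacts]  # pass 2
    phones = [c["phone"] for c in contacts]  # pass 3
    return names, emails, phones

# After: one pass that extracts all three fields per contact.
def extract_fields_one_pass(contacts):
    names, emails, phones = [], [], []
    for c in contacts:
        names.append(c["name"])
        emails.append(c["email"])
        phones.append(c["phone"])
    return names, emails, phones
```

Both functions return identical results; the single-pass version simply touches each contact once, which matters when the list holds tens of thousands of entries.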
Real-time Observability + Mobile On-Call = Happier Customers
After the new release, we saw an uptick in engagement from the new customer, which helped us retain them long-term. Our on-call rotations drive ownership of our SLIs within our team of mobile developers, and adding Splunk RUM to our mobile observability stack has made it easier than ever to improve them. Getting started is easy for iOS and Android — sign up for a free trial here.
Read our blog, “Deep Dive into the App Start Experience”, to learn more about App Startup Service Level Indicators (SLIs).
This blog was co-authored by Seerut Sidhu, Sr. Product Manager @Splunk.