Our customers rely on Splunk’s mobile apps when they are on-call and troubleshooting in high-stress situations. Splunk’s customer base includes 96 of the Fortune 100, many of whom rely directly on Splunk’s mobile apps to help them resolve outages or large-scale performance problems. They therefore need a consistently reliable experience with our products and services.
My team and I work on two mobile apps at Splunk:
1. Splunk On-call (iOS & Android): an app that is used to receive and react to pages.
2. Splunk Observability (iOS & Android): an app that is used to diagnose or triage issues on-the-move.
When our customers are using our mobile apps, every second counts: even the slightest delay in app startup has a negative impact on their MTTA (Mean Time To Acknowledgement) and hence their MTTR (Mean Time To Resolution). Our customers can lose up to about $100 for each additional second it takes to acknowledge, triage and resolve production incidents using our web and mobile apps.
Since a fast app start experience is such a critical part of our user experience, we monitor and measure key checkpoints and scenarios using Splunk Real User Monitoring (RUM) for iOS and Android. We use three measurements or Service Level Indicators (SLIs) to determine how good or bad the app start experience is in production:
1. App Startup Time
2. Time to Ready
3. Login Failures
App Startup Time
We use the benchmarks recommended by Android Vitals and apply the same benchmarks to iOS. These startup times measure how long it takes, from the moment the app is launched, for the first frames to appear on the screen. We use the Splunk® RUM auto-instrumentation to measure cold, warm and hot startup times in our apps.
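As a rough illustration of how such benchmarks can be turned into an SLI, the sketch below classifies startup samples against the thresholds that Android vitals documents as excessive (5 seconds cold, 2 seconds warm, 1.5 seconds hot). The function names and the sample format are assumptions made for this example, and Python is used only as a neutral way to show the logic; none of this is the code running in our apps.

```python
# Thresholds (in seconds) at which Android vitals treats a startup as excessive.
# Using the same values for iOS is a choice, not a platform rule.
EXCESSIVE_STARTUP_SECONDS = {"cold": 5.0, "warm": 2.0, "hot": 1.5}


def is_excessive_startup(kind: str, duration_seconds: float) -> bool:
    """Return True if a startup of this kind meets or exceeds its threshold."""
    return duration_seconds >= EXCESSIVE_STARTUP_SECONDS[kind]


def excessive_startup_rate(samples: list[tuple[str, float]]) -> float:
    """Fraction of (kind, duration_seconds) samples that are excessive."""
    if not samples:
        return 0.0
    slow = sum(1 for kind, secs in samples if is_excessive_startup(kind, secs))
    return slow / len(samples)
```

Tracking the rate per startup kind, rather than a single average, keeps a handful of slow cold starts from being hidden by many fast hot starts.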
Time to Ready
While app start as reported by the Operating System (OS) is important, from a user-perception point-of-view, the app hasn’t fully started till they see their data loaded in it. It takes additional time for the app to be fully interactive or “ready” for the user.
To measure our apps’ “Time to Ready”, we added custom events and spans using Splunk RUM to capture the true time it takes for the app to be fully interactive and usable for the end user. In our actual code, we call this event: “o11y_user_logged_in_and_ready”.
We use the flexible OpenTelemetry Tracing APIs available in Splunk’s open-source RUM distributions for iOS and Android to account for complex application logic across multiple user paths and arrive at a single metric for “Time to Ready”. We also observe key checkpoints within the Time to Ready sequence so that we can quickly identify bottlenecks and continuously optimize the startup process.
Multiple User Paths, One Metric
P1 or Path 1 (most common app user path): An existing app user has their authentication token securely cached in a keystore. When the app opens, the “Time to Ready” (or “o11y_user_logged_in_and_ready”) custom event is started and the following steps are captured as spans:
a. When the app successfully authenticates, we capture the “o11y_socket_connection_attempt” span, completing our first checkpoint.
b. Next, the app requests data on the user’s account, their alerts, dashboards, and other application data which is sent back in multiple response messages and subsequently processed. This is captured in the “o11y_fetch_and_store_dashboards” span in Splunk RUM.
c. In parallel, the app applies its logic to route the user to the right screen and starts rendering as data streams in. When the screen is loaded, we stop and capture the “Time to Ready” custom event.
P2 or Path 2 (infrequent app user path; <10% of the time): If an app user with cached credentials attempts to authenticate and fails because the token is invalid or expired, the app routes the user to the login screen, then stops and reports the “Time To Ready” (or “o11y_user_logged_in_and_ready”) custom event. The app also stops any other spans that are still open, such as “o11y_socket_connection_attempt”. The same principle applies when a new user enters invalid credentials in the login flow.
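The two paths can be modeled as one event that is started once, collects checkpoint spans, and is stopped exactly once with an outcome. The sketch below is a minimal, library-free model of that idea. The `TimeToReady` class, its methods, and the outcome labels are hypothetical and exist only for this example; in the apps themselves this role is played by OpenTelemetry spans reported through Splunk RUM. Only the event and span names come from the description above.

```python
import time


class TimeToReady:
    """Hypothetical helper: one parent event plus named checkpoint spans."""

    EVENT_NAME = "o11y_user_logged_in_and_ready"

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._started_at = clock()
        self._open = {}       # span name -> start timestamp
        self.spans = {}       # span name -> duration in seconds
        self.duration = None  # set once the event has been stopped
        self.outcome = None   # e.g. "ready" (Path 1) or "login_required" (Path 2)

    def start_span(self, name):
        self._open[name] = self._clock()

    def end_span(self, name):
        self.spans[name] = self._clock() - self._open.pop(name)

    def stop(self, outcome):
        """Stop the event once; spans that are still open are closed with it."""
        if self.duration is not None:
            return  # already reported, so ignore late calls
        for name in list(self._open):
            self.end_span(name)
        self.duration = self._clock() - self._started_at
        self.outcome = outcome

    def slowest_span(self):
        """Name of the longest checkpoint, a first hint at the bottleneck."""
        return max(self.spans, key=self.spans.get) if self.spans else None


# Path 2: authentication fails while the socket span is still open.
ttr = TimeToReady()
ttr.start_span("o11y_socket_connection_attempt")
ttr.stop("login_required")  # also closes the open socket span
```

Because both paths end in the same `stop` call, every app launch produces exactly one “Time to Ready” measurement, and the outcome label keeps failed logins from being mistaken for fast successful starts.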
Login Failures
Being unable to log in when you’re in a hurry leads to user frustration and potentially app abandonment. Login failures have multiple causes, and Splunk RUM captures the status code and message for each of the following cases:
- 403: Incorrect username and password
- 503: Backend not accepting authentication requests
- 302: Misconfigured Single Sign On (SSO)
We keep a close eye on the rate of 503s as part of our SLIs and work with our backend teams to take immediate action any time it spikes.
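To make the idea of a 503 rate concrete, here is a minimal sketch that computes each failure code’s share of login attempts and checks the 503 share against a threshold. The function names and the 1% default threshold are assumptions chosen for illustration, not our actual alerting configuration; the status codes and their meanings are the ones listed above.

```python
from collections import Counter

# Failure cases listed above, keyed by HTTP status code.
FAILURE_REASONS = {
    403: "Incorrect username and password",
    503: "Backend not accepting authentication requests",
    302: "Misconfigured Single Sign On (SSO)",
}


def login_failure_rates(status_codes):
    """Map each observed failure code to its share of all login attempts."""
    total = len(status_codes)
    if total == 0:
        return {}
    counts = Counter(code for code in status_codes if code in FAILURE_REASONS)
    return {code: n / total for code, n in counts.items()}


def backend_unavailable_spike(status_codes, threshold=0.01):
    """True if the 503 share exceeds the (hypothetical) threshold."""
    return login_failure_rates(status_codes).get(503, 0.0) > threshold
```

Separating 503s from 403s matters because only the former points at the backend; a rise in 403s usually reflects user input rather than a service problem.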
Deep Mobile Application Observability
Ever since we added Splunk Real User Monitoring to our Observability stack, our mobile engineering teams have had a clear view into the end-user experience and how every code change and new version rollout impacts it.
We continuously measure, monitor and optimize various real-user metrics as SLIs, including the three SLIs we’ve shared above. Read our blog on “Optimizing Mobile App Startup with Splunk RUM” to learn how we identified front-end and back-end bottlenecks and improved our apps’ Time to Ready by more than 30%.
This blog was co-authored by Seerut Sidhu, Sr. Product Manager @Splunk.