Most front-end developers and practitioners are familiar with real user monitoring (RUM) tools as a means to understand how end users perceive the performance of applications. Few people, however, are aware of the history of the RUM market, which goes back more than two decades. Over the years, as the internet has evolved with new technologies, RUM tools have evolved in lockstep to cater to the ever-changing needs and use cases of engineering teams. In this post, we aim to trace this history, and argue that we are again on the precipice of change, where legacy RUM tools are no longer good enough for new users and their needs.
RUM 1.0 (~Early 2000s) Server-Side Rendered Apps
The nascent days of the web were characterized by a lack of standards in front-end architectures. Netscape and Microsoft were engaged in fierce browser wars, and each pushed its preferred way of building dynamic experiences, interactions, and animations in web content.
The enterprise apps of this era were monoliths hosted in corporate data centers. They were designed as collections of discrete pages, in an architecture known as an MPA (multi-page application).
Netscape Navigator from the earliest stages of web monitoring
Since pages were rendered server-side, the user experience depended critically on server processing time and network latency (from server to client). RUM tools of this era measured document loads and view counts, as well as server and network times for each page. The main RUM users in this era were ITOps and helpdesk teams:
- They cared most about document-load times and their breakdown into network time, server-render time, server-processing time, and database time. Their primary motivation was to identify which internal team to contact in the event of a customer complaint.
- Since rendering was done by a monolith, they cared only about a cumulative server-processing time, and they used RUM to measure this metric.
- ITOps also cared about website uptime and network performance.
Several technological shifts during ~2005-2010 led to the emergence of RUM 2.0 tools.
1. Client-side rendering began to gain popularity, for several reasons:
- Emergence of third-party services, such as Google Maps, that were frequently invoked client-side.
RUM 1.0 tools did not monitor client-side interactions at all.
2. Emergence of public cloud platforms: Developers increasingly became responsible for deployment and maintenance, in addition to writing code. They needed tools that gave visibility into all parts of the stack (e.g., APM for the backend, RUM for the frontend, NPM for the network) instead of siloed RUM 1.0 tools.
RUM 2.0 (~Early 2010s) Rise of Client-Side Rendering
The main users of RUM were the ITOps team and, increasingly, SREs.
- Single Pane of Glass: SREs wanted a single pane of glass for answering the question “If a user feels that the site is not working, who do I blame? The browser/end user? The network? The monolithic application server? The databases?”
Since apps were mostly MPAs consisting of discrete pages, RUM tools continued to measure individual page views and document loads. Each document load generated some activity on the browser, some network activity, some transactions on the monolith, and finally some database queries. These transactions happened linearly, in a sequence. The complexity of the system was relatively low, as each point in this chain made a request to a single point (or relatively few points).
The sequential nature of these activities meant that RUM tools that measured the overall document-load time and split it into server time, network time, DOM-processing time, and page-rendering time were generally good enough.
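This kind of split can be read directly off the browser's Navigation Timing API today. A minimal sketch, where the phase boundaries are illustrative approximations rather than the exact arithmetic any particular RUM product uses:

```javascript
// Sketch: splitting one document load into the phases RUM 2.0 tools reported.
// `entry` is a PerformanceNavigationTiming object; in a browser you get it
// via performance.getEntriesByType('navigation')[0].
function breakdownDocumentLoad(entry) {
  return {
    // DNS + TCP + request/response: fetch start until the document fully arrived
    networkMs: entry.responseEnd - entry.fetchStart,
    // Rough server time: request sent until the first response byte (TTFB)
    serverMs: entry.responseStart - entry.requestStart,
    // HTML parsing and DOM construction, until DOMContentLoaded handlers finished
    domProcessingMs: entry.domContentLoadedEventEnd - entry.responseEnd,
    // Remaining subresource, layout, and onload work until the load event finished
    renderMs: entry.loadEventEnd - entry.domContentLoadedEventEnd,
  };
}
```

In a browser, `breakdownDocumentLoad(performance.getEntriesByType('navigation')[0])` yields the per-phase milliseconds for the current page load.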
A few trends started disrupting RUM 2.0 tools.
1. The rise of single-page applications (SPAs), where the document loads once and subsequent content arrives via XHRs/fetches:
a. Since the document load happens only once, page views/document loads became less relevant as units of measurement. Instead, RUM solutions needed to measure the performance of the many API requests and interactions between the browser and its resources (i.e., XHRs/fetches).
b. Client-side code became much more complex, and more prone to unforeseen bugs, errors, and performance issues.
2. Cloud-native application development and the rise of highly distributed applications: these caused a major shift in how applications are built, deployed, and operated. A modern application is a distributed system of services (or microservices) built in-house, plus third-party cloud services.
a. This explosion of complexity on the backend meant that if a transaction’s server-processing time was too high, it was not trivial to say which sequence of backend operations led to the high response time.
b. APM 2.0 tools were not designed to capture inter-service delays at full fidelity. If RUM 2.0 tools indicated high server-processing time, there was no way to identify the root cause in a cloud-native SPA; thus the need for an end-to-end trace arose.
In recent years, application development has been fast becoming fully cloud native. On the front end, apps are increasingly written in JS frameworks such as React and Angular. The web page loads as a single document, followed by multiple XHRs/fetches to a variety of resources. Very little rendering occurs on the server side.
Front ends are typically much more complex than before. They may use multiple JS frameworks, depend on multiple third parties to work and perform correctly, and may bridge or touch multiple parts of a customer’s business. The backend is composed of several loosely coupled microservices and serverless functions.
Splunk RUM’s overview page links modern user-experience metrics with backend system performance
The main users of RUM are now SREs and, increasingly, front-end developers.
1. Unit of Measurement: They need a tool that can measure individual browser-resource interactions, not just document loads. As an example, consider the infinite-scrolling experience in Twitter. Five minutes of scrolling could generate hundreds of XHRs, any of which could be slow, yet this would count as a single document load in a RUM 2.0 tool.
2. End-to-end Tracing: Cloud-native customers need to find the smoking gun when a problem occurs. The only way an SRE can claim with certainty that a problem originated in the backend is by looking at the exact backend trace as it propagates through the distributed backend. And they can only get this information if the backend tracing is done at full fidelity, without any sampling, and tied to the front-end activity.
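Per-request measurement of this kind maps naturally onto the browser's Resource Timing API, which emits one entry per XHR/fetch. A minimal sketch, with an illustrative one-second slowness threshold (the threshold and helper names are ours, not any product's):

```javascript
// Sketch: watching individual XHR/fetch calls instead of one document load.
// Each network request the page makes shows up as a 'resource' timing entry.
function summarizeXhrEntries(entries) {
  // Keep only XHR/fetch traffic; ignore images, scripts, stylesheets, etc.
  const xhrs = entries.filter(
    e => e.initiatorType === 'xmlhttprequest' || e.initiatorType === 'fetch'
  );
  return {
    count: xhrs.length,
    // Flag calls slower than 1s (illustrative threshold) by their URL
    slow: xhrs.filter(e => e.duration > 1000).map(e => e.name),
  };
}

// Browser wiring (a no-op outside a browser environment):
if (typeof window !== 'undefined' && 'PerformanceObserver' in window) {
  new PerformanceObserver(list => {
    const summary = summarizeXhrEntries(list.getEntries());
    if (summary.slow.length) console.log('slow XHRs:', summary.slow);
  }).observe({ type: 'resource', buffered: true });
}
```

With this in place, each of the hundreds of scroll-triggered XHRs is observed individually, rather than being folded into a single document-load number.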
Enter Splunk RUM!
Splunk Real User Monitoring has been engineered to provide visibility into cloud-native applications, whose front ends are complex, feature dozens of API calls to a variety of providers, and are typically single-page apps (or hybrid apps) written in a framework such as React or Angular.
Splunk brings the philosophy of unsampled distributed tracing to front-end monitoring. This ensures that SREs and front-end developers will never miss an anomaly, and will have visibility into every user interaction, every resource, and every XHR made by any end-user. If a customer deploys both Splunk RUM and Splunk APM, they will have complete end-to-end visibility: for any request made by a browser, they will be able to identify the unique backend trace that the browser request initiated. In other words, engineers will always have that smoking gun to answer the question: “If an end-user has a problem, is it a front-end issue, a backend issue, or something else?”
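Linking a browser request to its unique backend trace is conventionally done by propagating a W3C Trace Context (`traceparent`) header on each XHR/fetch, which the backend tracer picks up. The helper below is a simplified, illustrative sketch of that mechanism, not the actual agent code:

```javascript
// Sketch: generating a W3C Trace Context `traceparent` header so a backend
// tracer can tie its server-side trace to the originating front-end request.
function randomHex(byteCount) {
  // Produce 2 hex characters per byte. (A production agent would also guard
  // against the spec-invalid all-zero trace and span ids.)
  let s = '';
  for (let i = 0; i < byteCount * 2; i++) {
    s += Math.floor(Math.random() * 16).toString(16);
  }
  return s;
}

function makeTraceparent() {
  // version "00", 16-byte trace-id, 8-byte parent span-id, "01" = sampled
  return `00-${randomHex(16)}-${randomHex(8)}-01`;
}
```

A request instrumented this way, e.g. `fetch('/api/checkout', { headers: { traceparent: makeTraceparent() } })` (URL illustrative), lets the backend APM join its trace to the exact front-end call that started it; RUM agents attach this header automatically.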
Splunk RUM seamlessly connects transactions from the frontend to backend services
Splunk RUM is part of the Splunk Observability Cloud, which provides a single pane of glass for customers to gain unprecedented visibility into their infrastructure, applications, and logs. To find out more about Splunk Real User Monitoring, refer to this whitepaper and start a free trial today.
- Splunk Real User Monitoring product page
- Splunk Digital Experience Monitoring product page
- How to Optimize Digital Experience with Service Level Objectives whitepaper