To successfully observe modern digital platforms, a new data collection approach was needed. OpenTelemetry (OTel) is that answer: an industry-agreed open standard, rather than a single vendor's approach, for how observability (O11y) data should be collected from a platform. It separates data collection from the vendor's platform for data processing and visualisation, making the collection approach vendor agnostic. In this blog, we will look at why modern digital platforms drove the need for OTel and how combining it with Splunk's O11y platform solves the key challenges of managing these platforms today.
The Digital World and Users’ High Expectations:
The digital transformation revolution has been massive and unprecedented in its speed, size and scale. Technology innovations - cloud, microservices, containers, serverless compute and others - have changed not only how apps and platforms are built but also how these platforms are managed, with automation, speed of releases and 4 9’s availability being the fundamental principles of this world.
Thrown into the mix, too, are users' high expectations:
- Everything should be possible digitally. I recently received a wedding invitation via email with a click-through to a website containing all the information about the bride and groom-to-be and their special day. A far cry from the days when invites were physical pieces of paper sent in the post. So much easier now; all the information is accessible via a clickable link!
- Digital services should always be available, no matter where the user is, and should be easy to use. This has driven standards like 'four nines' (99.99%) availability, which allows only around 52 minutes of downtime per year.
- They should be performant - i.e. fractions of a second to load the page or complete the transaction.
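As a quick sanity check on the 'four nines' figure above, the allowed downtime for a given availability target can be computed directly (a minimal sketch; the thresholds shown are the standard industry ones):

```python
# Allowed downtime per year for common availability targets.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted at a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label} ({pct}%): {downtime_minutes(pct):.1f} min/year")
```

Note that the oft-quoted 'nine hours a year' actually corresponds to three nines (99.9%, about 8.8 hours); four nines allows under an hour.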
If users want to use a digital service and find that it is not available when they need it, that a transaction crashes and doesn't complete when they are buying a ticket, or that it is slow, they will go elsewhere - and their first port of call for complaining is social media.
Availability and performance are directly linked to consumer behaviour. A Google study found that as page load time increases from 1 second to 3 seconds, the probability of a user abandoning the page increases by 32%; if load time stretches from 1 second to 10 seconds, that probability increases by 123%.
This has put tremendous pressure on the teams that build, run and maintain these platforms; they have to ensure that they are always available and fast. At the same time, the very nature of these environments allows frequent changes - driving forward innovation and functionality for users, but equally adding to the risk of the platform not working for them. Time spent finding and fixing issues is waste: developers want to be building and innovating, and businesses want them releasing the next revenue-generating, feature-rich release, yet they can easily end up spending that time troubleshooting instead.
Why Does Traditional Monitoring Fail in These Environments?
Traditional monitoring no longer fits the requirements of this digital world. Our platforms are rich and evolving, use multiple technologies, change rapidly to drive innovation and are extremely complex. Traditional monitoring, though, is rooted in static, monolithic, much simpler environments, with the same fixed metrics collected. The approach is based on:
- No agreed standard for the collection of data; every vendor has their own approach which is tied into their platform and makes it difficult to swap in the future or adopt new, modern technologies.
- Heavyweight proprietary agents deployed within the monitored environment, each with their own data collection approaches and overhead implications.
- Metric collection based on what each vendor deems important - rather than what an SRE or DevOps team needs - missing those all-important custom metrics that provide core visibility into whether something is going wrong.
- Polling for metrics at fixed intervals - every 1 or 5 minutes - which leaves the platform unobserved for the time in between. If the platform fails during that gap, no visibility is provided.
- ‘Intelligent’ sampling techniques which do not capture all the executed distributed traces. This worked well in a static, monolithic platform but doesn’t work in a much larger, ever-changing environment, where components may live for only seconds and the same customer transaction can take many different paths - you need all traces captured to be able to solve any issue.
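The polling gap described in the list above can be illustrated with a small simulation (hypothetical numbers: a CPU spike lasting 90 seconds against a typical 5-minute poll):

```python
# Simulate a metric polled every 300 s versus collected every 10 s.
POLL_INTERVAL = 300   # seconds (traditional polling interval)
STREAM_INTERVAL = 10  # seconds (streaming-style collection)

def cpu_percent(t: int) -> int:
    """Hypothetical CPU metric: a 90-second spike starting at t=100s."""
    return 95 if 100 <= t < 190 else 20

def samples(interval: int, duration: int = 600) -> list[int]:
    """Values observed when sampling every `interval` seconds."""
    return [cpu_percent(t) for t in range(0, duration, interval)]

polled = samples(POLL_INTERVAL)      # sampled at t = 0 and t = 300
streamed = samples(STREAM_INTERVAL)  # sampled at t = 0, 10, ..., 590

print("polled sees spike:  ", max(polled) > 90)    # False - spike fell between polls
print("streamed sees spike:", max(streamed) > 90)  # True
```

The 5-minute poller samples at t=0 and t=300 and never sees the spike at all; the platform appeared healthy throughout.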
The Pitfalls of Vendor Lock-in:
Each vendor has their own proprietary agent and collects the metrics that they think are important. To enable better scalability and management of data, these agents process data locally - adding overhead to the observed platform - and the data is then polled and sampled. As these environments change rapidly and adopt new technologies, it becomes difficult to extend the monitoring or migrate to other solutions. You do not want vendor-specific monitoring systems to become the constraint on change and innovation in your platform. Yet this typically leads to staying with the same vendor (vendor lock-in) and then either adding additional monitoring or building your own solution.
Why is OTel Needed?
- Vendor agnostic - an agreed industry standard on how to collect this O11y data using a single lightweight collector agent.
- No heavyweight processing is done at the agent level, keeping overhead on the observed platform minimal.
- Separation of data collection from vendors’ tooling, thus freeing you from vendor lock-in and allowing rapid changes in the environment which can easily be observed with OTel.
- Easy and quick to ingest additional data from custom metrics through to logs.
- No need to deploy multiple agents or tooling to collect O11y data or have heavy proprietary agents installed that do processing on your platform, using vital resources.
- Cost reduction - you control the data that you want to use for monitoring rather than what a vendor thinks you should have.
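The vendor-agnostic separation in the list above is visible in the OpenTelemetry Collector's own configuration: data sources (receivers) and backends (exporters) are declared independently, so swapping vendors is a configuration change rather than a re-instrumentation. A minimal sketch (the exporter endpoint is a placeholder, not a real value):

```yaml
receivers:
  otlp:              # one standard wire format in from any instrumented app
    protocols:
      grpc:
      http:

processors:
  batch:             # lightweight batching only; no heavy local processing

exporters:
  otlphttp:          # swap or add exporters here to change backends
    endpoint: https://ingest.example.com   # placeholder endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

Because applications only ever speak the standard OTLP protocol to the collector, changing the `exporters` section is all it takes to point the same data at a different backend.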
Combining OTel with Splunk:
The Splunk O11y platform uses OTel to collect data from an observed platform and, utilising Splunk's unique capabilities, solves the challenges of traditional monitoring:
- Streaming metrics in real-time - any metric, including custom ones, can be streamed in seconds, avoiding observability blackout gaps in which problems go undetected.
- Custom metrics - the ability to choose the metrics that tell an SRE or a developer there is a problem is key to providing core visibility into these platforms.
- No sampling or gaps - whilst OTel doesn’t mandate full fidelity (capturing of all traces), it is a methodology that Splunk advocates. The reason is simple; transactions today take different routes through the platform and the platform itself is constantly changing. A group of transactions may work fine but another group may fail. Collecting all traces means that each and every problem is captured and is solvable quickly.
- Scalability - the volume of metrics, distributed traces and log data in these environments is not only large, but all of it is needed to manage the platform properly. Splunk is built to scale and manage that data to provide the required visibility.
- AI-directed troubleshooting - using ML and AI to quickly troubleshoot issues.
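The full-fidelity point in the list above can be made concrete with a toy workload (hypothetical numbers: 10,000 traces, 1% failing, a head-based sampler keeping 10%). The deterministic pattern here is contrived so the result is reproducible, but it illustrates the real risk: whether a failing trace is ever recorded depends entirely on how failures line up with the sampling decision, which is made before the outcome is known.

```python
# Deterministic toy workload: 10,000 traces, 1% fail, 10% head-sampled.
TRACES = 10_000

def failed(i: int) -> bool:
    """1% of traces error (trace 3, 103, 203, ...)."""
    return i % 100 == 3

def sampled(i: int) -> bool:
    """A head-based sampler that keeps every 10th trace."""
    return i % 10 == 0

total_failures = sum(failed(i) for i in range(TRACES))
captured = sum(failed(i) and sampled(i) for i in range(TRACES))

print(f"failures: {total_failures}, captured by 10% sampler: {captured}")
# -> failures: 100, captured by 10% sampler: 0
```

In this pattern the sampler happens to miss every single failure; random sampling would still be expected to miss roughly 90% of them. With full fidelity - no sampling - all 100 failing traces would be captured and solvable.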
Using OTel combined with Splunk allows you to observe your platforms in real-time with no visibility gaps, and to solve issues quickly so the platform stays available and performant for your users. It lets you drive your business forward with frequent, low-risk releases packed with innovative features - increasing revenue and customer satisfaction while reducing developer toil.
Try it for yourself - with a free Splunk O11y trial!