Jenkins, OpenTelemetry, Observability

If you’re like most organizations, you’re leveraging Jenkins for all sorts of things: deployment pipelines, automated API tests, and even glorified cron jobs, just to name a few.

How Do You Gain Insight Into These Various Types of Pipelines?

  • Build Logs: Good for auditing and troubleshooting, but difficult to use for long-term metric trends.
  • Time Series Metrics: Great for establishing the health of Jenkins instances and identifying issues over longer time periods. Less great for gathering data on specific jobs and/or steps of jobs due to higher cardinality.
  • Tracing/APM: A more uncommon approach that provides detailed waterfall charts of individual runs and steps. Allows checking outcomes and individual step data at a glance. Consider it a combination of build log detail and time series visualization over time.
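
To make the waterfall idea concrete, here is a minimal sketch that models one pipeline run as a set of timed spans and draws them on a shared timeline. The `Span` class, the stage names, and the timings are all hypothetical, chosen only for illustration; real trace data from the plugin carries far more detail.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One timed unit of work in a pipeline run (hypothetical structure)."""
    name: str
    start_s: float  # seconds since the run started
    end_s: float
    depth: int = 0  # nesting level: 0 = whole run, 1 = stage


def render_waterfall(spans, width=40):
    """Return text lines that draw each span as a bar on a shared timeline."""
    total = max(s.end_s for s in spans)
    lines = []
    for s in sorted(spans, key=lambda s: (s.start_s, s.depth)):
        duration = s.end_s - s.start_s
        offset = round(s.start_s / total * width)
        length = max(1, round(duration / total * width))
        label = ("  " * s.depth + s.name).ljust(22)
        lines.append(f"{label}|{' ' * offset}{'#' * length} {duration:.0f}s")
    return lines


# Made-up run: one parent span for the run, one child span per stage.
run = [
    Span("my-pipeline run 42", 0, 100, 0),
    Span("Checkout", 0, 10, 1),
    Span("Build", 10, 60, 1),
    Span("Test", 60, 90, 1),
    Span("Deploy", 90, 100, 1),
]

print("\n".join(render_waterfall(run)))
```

Each bar starts where its stage began and is as long as the stage took, which is why a slow step stands out at a glance in a way it does not in a build log.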

Tracing and APM for Jenkins recently became much more straightforward with the advent of the OpenTelemetry project and an OpenTelemetry Jenkins Plugin (maintained by Cyrille Le Clerc). Once configured, a single click can take you from your Jenkins job into a detailed waterfall chart of the entire pipeline run!


Why Would I Want APM Data from Jenkins?

Combining the power of OpenTelemetry (OTEL), Jenkins, and Splunk APM, you can leverage the granularity of distributed tracing to understand specifics of your Jenkins usage that were previously difficult to uncover, all while keeping full control of your data.

Build queue times starting to become excruciatingly long? Quickly identify builds and steps holding up Jenkins for unusual amounts of time.

Noticing a slow increase in the time it takes to run pipelines across your organization? Send your Jenkins APM data through Splunk Log Observer to emit time series metrics of all steps and easily visualize increased (or decreased) time spent on various steps across all jobs even after your Jenkins data has aged out of APM. 

Are calls to external services taking longer than average? Perhaps git checkout takes longer than average or a given API’s response has become slower over time. Splunk APM’s Tag Spotlight can help visualize lengthy calls to external services in your pipeline with P50, P90, and P99 values.
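
As a rough illustration of what P50, P90, and P99 mean for a step's duration, the sketch below computes nearest-rank percentiles over a handful of made-up checkout timings. The sample values and the nearest-rank method are assumptions for illustration; this is not a description of how Tag Spotlight computes its values.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical durations (seconds) of a git checkout step across 10 runs.
checkout_durations_s = [12, 11, 13, 12, 14, 11, 12, 95, 13, 12]

for p in (50, 90, 99):
    print(f"P{p}: {percentile(checkout_durations_s, p)}s")
```

With these samples the median stays near the typical run while the P99 is dominated by the single slow checkout, which is the kind of gap that points to an intermittent problem with an external service.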

Want to know when another Team’s builds are happening that may impact your service? Set up a detector on their deployments and have an event marker show up on your dashboards to quickly establish if their deployment has impacted your service’s performance.

APM (or distributed tracing for those historically inclined) is a powerful tool for understanding interactions over the entire lifespan of a given process; in this case, Jenkins deployments. Not only does it give you a nifty waterfall chart of where time was spent in each step of a Jenkins deployment, but it also provides additional data to aggregate with more common time series metrics and traditional build logs. Various parts of your organization may benefit from Jenkins trace data in unexpected ways:

  1. IT Operations / Support Analysts: As a member (or Head) of the IT Operations team, I want up-to-date build information on important services so I can notify Software teams if recent deployments are causing or related to a service interruption.
  2. DevOps / SRE: As a member (or Head) of a DevOps or SRE team, I need to ensure services are healthy, and if not, quickly track down the cause. The ability to provide stakeholders with visibility into issues caused by deployment of applications and infrastructure will help them improve their software development and deployment practices, improving overall MTTD.
  3. Software Developers: As a member (or Head) of a Software Development team, I want to know if a recent deployment of my own software or upstream services is causing an issue before customers are impacted, without jumping between different tools and UIs.
  4. CI/CD: As a member (or Head) of a team in charge of Continuous Integration / Continuous Delivery solutions, I want to understand why and where our CI/CD pipelines are slowing down, and how best to address any issues to quickly improve the CI/CD services provided to DevOps, SRE, and Software Development teams.

With Jenkins data in Splunk APM, you can address these concerns quickly in one place without being overwhelmed by tool sprawl. There is no need to jump between different tools and interfaces for Jenkins, logging, and monitoring data to understand what's really going on.

Setup: How to Hit the Ground Running

To get set up quickly, check out the GitHub repository for OpenTelemetry Collector configuration examples, documentation, and two Splunk Observability Cloud Dashboard exports to get you started. Armed with these artifacts and an OpenTelemetry Collector, you’ll quickly be able to provide more detailed Jenkins insights for IT Operations, CI/CD teams, and DevOps professionals.

Figure 1-1. Get detailed Jenkins pipeline metrics with Jenkins APM data

Out Of The Box Dashboards

The GitHub repository linked as part of this blog includes two dashboards: one for understanding specific Jenkins pipelines and one for overall Jenkins health. They can be leveraged as-is or used as a starting point for building your own more detailed deployment dashboards.

Also included in the GitHub repository are instructions and SignalFlow for setting up a Detector to notify you of failed deployments. This sort of detector is useful not only for knowing when your own deployments have issues, but also for knowing when an upstream service you depend on is having a problem due to a failed (or successful) deployment. Exposing these types of events on your dashboards can help provide more context with less tool sprawl.

How do you get these insights today and how much effort does it require?

Figure 1-2. Overall Jenkins Health: Observe valuable Jenkins agent, build queuing, and even detailed step metrics (with Log Observer) at a glance.

The Future

OpenTelemetry, APM, and Infrastructure Monitoring are crucial tools for understanding your services, but until now they have been separate. With their powers combined in one tool, you can more quickly establish the effects of deployments, understand Jenkins performance, and notify teams of issues with their own or other services related to software builds and releases. But the future is even brighter! These additional insights into Jenkins can help unlock metrics for better understanding the larger impacts of DevOps within your organization.

Jenkins and DORA: Best DevOps Friends

DevOps Research and Assessment (or DORA) metrics address a fundamental set of concerns when attempting to measure DevOps activity and performance. The four key metrics associated with DORA that may benefit from or require additional Jenkins context are:

  • Deployment Frequency: How often are you deploying your code with Jenkins? Chances are that, with a bit of effort, you can dig up and report on this data already. But imagine a dashboard per org, team, or service showing this number in a single chart by leveraging APM or Log Observer data emitted from Jenkins.
  • Change Failure Rate: Going hand in hand with Deployment Frequency is Change Failure Rate. Much like the familiar monitoring metric of Error Rate, Change Failure Rate can be visualized quickly from your Jenkins APM data at various levels of organizational complexity. This metric can be invaluable for determining and prioritizing DevOps work related to improving your delivery and deployment of software.
  • Mean Lead Time for Changes: Knowing when you’re deploying is a crucial element of establishing the overall development time required to get a change from ticket inception to final deployment into production. Using your new Jenkins data in Observability Cloud along with some additional signals from other software like Jira and GitHub, you’re well on your way to establishing a trackable flow from ticket, to development, and on to deployment.
  • Time to Recovery: The final piece of the DORA puzzle, directly related to its observability-focused cousin, Mean Time To Recovery (MTTR). Understanding Time to Recovery requires Jenkins metrics to know when the deployment with a breaking change went out and when the fix was finally deployed to production.
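
The arithmetic behind three of these metrics can be sketched from a plain list of deployment records. Everything here is an assumption for illustration: the `Deployment` fields, the `caused_failure` flag (which in practice would have to come from incident or monitoring data rather than from Jenkins alone), and the sample timestamps. Mean Lead Time for Changes is left out because it needs ticket and commit data from other systems.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    service: str
    finished_at: datetime
    caused_failure: bool  # hypothetical flag: did this deployment ship a breaking change?


def deployment_frequency(deploys, days):
    """Average number of deployments per day over a window of `days` days."""
    return len(deploys) / days


def change_failure_rate(deploys):
    """Fraction of deployments that caused a failure (0.0 when there are none)."""
    if not deploys:
        return 0.0
    return sum(d.caused_failure for d in deploys) / len(deploys)


def times_to_recovery(deploys):
    """Time from each breaking deployment to the next deployment that did not break."""
    recoveries = []
    broken_since = None
    for d in sorted(deploys, key=lambda d: d.finished_at):
        if d.caused_failure and broken_since is None:
            broken_since = d.finished_at
        elif not d.caused_failure and broken_since is not None:
            recoveries.append(d.finished_at - broken_since)
            broken_since = None
    return recoveries


# Made-up week of deployments for a single service.
base = datetime(2024, 1, 1, 9, 0)
deploys = [
    Deployment("checkout", base, False),
    Deployment("checkout", base + timedelta(days=1), True),
    Deployment("checkout", base + timedelta(days=1, hours=2), False),
    Deployment("checkout", base + timedelta(days=3), False),
]

print("Deployments per day:", deployment_frequency(deploys, days=7))
print("Change failure rate:", change_failure_rate(deploys))
print("Times to recovery:", times_to_recovery(deploys))
```

In a real setup the records would be derived from the Jenkins trace or log data described above, and the same functions could be applied per org, team, or service by filtering the list first.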

Next Steps

Armed with your new Jenkins metrics and APM data, get out there and scrutinize pipelines, evaluate deployments, and generally push your DevOps Magic™ to the limit!

Want to quickly start understanding your Jenkins deployment? You can sign up to start a free trial of the Splunk Observability Cloud suite of products today!

This blog post was authored by Jeremy Hicks, Observability Field Solutions Engineer at Splunk with special thanks to: Doug Erkkila, Adam Schalock, Todd DeCapua, Tom Martin, Marie Duran, and Joel Schoenberg at Splunk.

Posted by Jeremy Hicks

Jeremy Hicks is an observability evangelist and SRE veteran from multiple Fortune 500 E-commerce companies. His enthusiasm for monitoring, resiliency engineering, cloud, and DevOps practices provides a unique perspective on the observability landscape.