Modeling and Unifying DevOps Data Part 3: Pipelines

By Jeremy Hicks

How many software pipelines are required to keep your business running? How many of those build, test, release, or deploy code? The most likely answer here is “I don’t know…” followed by a set of mental gymnastics to try and determine a rough number that is most likely incorrect.

Digging deeper, how many of those pipelines failed in the last week, month, or year? How would you find that out? This question gets even more difficult to answer if your organization runs pipelines in several different tools like Jenkins, GitHub Actions, GitLab CI, or one of the many cloud SaaS offerings. But (as you may have guessed by now in this series of blog posts) data models can help you unify that data and come to grips with your software pipelines! You can even start unifying that data today by leveraging the props and transforms configs for GitHub and GitLab available in our open source repository for DevOps mappings on GitHub!

This post is the third in a series of posts (including Part 1: Issues & Work and Part 2: Code) devoted to modeling DevOps data into a common set of mappings. Why is a common set of mappings important? Just take a look at the recommendations by the NSA for securing CI/CD systems regardless of tooling. If even government entities are taking unified DevOps data seriously, you know it’s an important issue searching for a solution. And you want to be at least as agile as the federal government, right?

Data models can help! Security teams have been using them for some time to great effect with offerings like the Splunk common information models for security. In this post, we’ll focus on software pipelines and their commonalities across the Software Development Life Cycle (SDLC).

What Do Our Pipelines Have in Common?

Most organizations have a ton of different types of pipelines that may be running at any given time. These could be build and test pipelines running on a schedule, release pipelines that validate and package a new release, or even pipelines that manage infrastructure as code and deployment of new assets. These things all seem so different, but when we look at pipeline runs as higher-level objects, it becomes obvious that they do have some important overlap.

Commonalities Between Software Pipeline Runs:

- Every pipeline will have a pipeline id and/or pipeline name
- Every pipeline run object will have:
  - A run/job/execution identifier such as run id
  - A signifier of pipeline status to know if it is started, stopped, running, completed, etc.
  - A result such as successful, failed, etc.
  - Times noting when the run started and completed
  - A repository name, and repository organization or project associated with the code it handles or interacts with
- Every pipeline run object should have:
  - An associated latest commit hash for any code it is running, building, deploying, etc.
  - A field noting who or what started the pipeline run
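
As a rough illustration, the commonalities listed above could be written down as a single record type. This is only a sketch: the class name `PipelineRun` and every field name other than `run_id` and `latest_commit_hash` are assumptions made for this example, not names taken from any tool or from the DevOps mappings repository. It also assumes the result and completion time are empty while a run is still in progress.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class PipelineRun:
    """One pipeline run, regardless of which tool produced it (hypothetical shape)."""

    # Fields every pipeline run "will have".
    pipeline_name: str
    run_id: str
    status: str                       # e.g. "started", "running", "completed"
    result: Optional[str]             # e.g. "successful", "failed"; None while running
    started_at: datetime
    completed_at: Optional[datetime]  # None while running (assumption)
    repository_name: str
    repository_org: str
    # Fields every pipeline run "should have"; some tools may not supply them.
    latest_commit_hash: Optional[str] = None
    triggered_by: Optional[str] = None

    def duration_seconds(self) -> Optional[float]:
        """Run time in seconds, or None if the run has not completed."""
        if self.completed_at is None:
            return None
        return (self.completed_at - self.started_at).total_seconds()


# Example values are made up for illustration.
run = PipelineRun(
    pipeline_name="build-and-test",
    run_id="1234",
    status="completed",
    result="successful",
    started_at=datetime(2024, 1, 1, 12, 0, 0),
    completed_at=datetime(2024, 1, 1, 12, 5, 30),
    repository_name="web-app",
    repository_org="example-org",
    latest_commit_hash="abc123",
    triggered_by="scheduler",
)
print(run.duration_seconds())
```

Keeping the "should have" fields optional means a run from a tool that does not report a commit hash or a trigger can still be recorded alongside the others.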

These commonalities are true of GitHub, GitLab, Jenkins, or really any software development related pipeline tool. By using these common linkages across pipeline data we can start unifying pipeline execution data like `run_number` (GitHub) or `job_id` (GitLab) into a single field like `run_id` for querying data from various pipeline tools. Similarly, regardless of how a given tool tracks the specific commit of code it is executing (or executing against), that commit hash can be aligned to `latest_commit_hash` and used as a linking field between other elements of the SDLC. With these two bits of information we can now see which `run_id` is related to an issue/ticket and associated code for a given `latest_commit_hash` by matching to the hashes associated with those issue and code objects (as described in Part 1 and Part 2 of this series, respectively).
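
To make the field unification concrete, here is a minimal Python sketch of the idea. It is not the props and transforms configuration from the repository. Only `run_number`, `job_id`, `run_id`, and `latest_commit_hash` come from the text above; the other raw field names (`head_sha`, `sha`, `name`, `pipeline_name`) and the `unify` helper are assumptions chosen for illustration.

```python
# Per-tool mapping from raw field name to common field name.
# Raw names other than run_number and job_id are illustrative assumptions.
FIELD_MAPS = {
    "github": {
        "run_number": "run_id",
        "head_sha": "latest_commit_hash",
        "name": "pipeline_name",
    },
    "gitlab": {
        "job_id": "run_id",
        "sha": "latest_commit_hash",
        "pipeline_name": "pipeline_name",
    },
}


def unify(source: str, raw: dict) -> dict:
    """Rename tool-specific fields to the common field names."""
    unified = {"source": source}
    for raw_field, common_field in FIELD_MAPS[source].items():
        if raw_field in raw:
            unified[common_field] = raw[raw_field]
    # Store run_id as a string so ids from different tools compare consistently.
    if "run_id" in unified:
        unified["run_id"] = str(unified["run_id"])
    return unified


github_run = unify("github", {"run_number": 42, "head_sha": "abc123", "name": "ci"})
gitlab_run = unify("gitlab", {"job_id": 9001, "sha": "abc123", "pipeline_name": "deploy"})

# Both records can now be queried with the same field names.
print(github_run["run_id"], gitlab_run["run_id"])
print(github_run["latest_commit_hash"] == gitlab_run["latest_commit_hash"])
```

Once both tools share `run_id` and `latest_commit_hash`, a single query can cover runs from either source.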

Obviously this helps paint a picture of a unit of work from conception and ticket creation, through code commit/push/PR, and now build/test/release/deployment via pipeline. From the software developer and PM perspective, this is incredible data for demonstrating and measuring productivity. Additionally, this data enables organizations to start tracking the now popular DORA metrics of Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Time to Restore Service!
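
As a rough sketch of how two of those metrics might be derived from unified run records, consider the Python below. It makes simplifying assumptions: the `is_deployment` and `completed_at` fields and the `dora_summary` helper are hypothetical, Deployment Frequency is approximated as successful deployment runs per day, and Change Failure Rate is approximated as the share of deployment runs whose result is "failed". Your organization's definitions may differ.

```python
from datetime import datetime


def dora_summary(runs, window_start, window_end):
    """Approximate deployment frequency and change failure rate for a time window."""
    window_days = (window_end - window_start).total_seconds() / 86400
    if window_days <= 0:
        raise ValueError("window_end must be after window_start")

    # Only deployment runs that completed inside the window are counted.
    deployments = [
        r for r in runs
        if r["is_deployment"] and window_start <= r["completed_at"] < window_end
    ]
    successes = [r for r in deployments if r["result"] == "successful"]
    failures = [r for r in deployments if r["result"] == "failed"]

    return {
        "deployments_per_day": len(successes) / window_days,
        # None when there were no deployments, to avoid dividing by zero.
        "change_failure_rate": len(failures) / len(deployments) if deployments else None,
    }


# Made-up unified records from two tools.
runs = [
    {"source": "github", "is_deployment": True, "result": "successful", "completed_at": datetime(2024, 1, 2)},
    {"source": "gitlab", "is_deployment": True, "result": "failed", "completed_at": datetime(2024, 1, 3)},
    {"source": "gitlab", "is_deployment": True, "result": "successful", "completed_at": datetime(2024, 1, 4)},
    {"source": "github", "is_deployment": True, "result": "successful", "completed_at": datetime(2024, 1, 6)},
    {"source": "github", "is_deployment": False, "result": "failed", "completed_at": datetime(2024, 1, 6)},
]

summary = dora_summary(runs, datetime(2024, 1, 1), datetime(2024, 1, 8))
print(summary)
```

Because the records already share field names, the calculation never needs to know which tool a run came from.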

Figure 1-1. An example dashboard leveraging unified pipeline data to establish DORA and other pipeline related metrics regardless of which tool the data came from.

Using software pipeline data we can start to see “where the rubber meets the road” in terms of development efforts towards new features, added resiliency efforts, and other point-in-time changes to the greater software ecosystem of a given organization. Detailed and unified pipeline data is also invaluable during and after an incident as it helps to add context to the who, what, when, where, and why of what may have changed. Incident resolution also becomes much easier to follow, as pipeline data associated with code changes and rollbacks can be tracked from end to end. Even more impressively, by adding an attribute to monitoring metrics denoting the latest commit currently running, the observability of a given code change can be tracked all the way into production. With that sort of data it becomes much easier to answer questions like “are we running any old instances that weren’t caught in the latest rollback or regression fix?” But we’ll get deeper into monitoring as this series of blog posts progresses!
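
For example, the "old instances" question could be answered with something like the sketch below, assuming each running instance reports the commit it is running as a `latest_commit_hash` attribute. The `instance_id` field and the `find_stale_instances` helper are hypothetical names used only for illustration.

```python
def find_stale_instances(instances, expected_commit):
    """Return ids of instances not running the expected commit."""
    return sorted(
        inst["instance_id"]
        for inst in instances
        # A missing attribute is treated as stale, since it cannot be verified.
        if inst.get("latest_commit_hash") != expected_commit
    )


# Made-up monitoring data: one instance per record.
instances = [
    {"instance_id": "web-1", "latest_commit_hash": "def456"},
    {"instance_id": "web-2", "latest_commit_hash": "abc123"},  # missed the rollback
    {"instance_id": "web-3", "latest_commit_hash": "def456"},
    {"instance_id": "web-4"},                                   # reports no commit
]

# The commit hash of the most recent successful deployment run.
stale = find_stale_instances(instances, expected_commit="def456")
print(stale)
```

The expected commit would come from the unified pipeline data, which is what ties the monitoring view back to a specific run.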

Data Models (for Pipelines): An Embarrassment Of Riches!

The beautiful thing about a data model for pipelines is the immediate value of being able to quickly answer questions about pipelines and how they impact the build, test, release, and deployment of code. What sorts of questions? Let's take a look at some common ones:

Who is deploying our code? How much is automation versus manually triggered runs?
What was the latest pipeline run doing? What code was included in that pipeline run and what was the latest commit hash?
Where and how are we deploying our code? How often do deployments fail? How often are we manually deploying code or rolling back?
When did a specific feature or fix get deployed?
Was there any deployment of unexpected code or code from an unusual branch?
How long is it taking to go from merged to deployed code? Is that time improving?
How long is it taking for our pipelines to run? How can we improve that? Are there manual approvals or other long running steps involved?
What are we actually running in production, when was it deployed, and does that match with what we think is running in production?
What is the status of each of our pipelines and who started each of those runs?

Figure 1-2. Quickly answer questions like “What is the latest status of each of our pipelines and who started each of those runs?” This simple SPL shows us Pipeline status across GitLab and GitHub in one place!
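
The SPL behind the figure is not reproduced here, but the same question can be sketched in Python over unified records. This is an illustration only: the `started_at`, `triggered_by`, `source`, and `pipeline_name` field names and the `latest_run_per_pipeline` helper are assumptions, not fields confirmed by the mappings repository.

```python
from datetime import datetime


def latest_run_per_pipeline(runs):
    """Keep only the most recently started run for each (source, pipeline) pair."""
    latest = {}
    for run in runs:
        key = (run["source"], run["pipeline_name"])
        if key not in latest or run["started_at"] > latest[key]["started_at"]:
            latest[key] = run
    return latest


# Made-up unified records from two tools.
runs = [
    {"source": "github", "pipeline_name": "build", "run_id": "101", "status": "completed",
     "triggered_by": "alice", "started_at": datetime(2024, 1, 1, 9, 0)},
    {"source": "github", "pipeline_name": "build", "run_id": "102", "status": "running",
     "triggered_by": "scheduler", "started_at": datetime(2024, 1, 2, 9, 0)},
    {"source": "gitlab", "pipeline_name": "deploy", "run_id": "7", "status": "completed",
     "triggered_by": "bob", "started_at": datetime(2024, 1, 1, 15, 0)},
]

latest = latest_run_per_pipeline(runs)
for (source, name), run in sorted(latest.items()):
    print(source, name, run["status"], run["triggered_by"])
```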

But these aren’t the only questions by any means. Armed with pipeline data from any source, in combination with work and code data models, there are surprisingly few dev related questions that remain elusive!

I’m going to keep hammering this point: Linking the various SDLC components (planning to code, test, and release, even onward to monitoring) is perhaps the greatest value of a data model for DevOps! With so much available SDLC data and the ability to use fields like `issueNumber`, `commit` / `hash`, and `run_id`, it is possible to draw very clear lines from concept, to code, to deployment in production. The value for developer productivity metrics alone is huge! But DevOps data models also provide incredible value for incident and root cause analysis as well as enabling DORA metrics across the organization. This sort of data also enables Security or DevSecOps teams to better understand and track changes across the software ecosystem. Armed with unified pipeline data they can better determine when and how a vulnerability may have snuck into production and also when a fix was applied for that vulnerability.
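
A small sketch of that linkage, in Python: given an issue number, find the commits tied to it and then the pipeline runs that handled those commits. The field names `issueNumber`, `hash`, `run_id`, and `latest_commit_hash` come from this series; the assumption that each commit record carries an `issueNumber`, and the `trace_issue` helper itself, are illustrative only.

```python
def trace_issue(issue_number, commits, runs):
    """Return the run_ids of pipeline runs whose latest commit belongs to the issue."""
    # Step 1: issue -> commit hashes.
    hashes = {c["hash"] for c in commits if c["issueNumber"] == issue_number}
    # Step 2: commit hashes -> pipeline runs.
    return [r["run_id"] for r in runs if r["latest_commit_hash"] in hashes]


# Made-up code and pipeline records.
commits = [
    {"hash": "abc123", "issueNumber": 17},
    {"hash": "def456", "issueNumber": 17},
    {"hash": "999fff", "issueNumber": 18},
]
runs = [
    {"run_id": "42", "source": "github", "latest_commit_hash": "abc123"},
    {"run_id": "9001", "source": "gitlab", "latest_commit_hash": "def456"},
    {"run_id": "43", "source": "github", "latest_commit_hash": "999fff"},
]

print(trace_issue(17, commits, runs))
```

The commit hash does all the work here: it is the one value that appears in both the code data and the pipeline data.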

Next Steps

Want to hear more? Interested in the next steps of a DevOps data model that helps unify the data between Jenkins, GitHub, GitLab and other deployment pipeline tools? Take a look at our public GitHub repository for DevOps mappings that includes props and transforms for GitHub and GitLab to get started unifying your DevOps data. More integrations will be added to that repository over time and we invite your contributions as well! But this isn’t the end of our journey. Next time we’ll investigate the commonalities in observability monitoring data that can be used along with the DevOps data model to enhance your business resiliency, security, and observability all the way into production.

Interested in how Splunk does Data Models? You can sign up to start a free trial of Splunk Cloud Platform and start Splunking today!

This blog post was authored by Jeremy Hicks, Staff Observability Field Innovation Solutions Engineer at Splunk with special thanks to: Doug Erkkila, Chad Tripod, David Connett, and Todd DeCapua for collaborating, brainstorming, and helping to develop the concept of a DevOps data model.
