Beyond the Music: Lessons from Spotify's Fleet Management Journey

Learn how Spotify leverages standardized tooling, automated pipelines, and organizational alignment to transform challenges into opportunities for innovation.

Managing software at scale is a challenge that every growing organization must confront. Spotify, with its 675 million users, 100 million tracks, and operations in over 180 markets, has faced these challenges head-on. In this article, we’ll further explore Spotify’s journey to improve developer experience, leverage automation, and manage software at scale through fleet management. Their story offers useful lessons for companies aiming to streamline operations, cut costs, and reduce engineering toil, and it also clarifies the technical requirements necessary to achieve this level of automation.

 

The scale of the challenge 

Since its launch in 2008, Spotify has grown to more than 2,700 engineers working around the globe. The team averages more than 4,100 production deployments every day. In 2024 alone, Spotify processed 652,000 automated pull requests across its repositories, managed more than 42,000 data pipelines, and captured 1.6 trillion events per day, figures that illustrate the scale and velocity of its development efforts.

 

However, operating at this scale brings significant challenges. Imagine tackling common scenarios like:

 

  • Updating the Java runtime across all backend services 
  • Upgrading the internal service framework 

 

These types of migrations are critical but time-consuming, often spanning months. The burden of maintenance work — migrating dependencies, updating frameworks, patching vulnerabilities, and deprecating outdated technologies — was falling on individual teams, leading to fragmentation and inefficiencies.

 

The problem: Engineering toil and fragmentation

In its early years, Spotify’s adoption of autonomous development teams, called squads, that operated like startups allowed for rapid innovation. Squads made their own decisions, built their own software, and moved fast. However, as the company grew and each team maintained its own tools, frameworks and processes, this autonomy created a fragmented technology ecosystem.

 

Over the years, the number of software components grew quickly and created a heavy maintenance burden. Engineers faced toil: necessary but repetitive work that detracted from building new features. Spotify describes this kind of engineering toil as “the work that gets between me and what I want to do.”

 

The solution: A shift to fleet management

To address fragmentation, Spotify inverted the problem. Instead of squads handling maintenance individually, the platform organization automated it. The platform team adopted a new mantra: “We take the pain away, so our users (engineers) don’t have to perform this toil.”

 

This shift became a company objective and was guided by three core principles:

 

  1. Standardize the technology stack through the Golden Tech framework
  2. Help teams adopt Golden Tech to ensure consistency 
  3. Leverage automation to remove maintenance toil from engineering teams 

 

Standardizing through the Golden Tech framework

Standardizing technology involved a multifaceted approach. As part of the Golden Tech framework, Spotify built a tech radar tool defining a set of technology standards for everyone to build on, which they called golden technologies. Spotify also standardized how they build and deploy different types of software through golden paths, or self-paced guide books. To further enable and facilitate standardization, the team also implemented software templates within the Golden Tech framework that streamline creation of back-end services, pipelines, websites, or any other type of software.

 

Measuring progress and adoption

To ensure company-wide adoption of Golden Tech and unify its technology stack, Spotify developed a tool that measures the quality, reliability, and alignment of the software ecosystem. The tool incentivizes teams to reach certification levels by passing checks for their software, progressing toward the "golden state." The closer teams get, the more automated, free maintenance updates they receive.
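
The idea of progressing through certification levels by passing checks can be sketched roughly as follows. This is a minimal illustration, not Spotify's tool: the level names (`bronze`, `silver`, `gold`), the metadata fields, and the checks themselves are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Component = Dict[str, object]
Check = Callable[[Component], bool]

# Hypothetical levels, ordered from lowest to highest. Each level lists the
# checks a component must pass; names and fields are illustrative only.
LEVELS: List[Tuple[str, List[Check]]] = [
    ("bronze", [lambda c: bool(c.get("owner"))]),
    ("silver", [lambda c: c.get("has_ci") is True]),
    ("gold", [lambda c: c.get("uses_golden_framework") is True]),
]


def certification(component: Component) -> str:
    """Return the highest level reached without failing any lower level."""
    achieved = "none"
    for name, checks in LEVELS:
        if not all(check(component) for check in checks):
            break  # a failed level blocks every level above it
        achieved = name
    return achieved


service = {"owner": "squad-a", "has_ci": True}
print(certification(service))
```

Making each level depend on the ones below it gives teams a clear next step: the first failing check is the one to fix.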

 

Beyond individual teams, the tool provides department-wide visibility into overall tech health and certification levels. It also tracks their progress and how each department aligns with organizational goals.

 

This tool and all the others, such as the Spotify tech radar, software templates, and CI/CD plugins, are built as plugins for Backstage, Spotify’s developer portal. Backstage serves as a central hub for tools, creating a single pane of glass for platform engineering solutions. This interoperability unlocks compound value across the organization by simplifying the developer experience.

 

Fleet Shift: Automating large-scale code changes 

Once Spotify standardized and centralized its tech ecosystem, the next step was automating maintenance. Enter Fleet Shift, a tool designed to perform large-scale code changes across the fleet.

 

How Fleet Shift works

Fleet Shift automates updates across repositories by executing shifts — scripts that modify code at scale. Here’s how it works:

 

  • Engineers define a shift, that is, a set of instructions for code changes, and package it as a Docker container.
  • Fleet Shift executes the shift as a Kubernetes job, cloning repositories, applying transformations, and creating pull requests.
  • Fleet Shift listens for errors or success. Automated checks ensure that changes are safe, and PRs can be merged automatically if CI/CD tests pass.
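
To make the idea of a shift concrete, here is a minimal sketch of the transformation step only, assuming repositories have already been cloned to local directories. It is not Fleet Shift's actual interface: the `build.properties` file, the `service-framework.version` key, and the function names are hypothetical, and the "pull-request" outcome is just a label standing in for the step where a real system would open a PR.

```python
import re
from pathlib import Path
from typing import Dict, List

# Hypothetical target: a pinned framework version in a properties file.
VERSION_LINE = re.compile(r"^(service-framework\.version=)(\S+)$", re.MULTILINE)


def apply_shift(repo: Path, new_version: str) -> bool:
    """Rewrite the pinned version in one repo; return True if a file changed."""
    target = repo / "build.properties"
    if not target.exists():
        return False
    original = target.read_text()
    updated = VERSION_LINE.sub(lambda m: m.group(1) + new_version, original)
    if updated == original:
        return False
    target.write_text(updated)
    return True


def run_shift(repos: List[Path], new_version: str) -> Dict[str, str]:
    """Apply the shift to every repo and record one outcome per repo."""
    outcomes = {}
    for repo in repos:
        try:
            changed = apply_shift(repo, new_version)
            # A real system would open a pull request here and let CI decide.
            outcomes[repo.name] = "pull-request" if changed else "no-change"
        except OSError as exc:
            outcomes[repo.name] = f"error: {exc}"
    return outcomes
```

Recording an outcome per repository, rather than stopping at the first failure, mirrors the "listens for errors or success" step: one broken repository should not block the rest of the fleet.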

 

This automation has dramatically reduced migration timelines. For example, upgrading Spotify’s internal service framework used to take 200 days — Fleet Shift reduced this to less than seven days. 

 

Technical requirements for fleet management 

Achieving fleet management at scale requires a robust technical foundation that includes the following.

 

Declarative infrastructure

Spotify transitioned to declarative infrastructure to simplify configuration management across thousands of services. To apply automated changes to the infrastructure, the configuration needed to be data (e.g. JSON, YAML) rather than code (e.g. TypeScript, HCL). Declarative infrastructure enables:

 

  • Automated updates across all services 
  • Consistency in provisioning and configuration 
  • Easier automation of infrastructure changes 
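
The reason data is easier to automate than code is that a tool can parse it, change one field, and write it back without understanding any program logic. A minimal sketch, assuming a hypothetical JSON schema with a `runtime` field and made-up service names:

```python
import json


def upgrade_runtime(config_text: str, old: str, new: str) -> str:
    """Bump the runtime from old to new; leave other configs untouched."""
    config = json.loads(config_text)
    if config.get("runtime") == old:
        config["runtime"] = new
    return json.dumps(config, indent=2, sort_keys=True)


# Hypothetical fleet: service name -> declarative config stored as JSON text.
fleet = {
    "playlist-api": '{"runtime": "java17", "replicas": 3}',
    "search-api": '{"runtime": "java21", "replicas": 5}',
}

updated = {
    name: upgrade_runtime(text, "java17", "java21")
    for name, text in fleet.items()
}
print(updated["playlist-api"])
```

The same change against configuration written as code would require parsing and rewriting a program, which is much harder to do safely across thousands of services.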

 

Version control and dependency management 

Managing dependencies across thousands of services requires:

 

  • Centralized version control and dependency management to track updates. An example of this is Spotify's BOM, which helps manage the versions of Spotify's most important dependencies.
  • Automated dependency upgrades to ensure that security patches and framework updates are applied seamlessly.
  • Golden Tech adoption, where standardized libraries and frameworks reduce fragmentation.
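
A BOM (bill of materials) centralizes versions so that services name their dependencies but do not pin them. The sketch below illustrates the concept only; the dependency names and versions are hypothetical and do not reflect the contents of Spotify's BOM.

```python
from typing import Dict, List

# Hypothetical BOM: one centrally managed map of dependency -> version.
BOM: Dict[str, str] = {
    "http-client": "4.2.0",
    "metrics-lib": "1.9.3",
    "service-framework": "7.0.1",
}


def resolve(dependencies: List[str], bom: Dict[str, str]) -> Dict[str, str]:
    """Look up the managed version for each dependency a service declares."""
    missing = [name for name in dependencies if name not in bom]
    if missing:
        raise KeyError(f"not managed by the BOM: {missing}")
    return {name: bom[name] for name in dependencies}


# A service declares what it needs, not which version.
print(resolve(["http-client", "service-framework"], BOM))
```

Because every service resolves against the same map, upgrading a dependency fleet-wide becomes a single change to the BOM followed by automated rebuilds.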

 

CI/CD at scale

With hundreds to thousands of migrations per year, Spotify reimagined its CI/CD platform to support:

 

  • Massive build and deployment volumes 
  • Automated testing pipelines to validate changes 
  • Continuous integration, ensuring services remain in a golden state

 

Cultural shift: Trusting automation

One of the biggest challenges wasn’t building the software — it was changing the culture. Engineers had to trust automated updates happening without their direct involvement. 

 

To make this work, Spotify emphasized strong engineering practices. Test coverage and automation are non-negotiable. Engineers must write tests to catch future issues, an internal principle known as the Beyoncé Rule: “If you liked it, then you should have put a test on it.”

 

Additionally, the infrastructure platform team had to shift from passive to active product ownership. Instead of merely introducing new platform capabilities, it now owns adoption and uses fleet management to roll out changes that delight the engineers who use its platform.

 

The results: Automation at scale

Fleet Shift has authored over 1.8 million contributions, with a 3:1 ratio of bot contributions to human contributions, and climbing. 

 

Shifts range from simple version bumps to complex multi-repo transformations. For example, when Spotify discovered a critical security vulnerability in Log4j, Fleet Shift updated 80% of the fleet in less than 11 hours — a task that would have taken weeks or months manually. 

 

The future of fleet management at Spotify 

Spotify continues to refine fleet management, focusing on:

 

  • Cleaning up production environments by retiring experimental and deprecated software to reduce cloud costs and security risks 
  • Exploring a move from a polyrepo world to a monorepo structure for better dependency management
  • Researching how LLMs and AI can further reduce engineering toil

 

Lessons for technology leaders 

Spotify’s journey offers useful insights for organizations managing software at scale, including: 

 

  • Standardization accelerates automation. Define a Golden Tech framework. 
  • Declarative infrastructure simplifies updates. Use data-based configurations. 
  • CI/CD must scale with automation. Optimize pipelines for frequent deployments. 
  • Test automation is non-negotiable. Ensure reliability across all changes. 
  • Cultural change is just as important as technical change. Be disciplined about product thinking to earn the engineers’ trust in automation.
  • Invest in developer experience. Tools that centralize resources and streamline workflows make it easier for engineers to adopt new standards and practices. A great developer experience accelerates adoption and reduces resistance to change.

 

Spotify’s fleet management journey is a testament to the power of standardization, automation, and cultural change in managing software at scale. By reducing engineering toil and empowering developers, you can build a foundation for long-term innovation.

 

If there’s one takeaway from Spotify’s story, it’s this: Automation isn’t just a tool — it’s a mindset. Embracing it can transform how your organization builds, maintains, and innovates at scale.

 

 

 

To learn more, check out the first article in this series, Fleet First for Better Developer Experience and Faster Software Delivery.
